[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7631: ARROW-8651: [Python][Dataset] Support pickling of Dataset objects
jorisvandenbossche commented on a change in pull request #7631:
URL: https://github.com/apache/arrow/pull/7631#discussion_r450390113

File path: python/pyarrow/tests/test_dataset.py

```diff
@@ -612,6 +613,83 @@ def test_make_fragment(multisourcefs):
     assert row_group_fragment.row_groups == [ds.RowGroupInfo(0)]
 
 
+def test_make_csv_fragment_from_buffer():
+    content = textwrap.dedent("""
+        alpha,num,animal
+        a,12,dog
+        b,11,cat
+        c,10,rabbit
+    """)
+    buffer = pa.py_buffer(content.encode('utf-8'))
+
+    csv_format = ds.CsvFileFormat()
+    fragment = csv_format.make_fragment(buffer)
+
+    expected = pa.table([['a', 'b', 'c'],
+                         [12, 11, 10],
+                         ['dog', 'cat', 'rabbit']],
+                        names=['alpha', 'num', 'animal'])
+    assert fragment.to_table().equals(expected)
+
+    pickled = pickle.loads(pickle.dumps(fragment))
+    assert pickled.to_table().equals(fragment.to_table())
+
+
+@pytest.mark.parquet
+def test_make_parquet_fragment_from_buffer():
+    import pyarrow.parquet as pq
+
+    cases = [
+        (
+            pa.table(
+                [
+                    ['a', 'b', 'c'],
+                    [12, 11, 10],
+                    ['dog', 'cat', 'rabbit']
```

Review comment:
You could reduce the used vertical space here a bit by only defining the list of arrays here, and doing `table = pa.table(arrays, names=['alpha', 'num', 'animal'])` in the loop.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
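The shape of the refactor the reviewer suggests can be sketched as follows. `make_table` is a hypothetical stand-in for `pa.table(arrays, names=...)`, used only so the sketch runs without pyarrow installed:

```python
# Hypothetical stand-in for pa.table(arrays, names=...): it just pairs
# column names with column arrays.
def make_table(arrays, names):
    return dict(zip(names, arrays))

# Define the column arrays once, outside the loop over test cases ...
arrays = [
    ['a', 'b', 'c'],
    [12, 11, 10],
    ['dog', 'cat', 'rabbit'],
]

# ... and build the table inside the loop, as the review suggests:
for _case in range(2):
    table = make_table(arrays, names=['alpha', 'num', 'animal'])
    assert table['alpha'] == ['a', 'b', 'c']
```

This keeps the bulky array literals out of the per-case tuples, shrinking the vertical footprint of the test.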
[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7631: ARROW-8651: [Python][Dataset] Support pickling of Dataset objects
jorisvandenbossche commented on a change in pull request #7631:
URL: https://github.com/apache/arrow/pull/7631#discussion_r449567025

File path: python/pyarrow/tests/test_dataset.py

```diff
@@ -635,6 +635,37 @@ def test_make_fragment_from_buffer():
     assert pickled.to_table().equals(fragment.to_table())
 
 
+@pytest.mark.parquet
+def test_make_parquet_fragment_from_buffer():
+    import pyarrow.parquet as pq
+
+    table = pa.table([['a', 'b', 'c'],
+                      [12, 11, 10],
+                      ['dog', 'cat', 'rabbit']],
+                     names=['alpha', 'num', 'animal'])
+
+    out = pa.BufferOutputStream()
+    pq.write_table(table, out)
+
+    buffer = out.getvalue()
+
+    formats = [
+        ds.ParquetFileFormat(),
+        ds.ParquetFileFormat(
+            read_options=ds.ParquetReadOptions(
+                use_buffered_stream=True,
+                buffer_size=4096,
```

Review comment:
We probably need to use an option that actually alters the output to be able to catch a failure, e.g. `dictionary_columns`.
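The reviewer's point is that a pickle-roundtrip test comparing decoded tables can pass even when read options are silently dropped, as long as those options (like `use_buffered_stream`/`buffer_size`) only affect *how* data is read, not *what* comes out. A minimal stdlib-only sketch, with a toy `ToyFormat` class and a deliberately `broken_roundtrip` standing in for a buggy `__reduce__` (none of these names are from pyarrow):

```python
# Toy format with two kinds of options: buffer_size changes only how data
# would be read, while dictionary_encode changes the decoded output itself
# (analogous to ParquetReadOptions' dictionary_columns).
class ToyFormat:
    def __init__(self, buffer_size=8192, dictionary_encode=False):
        self.buffer_size = buffer_size
        self.dictionary_encode = dictionary_encode

    def read(self, values):
        if self.dictionary_encode:
            # Represent the column as (dictionary, indices).
            dictionary = sorted(set(values))
            return (dictionary, [dictionary.index(v) for v in values])
        return list(values)


def broken_roundtrip(fmt):
    # Stand-in for a buggy pickle roundtrip that drops every option.
    return ToyFormat()


data = ['dog', 'cat', 'dog']

# buffer_size does not affect the output, so this comparison passes and
# the lost option goes unnoticed:
original = ToyFormat(buffer_size=4096)
assert original.read(data) == broken_roundtrip(original).read(data)

# An option that alters the output does expose the bug:
original = ToyFormat(dictionary_encode=True)
assert original.read(data) != broken_roundtrip(original).read(data)
```

This is why the test should configure something like `dictionary_columns`, whose effect is visible in the resulting table, rather than pure I/O tuning knobs.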
[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7631: ARROW-8651: [Python][Dataset] Support pickling of Dataset objects
jorisvandenbossche commented on a change in pull request #7631:
URL: https://github.com/apache/arrow/pull/7631#discussion_r449556844

File path: python/pyarrow/_dataset.pyx

```diff
@@ -773,6 +789,14 @@ cdef class FileFragment(Fragment):
         Fragment.init(self, sp)
         self.file_fragment = sp.get()
 
+    def __reduce__(self):
+        buffer = self.buffer
+        return self.format.make_fragment, (
```

Review comment:
By specifying here the method on a format object (`format.make_fragment`), does it also automatically pickle the `format` *instance*?

File path: python/pyarrow/_dataset.pyx

```diff
@@ -887,6 +911,14 @@ cdef class ParquetFileFragment(FileFragment):
         FileFragment.init(self, sp)
         self.parquet_file_fragment = sp.get()
 
+    def __reduce__(self):
+        return self.format.make_fragment, (
+            self.path,
```

Review comment:
I suppose you might need to do the same `self.path if buffer is None else buffer,` here as you did for `FileFragment`?

File path: python/pyarrow/_dataset.pyx (same hunk as the first comment above)

Review comment:
We should maybe ensure this by testing a pickling roundtrip for a case that specifies read params in the ParquetFileFormat object.