[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7631: ARROW-8651: [Python][Dataset] Support pickling of Dataset objects
jorisvandenbossche commented on a change in pull request #7631:
URL: https://github.com/apache/arrow/pull/7631#discussion_r450390113

File path: python/pyarrow/tests/test_dataset.py

```diff
@@ -612,6 +613,83 @@ def test_make_fragment(multisourcefs):
     assert row_group_fragment.row_groups == [ds.RowGroupInfo(0)]
 
 
+def test_make_csv_fragment_from_buffer():
+    content = textwrap.dedent("""
+        alpha,num,animal
+        a,12,dog
+        b,11,cat
+        c,10,rabbit
+    """)
+    buffer = pa.py_buffer(content.encode('utf-8'))
+
+    csv_format = ds.CsvFileFormat()
+    fragment = csv_format.make_fragment(buffer)
+
+    expected = pa.table([['a', 'b', 'c'],
+                         [12, 11, 10],
+                         ['dog', 'cat', 'rabbit']],
+                        names=['alpha', 'num', 'animal'])
+    assert fragment.to_table().equals(expected)
+
+    pickled = pickle.loads(pickle.dumps(fragment))
+    assert pickled.to_table().equals(fragment.to_table())
+
+
+@pytest.mark.parquet
+def test_make_parquet_fragment_from_buffer():
+    import pyarrow.parquet as pq
+
+    cases = [
+        (
+            pa.table(
+                [
+                    ['a', 'b', 'c'],
+                    [12, 11, 10],
+                    ['dog', 'cat', 'rabbit']
```

Review comment:
You could reduce the used vertical space here a bit by only defining the list of arrays here, and doing `table = pa.table(arrays, names=['alpha', 'num', 'animal'])` in the loop.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
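The shape of the refactor the reviewer suggests can be sketched as follows. `make_table` is a hypothetical stand-in for `pa.table(arrays, names=...)`, used only so the sketch runs without pyarrow installed:

```python
# Hypothetical stand-in for pa.table(arrays, names=...): it just pairs
# column names with column arrays.
def make_table(arrays, names):
    return dict(zip(names, arrays))

# Define the column arrays once, outside the loop over test cases ...
arrays = [
    ['a', 'b', 'c'],
    [12, 11, 10],
    ['dog', 'cat', 'rabbit'],
]

# ... and build the table inside the loop, as the review suggests:
for _case in range(2):
    table = make_table(arrays, names=['alpha', 'num', 'animal'])
    assert table['alpha'] == ['a', 'b', 'c']
```

This keeps the bulky array literals out of the per-case tuples, shrinking the vertical footprint of the test.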
[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7631: ARROW-8651: [Python][Dataset] Support pickling of Dataset objects
jorisvandenbossche commented on a change in pull request #7631:
URL: https://github.com/apache/arrow/pull/7631#discussion_r449567025

File path: python/pyarrow/tests/test_dataset.py

```diff
@@ -635,6 +635,37 @@ def test_make_fragment_from_buffer():
     assert pickled.to_table().equals(fragment.to_table())
 
 
+@pytest.mark.parquet
+def test_make_parquet_fragment_from_buffer():
+    import pyarrow.parquet as pq
+
+    table = pa.table([['a', 'b', 'c'],
+                      [12, 11, 10],
+                      ['dog', 'cat', 'rabbit']],
+                     names=['alpha', 'num', 'animal'])
+
+    out = pa.BufferOutputStream()
+    pq.write_table(table, out)
+
+    buffer = out.getvalue()
+
+    formats = [
+        ds.ParquetFileFormat(),
+        ds.ParquetFileFormat(
+            read_options=ds.ParquetReadOptions(
+                use_buffered_stream=True,
+                buffer_size=4096,
```

Review comment:
We probably need to use an option that actually alters the output to be able to catch a failure, e.g. `dictionary_columns`.
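The reviewer's point is that a pickle-roundtrip test comparing decoded tables can pass even when read options are silently dropped, as long as those options (like `use_buffered_stream`/`buffer_size`) only affect *how* data is read, not *what* comes out. A minimal stdlib-only sketch, with a toy `ToyFormat` class and a deliberately `broken_roundtrip` standing in for a buggy `__reduce__` (none of these names are from pyarrow):

```python
# Toy format with two kinds of options: buffer_size changes only how data
# would be read, while dictionary_encode changes the decoded output itself
# (analogous to ParquetReadOptions' dictionary_columns).
class ToyFormat:
    def __init__(self, buffer_size=8192, dictionary_encode=False):
        self.buffer_size = buffer_size
        self.dictionary_encode = dictionary_encode

    def read(self, values):
        if self.dictionary_encode:
            # Represent the column as (dictionary, indices).
            dictionary = sorted(set(values))
            return (dictionary, [dictionary.index(v) for v in values])
        return list(values)


def broken_roundtrip(fmt):
    # Stand-in for a buggy pickle roundtrip that drops every option.
    return ToyFormat()


data = ['dog', 'cat', 'dog']

# buffer_size does not affect the output, so this comparison passes and
# the lost option goes unnoticed:
original = ToyFormat(buffer_size=4096)
assert original.read(data) == broken_roundtrip(original).read(data)

# An option that alters the output does expose the bug:
original = ToyFormat(dictionary_encode=True)
assert original.read(data) != broken_roundtrip(original).read(data)
```

This is why the test should configure something like `dictionary_columns`, whose effect is visible in the resulting table, rather than pure I/O tuning knobs.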
[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7631: ARROW-8651: [Python][Dataset] Support pickling of Dataset objects
jorisvandenbossche commented on a change in pull request #7631:
URL: https://github.com/apache/arrow/pull/7631#discussion_r449556844

File path: python/pyarrow/_dataset.pyx

```diff
@@ -773,6 +789,14 @@ cdef class FileFragment(Fragment):
         Fragment.init(self, sp)
         self.file_fragment = sp.get()
 
+    def __reduce__(self):
+        buffer = self.buffer
+        return self.format.make_fragment, (
```

Review comment:
By specifying here the method on a format object (`format.make_fragment`), does it also automatically pickle the `format` *instance*?

File path: python/pyarrow/_dataset.pyx

```diff
@@ -887,6 +911,14 @@ cdef class ParquetFileFragment(FileFragment):
         FileFragment.init(self, sp)
         self.parquet_file_fragment = sp.get()
 
+    def __reduce__(self):
+        return self.format.make_fragment, (
+            self.path,
```

Review comment:
I suppose you might need to do the same `self.path if buffer is None else buffer,` here as you did for `FileFragment`?

File path: python/pyarrow/_dataset.pyx (same hunk as the first comment above)

Review comment:
We should maybe ensure this by testing a pickling roundtrip for a case that specifies read params in the ParquetFileFormat object.