ldacey commented on PR #37868:
URL: https://github.com/apache/arrow/pull/37868#issuecomment-1913666960
Yes, I changed `fake_size` to -9999 and reran it, and it still worked. But
since I am not using S3 (I only have access to GCS and ADLSgen2), perhaps the
file size is just ignored entirely.
```python
ic| pa.__version__: '15.0.0'
ic| fs: <gcsfs.core.GCSFileSystem object at 0x7f55e8b88a10>
ic| dataset.filesystem: <pyarrow._fs.PyFileSystem object at 0x7f562f099230>
ic| table.num_rows: 23491
ic| dataset.to_table().equals(original_dataset.to_table()): True
ic| actual_size: 4237841, fake_size: -9999
ic| dataset.to_table().equals(test.to_table()): True
```
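For context, that output came from roughly the following shape of check. This is only a minimal sketch, assuming a hypothetical GCS path (`my-bucket/my_dataset/part-0.parquet`) and a single parquet file rather than the real dataset used above:
```python
import gcsfs
import pyarrow.dataset as ds
from pyarrow.fs import PyFileSystem, FSSpecHandler

# Hypothetical path; the real test used an existing parquet dataset on GCS.
path = "my-bucket/my_dataset/part-0.parquet"

fs = gcsfs.GCSFileSystem()
filesystem = PyFileSystem(FSSpecHandler(fs))

fmt = ds.ParquetFileFormat()
original = ds.dataset(path, format=fmt, filesystem=filesystem)

actual_size = fs.info(path)["size"]
fake_size = -9999

# Build a fragment with a deliberately wrong file_size (pyarrow >= 15.0)
# and wrap it in a dataset so the two reads can be compared.
fragment = fmt.make_fragment(path, filesystem=filesystem, file_size=fake_size)
test = ds.FileSystemDataset([fragment], original.schema, fmt, filesystem)

print(original.to_table().equals(test.to_table()))  # True in the run above
```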
My original plan was to take a look at deltalake (the delta-rs library), which
already uses `make_fragment()`. Since the transaction log
(`get_add_actions()`) has the actual file sizes, we could pass these to
`make_fragment()` for some potential efficiency gain, correct?
```python
import pyarrow
from packaging.version import Version
from pyarrow.dataset import ParquetFileFormat, ParquetFragmentScanOptions

# make_fragment() only accepts file_size= on pyarrow >= 15.0;
# compare parsed versions rather than raw strings.
supports_file_size = Version(pyarrow.__version__) >= Version("15.0")

if not filesystem or supports_file_size:
    # Map each file path to its size from the Delta transaction log.
    add_actions = self.get_add_actions().to_pydict()
    file_sizes = dict(zip(add_actions["path"], add_actions["size_bytes"]))

format = ParquetFileFormat(
    read_options=parquet_read_options,
    default_fragment_scan_options=ParquetFragmentScanOptions(pre_buffer=True),
)
fragments = []
for file, part_expression in self._table.dataset_partitions(
    self.schema().to_pyarrow(), partitions
):
    if supports_file_size:
        # Pass the known size so the scanner does not have to stat the file.
        fragment = format.make_fragment(
            file,
            filesystem=filesystem,
            partition_expression=part_expression,
            file_size=file_sizes[file],
        )
    else:
        fragment = format.make_fragment(
            file,
            filesystem=filesystem,
            partition_expression=part_expression,
        )
    fragments.append(fragment)
```
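The fragments list would then be wrapped into a scannable dataset the same way delta-rs does today, so the scanner can skip per-file size lookups against GCS/ADLS. A rough sketch of that last step, reusing the names from the snippet above:
```python
import pyarrow.dataset as ds

# Sketch only: wrap the fragments into a dataset, mirroring the existing
# delta-rs approach; `schema` is the table schema converted to pyarrow.
schema = self.schema().to_pyarrow()
dataset = ds.FileSystemDataset(fragments, schema, format, filesystem)
```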