ldacey commented on PR #37868:
URL: https://github.com/apache/arrow/pull/37868#issuecomment-1913666960
Yes, I changed `fake_size` to -9999 and reran it, and it still worked. But
since I am not using S3 (I only have access to GCS and ADLSgen2), perhaps the
file size is just ignored entirely.
```python
ic| pa.__version__: '15.0.0'
ic| fs: <gcsfs.core.GCSFileSystem object at 0x7f55e8b88a10>
ic| dataset.filesystem: <pyarrow._fs.PyFileSystem object at 0x7f562f099230>
ic| table.num_rows: 23491
ic| dataset.to_table().equals(original_dataset.to_table()): True
ic| actual_size: 4237841, fake_size: -9999
ic| dataset.to_table().equals(test.to_table()): True
```
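For context, that output came from roughly the following shape of check. This is only a minimal sketch, assuming a hypothetical GCS path (`my-bucket/my_dataset/part-0.parquet`) and a single parquet file rather than the real dataset used above:
```python
import gcsfs
import pyarrow.dataset as ds
from pyarrow.fs import PyFileSystem, FSSpecHandler

# Hypothetical path; the real test used an existing parquet dataset on GCS.
path = "my-bucket/my_dataset/part-0.parquet"

fs = gcsfs.GCSFileSystem()
filesystem = PyFileSystem(FSSpecHandler(fs))

fmt = ds.ParquetFileFormat()
original = ds.dataset(path, format=fmt, filesystem=filesystem)

actual_size = fs.info(path)["size"]
fake_size = -9999

# Build a fragment with a deliberately wrong file_size (pyarrow >= 15.0)
# and wrap it in a dataset so the two reads can be compared.
fragment = fmt.make_fragment(path, filesystem=filesystem, file_size=fake_size)
test = ds.FileSystemDataset([fragment], original.schema, fmt, filesystem)

print(original.to_table().equals(test.to_table()))  # True in the run above
```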
My original plan was to take a look at deltalake (the delta-rs library), which
already uses `make_fragment()`. Since the transaction log
(`get_add_actions()`) has the actual file sizes, we could pass these to
`make_fragment()` for some potential efficiency gain, correct?
```python
import pyarrow
from packaging.version import Version
from pyarrow.dataset import ParquetFileFormat, ParquetFragmentScanOptions

# make_fragment() only accepts file_size= on pyarrow >= 15.0;
# compare parsed versions rather than raw strings.
supports_file_size = Version(pyarrow.__version__) >= Version("15.0")

if not filesystem or supports_file_size:
    # Map each file path to its size from the Delta transaction log.
    add_actions = self.get_add_actions().to_pydict()
    file_sizes = dict(zip(add_actions["path"], add_actions["size_bytes"]))

format = ParquetFileFormat(
    read_options=parquet_read_options,
    default_fragment_scan_options=ParquetFragmentScanOptions(pre_buffer=True),
)
fragments = []
for file, part_expression in self._table.dataset_partitions(
    self.schema().to_pyarrow(), partitions
):
    if supports_file_size:
        # Pass the known size so the scanner does not have to stat the file.
        fragment = format.make_fragment(
            file,
            filesystem=filesystem,
            partition_expression=part_expression,
            file_size=file_sizes[file],
        )
    else:
        fragment = format.make_fragment(
            file,
            filesystem=filesystem,
            partition_expression=part_expression,
        )
    fragments.append(fragment)
```
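The fragments list would then be wrapped into a scannable dataset the same way delta-rs does today, so the scanner can skip per-file size lookups against GCS/ADLS. A rough sketch of that last step, reusing the names from the snippet above:
```python
import pyarrow.dataset as ds

# Sketch only: wrap the fragments into a dataset, mirroring the existing
# delta-rs approach; `schema` is the table schema converted to pyarrow.
schema = self.schema().to_pyarrow()
dataset = ds.FileSystemDataset(fragments, schema, format, filesystem)
```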