ldacey commented on PR #37868:
URL: https://github.com/apache/arrow/pull/37868#issuecomment-1913652787
Here are some more details:
```python
import pyarrow as pa
import pyarrow.dataset as ds
from icecream import ic

# `fs` (a gcsfs.GCSFileSystem) and `gcs_path` are defined elsewhere (not shown).
ic(pa.__version__)
ic(fs)
dataset = ds.dataset(gcs_path, filesystem=fs)
ic(dataset.filesystem)
table = dataset.to_table()
ic(table.num_rows)

# Baseline: rebuild the dataset from its own, unmodified fragments.
file_format = ds.ParquetFileFormat()
paths = dataset.files
original_fragments = list(dataset.get_fragments())
original_dataset = ds.FileSystemDataset(
    original_fragments, format=file_format, schema=table.schema,
    filesystem=dataset.filesystem
)
assert dataset.to_table().equals(original_dataset.to_table())
ic(dataset.to_table().equals(original_dataset.to_table()))

# Rebuild again, passing a deliberately wrong file_size for every fragment.
wrong_fragments = []
for path in dataset.files:
    fake_size = 55555555555
    actual_size = dataset.filesystem.get_file_info(path).size
    ic(actual_size, fake_size)
    fragment = file_format.make_fragment(
        path, filesystem=dataset.filesystem, file_size=fake_size
    )
    wrong_fragments.append(fragment)
test = ds.FileSystemDataset(
    wrong_fragments, format=file_format, schema=table.schema,
    filesystem=dataset.filesystem
)
assert dataset.to_table().equals(test.to_table())
ic(dataset.to_table().equals(test.to_table()))
___
ic| pa.__version__: '15.0.0'
ic| fs: <gcsfs.core.GCSFileSystem object at 0x7f13ea37dd90>
ic| dataset.filesystem: <pyarrow._fs.PyFileSystem object at 0x7f13b5b2d130>
ic| table.num_rows: 23491
ic| dataset.to_table().equals(original_dataset.to_table()): True
ic| actual_size: 4237841, fake_size: 55555555555
ic| dataset.to_table().equals(test.to_table()): True
```
I was using `dataset.filesystem`, which is a pyarrow filesystem (generated from
the gcsfs filesystem). In that example, I set the file size to an arbitrary
large number. As you can see, it does not match the real size, yet the table
still builds and matches the original table. I expected that to fail, unless I
am doing something wrong. I do have pyarrow 15 installed, though.