eeroel commented on PR #37868:
URL: https://github.com/apache/arrow/pull/37868#issuecomment-1913656561

   > Here are some more details:
   > 
   > ```python
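   > import pyarrow as pa
   > import pyarrow.dataset as ds
   > from icecream import ic
   > 
   > # `fs` (a gcsfs filesystem) and `gcs_path` are defined elsewhere.
   > ic(pa.__version__)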
   > ic(fs)
   > dataset = ds.dataset(gcs_path, filesystem=fs)
   > ic(dataset.filesystem)
   > 
   > table = dataset.to_table()
   > ic(table.num_rows)
   > 
   > file_format = ds.ParquetFileFormat()
   > paths = dataset.files
   > 
   > original_fragments = [frag for frag in dataset.get_fragments()]
   > original_dataset = ds.FileSystemDataset(
   >     original_fragments, format=file_format, schema=table.schema,
   >     filesystem=dataset.filesystem
   > )
   > assert dataset.to_table().equals(original_dataset.to_table())
   > ic(dataset.to_table().equals(original_dataset.to_table()))
   > 
   > wrong_fragments = [] 
   > for path in dataset.files:
   >     fake_size = 55555555555
   >     actual_size = dataset.filesystem.get_file_info(path).size
   >     ic(actual_size, fake_size)
   >     fragment = file_format.make_fragment(
   >         path, filesystem=dataset.filesystem, file_size=fake_size
   >     )
   >     wrong_fragments.append(fragment)
   >     
   > test = ds.FileSystemDataset(
   >     wrong_fragments, format=file_format, schema=table.schema,
   >     filesystem=dataset.filesystem
   > )
   > 
   > assert dataset.to_table().equals(test.to_table())
   > ic(dataset.to_table().equals(test.to_table()))
   > 
   > ___
   > 
   > ic| pa.__version__: '15.0.0'
   > ic| fs: <gcsfs.core.GCSFileSystem object at 0x7f13ea37dd90>
   > ic| dataset.filesystem: <pyarrow._fs.PyFileSystem object at 0x7f13b5b2d130>
   > ic| table.num_rows: 23491
   > ic| dataset.to_table().equals(original_dataset.to_table()): True
   > ic| actual_size: 4237841, fake_size: 55555555555
   > ic| dataset.to_table().equals(test.to_table()): True
   > ```
   > 
   > I was using dataset.filesystem, which is a pyarrow filesystem (generated
   > from the gcsfs filesystem). In that example, I set the file size to an
   > arbitrary large number. As you can see, it does not match the real size,
   > yet the table is still constructed and matches the original table. I
   > thought that would fail, unless I am doing something wrong. I do have
   > pyarrow 15 installed, though.
   
   Do you also get this result if you set `fake_size` to a negative number?
   With a too-large size this is expected, because the size is not actually
   used in this case (it is currently only used by Arrow's S3 filesystem
   implementation).


