ldacey commented on PR #37868:
URL: https://github.com/apache/arrow/pull/37868#issuecomment-1913652787

   Here are some more details:
   
   ```python
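   import pyarrow as pa
   import pyarrow.dataset as ds
   from icecream import ic

   # fs (a gcsfs.GCSFileSystem) and gcs_path are assumed to be defined already
   ic(pa.__version__)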
   ic(fs)
   dataset = ds.dataset(gcs_path, filesystem=fs)
   ic(dataset.filesystem)
   
   table = dataset.to_table()
   ic(table.num_rows)
   
   file_format = ds.ParquetFileFormat()
   paths = dataset.files
   
   original_fragments = list(dataset.get_fragments())
   original_dataset = ds.FileSystemDataset(
       original_fragments,
       format=file_format,
       schema=table.schema,
       filesystem=dataset.filesystem,
   )
   assert dataset.to_table().equals(original_dataset.to_table())
   ic(dataset.to_table().equals(original_dataset.to_table()))
   
   # Rebuild the fragments, deliberately passing a file size that does not
   # match the real size of each file
   wrong_fragments = []
   for path in dataset.files:
       fake_size = 55555555555
       actual_size = dataset.filesystem.get_file_info(path).size
       ic(actual_size, fake_size)
       fragment = file_format.make_fragment(
           path, filesystem=dataset.filesystem, file_size=fake_size
       )
       wrong_fragments.append(fragment)
       
   test = ds.FileSystemDataset(
       wrong_fragments,
       format=file_format,
       schema=table.schema,
       filesystem=dataset.filesystem,
   )
   
   assert dataset.to_table().equals(test.to_table())
   ic(dataset.to_table().equals(test.to_table()))
   
   ___
   
   ic| pa.__version__: '15.0.0'
   ic| fs: <gcsfs.core.GCSFileSystem object at 0x7f13ea37dd90>
   ic| dataset.filesystem: <pyarrow._fs.PyFileSystem object at 0x7f13b5b2d130>
   ic| table.num_rows: 23491
   ic| dataset.to_table().equals(original_dataset.to_table()): True
   ic| actual_size: 4237841, fake_size: 55555555555
   ic| dataset.to_table().equals(test.to_table()): True
   
   ```
   
   I was using `dataset.filesystem`, which is a pyarrow filesystem (generated from
   the gcsfs filesystem). In this example I set the file size to an arbitrary large
   number (55555555555) that does not match the real size (4237841), yet the table
   still constructs and equals the original table. I expected that to fail, unless
   I am doing something wrong. I do have pyarrow 15.0.0 installed, though.

