eeroel commented on PR #37868: URL: https://github.com/apache/arrow/pull/37868#issuecomment-1913513875
> Should this fail with bad file_size values? I tried changing the file_size to -10000 and it still succeeded for me. I am not sure if GCS uses this information, though (I am using fsspec for the filesystem, which uses the gcsfs library). I think I am using it wrong, because I cannot get it to fail in general, regardless of whether the size I input is the real size of the file.

Did you also try to create a dataset with those fragments and read it? There's no validation when the fragments are constructed, but it should fail when the Parquet reader starts reading the file, here: https://github.com/apache/arrow/blob/21ffd82c05c93b873ae3c27128eb8604ed0c735f/cpp/src/parquet/file_reader.cc#L476. It would make sense to handle zero and negative sizes on the Python side, though.

Regarding fsspec: the file size information only gets used by Arrow's internal filesystem implementations, and I believe it is currently only used for S3.
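
To illustrate the Python-side handling suggested above, here is a minimal sketch of what such a check could look like. The helper name `validate_file_size` and its exact behavior are hypothetical, not part of the PR or the pyarrow API; the idea is just to reject zero and negative sizes up front, instead of deferring the failure to the C++ Parquet reader with a less obvious error.

```python
def validate_file_size(file_size):
    """Hypothetical sketch: validate a user-supplied file_size for a fragment.

    Returns the value unchanged when it is None (size unknown, let the
    filesystem stat the file) or a strictly positive int; raises otherwise.
    """
    if file_size is None:
        return None
    # bool is a subclass of int, so exclude it explicitly
    if not isinstance(file_size, int) or isinstance(file_size, bool):
        raise TypeError(
            f"file_size must be an int, got {type(file_size).__name__}"
        )
    if file_size <= 0:
        raise ValueError(f"file_size must be strictly positive, got {file_size}")
    return file_size
```

With a check like this, a value such as -10000 would fail immediately at fragment-construction time rather than only when the dataset is read.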
