pitrou commented on PR #49451: URL: https://github.com/apache/arrow/pull/49451#issuecomment-3999163862
> Shouldn't the program already reject the input in this case? What is the purpose of the duplicate definition otherwise?

An IPC stream is meant to be read sequentially, so the schema appears at the start of the encoded stream. An IPC file is basically an IPC stream plus a file footer with dedicated metadata for random access (a bit like a ZIP file catalog). The IPC file footer contains a copy of the schema to reduce the number of I/Os required to read the file. The IPC file reader reads directly from the end of the file, ignoring the schema that is stored at the start of the encoded IPC stream.

Validating that the two schemas are identical would require a spurious I/O, while correct files would have identical schemas anyway.

> Also, would it be possible to "fix" the footer to match the original schema?

Ah, you mean use the same schema when comparing the contents? There's no way to tell the IPC file reader API to use a different schema for reading, because it doesn't make sense with valid IPC files.

Moreover, in some cases the different schema will not matter because only a field name changed. But as soon as a more important piece of information has changed (for example a field type, or an additional field), passing the wrong schema to the reader will just fail or return gibberish.
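
To illustrate why the file reader never needs to look at the schema at the start of the stream, here is a minimal sketch of locating the footer from the end of the buffer. It assumes the documented file layout (leading `ARROW1` magic padded to 8 bytes, the stream, the footer, a little-endian int32 footer size, then a trailing `ARROW1` magic). `locate_footer` is a hypothetical helper, not part of any Arrow API, and it does not decode the footer itself.

```python
import struct

MAGIC = b"ARROW1"


def locate_footer(buf: bytes) -> tuple[int, int]:
    """Return (offset, length) of the footer bytes in an IPC file buffer.

    Hypothetical helper: only the trailing 10 bytes are inspected to find
    the footer, so the schema message at the start of the embedded stream
    is never read.
    """
    trailer_len = 4 + len(MAGIC)  # int32 footer size + trailing magic
    if len(buf) < len(MAGIC) + trailer_len:
        raise ValueError("buffer too small to be an Arrow IPC file")
    if not buf.startswith(MAGIC) or not buf.endswith(MAGIC):
        raise ValueError("missing ARROW1 magic")
    # The footer size is stored just before the trailing magic.
    (footer_len,) = struct.unpack_from("<i", buf, len(buf) - trailer_len)
    footer_start = len(buf) - trailer_len - footer_len
    # The footer cannot overlap the 8-byte padded leading magic.
    if footer_len <= 0 or footer_start < 8:
        raise ValueError("invalid footer length")
    return footer_start, footer_len
```

Since nothing here reads the start of the stream, a mismatch between the two schema copies goes unnoticed unless the reader pays for an extra read to compare them.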
