nevi-me commented on pull request #8715: URL: https://github.com/apache/arrow/pull/8715#issuecomment-730640190
> My feeling is that if we need to introduce a different comparison, this often hints that there is useless information on the `DataType` that we should eliminate. If it is useless but needs to stay for some reason, then my suggestion is that we implement a custom `PartialEq` that ignores it, so that there is a common source of truth about whether two datatypes are equal. @jorgecarleitao I've also checked the specification, and it's unclear on what should happen. It looks like the C++ implementation doesn't check the field names (https://github.com/apache/arrow/blob/master/cpp/src/arrow/record_batch.cc). I made the change for IPC integration purposes, as we were failing tests because of field names. I think not checking the names when constructing record batches is fine in that regard (as the tests still pass). I'm trying to think of what the broader implications of this change would be. I'm so far leaning on saying "we need not check the field name when constructing record batches, but for lists and structs, we should continue checking nullability of the `Field`. When writing Parquet files, having one batch's column being nullable, then the next batch's not being nullable, would break some things. I haven't looked at the code yet, but if the change can be isolated to only be used in the `RecordBatch::try_new()` part (at least for now), then it should be a reasonable change. Interestingly, I went down this (or similar route) with data type equality, before realising that we needed a `Field` on lists. One last matter, which might not be a concern, is that in the specification, a `Field` can carry metadata. So, in what conditions would we need to compare field metadata? I haven't looked at `ExtensionArray` yet, but I think the field metadata is used there. > I would personally suggest one of the following: > > 1. Keep the comparison code as is (and include the field name) as there is some reasonable story that it is "part" of the type > 2. Update the `PartialCmp` implementation for `DataType` to ignore Field names for all `DataType` == `DataType` comparisons I like this approach from @alamb ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected]
