nevi-me commented on pull request #8715:
URL: https://github.com/apache/arrow/pull/8715#issuecomment-730640190


   > My feeling is that if we need to introduce a different comparison, this 
often hints that there is useless information on the `DataType` that we should 
eliminate. If it is useless but needs to stay for some reason, then my 
suggestion is that we implement a custom `PartialEq` that ignores it, so that 
there is a common source of truth about whether two datatypes are equal.
   
   @jorgecarleitao I've also checked the specification, and it's unclear on 
what should happen. It looks like the C++ implementation doesn't check the 
field names 
(https://github.com/apache/arrow/blob/master/cpp/src/arrow/record_batch.cc).
   
   I made the change for IPC integration purposes, as we were failing tests 
because of field names.
   I think not checking the names when constructing record batches is fine in 
that regard (as the tests still pass).
   
   I'm trying to think of what the broader implications of this change would be.
   
   I'm so far leaning on saying "we need not check the field name when 
constructing record batches, but for lists and structs, we should continue 
checking nullability of the `Field`. When writing Parquet files, having one 
batch's column being nullable, then the next batch's not being nullable, would 
break some things.
   
   I haven't looked at the code yet, but if the change can be isolated to only 
be used in the `RecordBatch::try_new()` part (at least for now), then it should 
be a reasonable change.
   
   Interestingly, I went down this (or similar route) with data type equality, 
before realising that we needed a `Field` on lists.
   
   One last matter, which might not be a concern, is that in the specification, 
a `Field` can carry metadata. So, in what conditions would we need to compare 
field metadata? I haven't looked at `ExtensionArray` yet, but I think the field 
metadata is used there.
   
   > I would personally suggest one of the following:
   > 
   > 1. Keep the comparison code as is (and include the field name) as there is 
some reasonable story that it is "part" of the type
   > 2. Update the `PartialCmp` implementation for `DataType` to ignore Field 
names for all `DataType` == `DataType` comparisons
   
   I like this approach from @alamb 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to