adamreeve commented on PR #44633:
URL: https://github.com/apache/arrow/pull/44633#issuecomment-2840721451

   It would be nice to get this feature added to speed up field and column 
lookups, but from my understanding there are two main issues:
   * This doesn't account for duplicate field names
   * It adds more technical debt to the inconsistent string comparisons issue 
by continuing the use of `StringComparer.CurrentCulture` (#44650)
   
   I think Curt's suggestion to deprecate exising APIs and introduce new better 
APIs (https://github.com/apache/arrow/issues/44650#issuecomment-2558246991) 
makes sense. This would be fairly disruptive though as it would also require 
deprecating `IArrowRecord.Column` and its implementations in `RecordBatch` and 
`StructArray`, but that's probably still better than making a breaking 
behaviour change.
   
   How about this as a potential way forward?
   * Mark `Schema.GetFieldIndex(string name, IEqualityComparer<string> comparer 
= default)` and `IRecordType.GetFieldIndex` as obsolete
   * Add a new `Schema` and  `IRecordType` method that uses 
`StringComparer.Ordinal` by default. I can't think of a great name, some 
options might be `FieldIndex`, `LookupFieldIndex`, `GetFirstFieldIndex` 
or`GetIndexForField`
   * Add a new `IEnumerable<int> Schema.GetFieldIndices(string name, 
IEqualityComparer<string> comparer = default)` method that also uses 
`StringComparer.Ordinal` by default to support the duplicate field names use 
case.
   * Document all the relevant `Schema` methods and properties to mention what 
type of string comparison they do by default
   * Mark `RecordBatch.Column(string columnName)`, `RecordBatch.Column(string 
columnName, IEqualityComparer<string> comparer)` and 
`IArrowRecord.Column(string columnName, IEqualityComparer<string> comparer = 
default)` as obsolote.
   * Add new `RecordBatch` and `IArrowRecord` methods that use the new `Schema` 
methods that default to `StringComparer.Ordinal`. Again I'm not sure what a 
good name would be, maybe `LookupColumn`, `GetColumn` or `GetColumnByName`
   * If we're making new methods anyway, we should throw `KeyNotFoundException` 
or a new custom exception type (`FieldNotFoundException`?) rather than 
`InvalidOperationException` in the case a field isn't found.
   * Optionally make `_fieldsIndexLookup` initialized lazily so it could be 
used by the new and old methods. It would probably make sense to support 
comparers other than `Ordinal` and `CurrentCulture` too. This could get 
complicated though, we'd need to consider thread safety and what happens if the 
same schema is queried with one comparer and then a different comparer. I can 
imagine that happening when code that uses the new and obsolete methods is 
mixed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to