adamreeve commented on PR #44633: URL: https://github.com/apache/arrow/pull/44633#issuecomment-2840721451
It would be nice to get this feature added to speed up field and column lookups, but from my understanding there are two main issues: * This doesn't account for duplicate field names * It adds more technical debt to the inconsistent string comparisons issue by continuing the use of `StringComparer.CurrentCulture` (#44650) I think Curt's suggestion to deprecate exising APIs and introduce new better APIs (https://github.com/apache/arrow/issues/44650#issuecomment-2558246991) makes sense. This would be fairly disruptive though as it would also require deprecating `IArrowRecord.Column` and its implementations in `RecordBatch` and `StructArray`, but that's probably still better than making a breaking behaviour change. How about this as a potential way forward? * Mark `Schema.GetFieldIndex(string name, IEqualityComparer<string> comparer = default)` and `IRecordType.GetFieldIndex` as obsolete * Add a new `Schema` and `IRecordType` method that uses `StringComparer.Ordinal` by default. I can't think of a great name, some options might be `FieldIndex`, `LookupFieldIndex`, `GetFirstFieldIndex` or`GetIndexForField` * Add a new `IEnumerable<int> Schema.GetFieldIndices(string name, IEqualityComparer<string> comparer = default)` method that also uses `StringComparer.Ordinal` by default to support the duplicate field names use case. * Document all the relevant `Schema` methods and properties to mention what type of string comparison they do by default * Mark `RecordBatch.Column(string columnName)`, `RecordBatch.Column(string columnName, IEqualityComparer<string> comparer)` and `IArrowRecord.Column(string columnName, IEqualityComparer<string> comparer = default)` as obsolote. * Add new `RecordBatch` and `IArrowRecord` methods that use the new `Schema` methods that default to `StringComparer.Ordinal`. Again I'm not sure what a good name would be, maybe `LookupColumn`, `GetColumn` or `GetColumnByName` * If we're making new methods anyway, we should throw `KeyNotFoundException` or a new custom exception type (`FieldNotFoundException`?) rather than `InvalidOperationException` in the case a field isn't found. * Optionally make `_fieldsIndexLookup` initialized lazily so it could be used by the new and old methods. It would probably make sense to support comparers other than `Ordinal` and `CurrentCulture` too. This could get complicated though, we'd need to consider thread safety and what happens if the same schema is queried with one comparer and then a different comparer. I can imagine that happening when code that uses the new and obsolete methods is mixed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org