arvidjonasson opened a new issue, #48977: URL: https://github.com/apache/arrow/issues/48977
### Describe the bug, including details regarding any error messages, version, and platform. `arrow::schema` construction performance degrades to quadratic time on libc++ when there's a lot of fields with duplicate or no names. This is most critical when creating wide schemas programmatically. The performance degradation is caused by the function [CreateNameToIndexMap](https://github.com/apache/arrow/blob/75ef03165dc7e54aca0f842f54598833ddbd0a01/cpp/src/arrow/type.cc#L1332-L1339). The function calls `std::unordered_multimap::emplace` once per field. On libc++, this function call has linear runtime (amortized) for duplicate keys because it iterates past every existing entry to emplace the new element at the end of the range (see LLVM bugzilla [16747](https://bugs.llvm.org/show_bug.cgi?id=16747) and [21275](https://bugs.llvm.org/show_bug.cgi?id=21275)). Some performance benchmarks creating a schema with N int32 fields: | N | Time (s) |Growth factor (vs previous input) | | - | - | - | | 500 | 0.000464958s | NA | | 5000 | 0.041429s | 89.1x | | 50000 | 4.12813s | 99.6x | | 500000 | 416.985s | 101.0x | One fix is to use `map.emplace_hint(map.find(key), ...)` as recommended by libc++ maintainers. This avoids the linear scan but may reverse order for duplicates. Alternatively, using `std::unordered_map<std::string_view, std::vector<int>>` keeps the order but is a slightly larger refactor. The same benchmark fixing the issue with the latter approach (unordered_map + vector): | N | Time (s) | Growth factor (vs previous input) | | - | - | - | | 500 | 0.000040s | NA | | 5000 | 0.000066s | 1.65x | | 50000 | 0.000681s | 10.3x | | 500000 | 0.007084s | 10.4x | I can submit a PR if a fix is worth pursuing. ### Component(s) C++ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
