arvidjonasson opened a new issue, #48977:
URL: https://github.com/apache/arrow/issues/48977

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   `arrow::schema` construction performance degrades to quadratic time on 
libc++ when there's a lot of fields with duplicate or no names. This is most 
critical when creating wide schemas programmatically.
   
   The performance degradation is caused by the function 
[CreateNameToIndexMap](https://github.com/apache/arrow/blob/75ef03165dc7e54aca0f842f54598833ddbd0a01/cpp/src/arrow/type.cc#L1332-L1339).
 The function calls `std::unordered_multimap::emplace` once per field. On 
libc++, this function call has linear runtime (amortized) for duplicate keys 
because it iterates past every existing entry to emplace the new element at the 
end of the range (see LLVM bugzilla 
[16747](https://bugs.llvm.org/show_bug.cgi?id=16747) and 
[21275](https://bugs.llvm.org/show_bug.cgi?id=21275)).
   
   Some performance benchmarks creating a schema with N int32 fields:
   | N | Time (s) |Growth factor (vs previous input) |
   | - | - | - |
   | 500    | 0.000464958s | NA |
   | 5000   | 0.041429s | 89.1x |
   | 50000  | 4.12813s  | 99.6x |
   | 500000 | 416.985s  | 101.0x |
   
   One fix is to use `map.emplace_hint(map.find(key), ...)` as recommended by 
libc++ maintainers. This avoids the linear scan but may reverse order for 
duplicates. Alternatively, using `std::unordered_map<std::string_view, 
std::vector<int>>` keeps the order but is a slightly larger refactor.
   
   The same benchmark fixing the issue with the latter approach (unordered_map 
+ vector):
   | N | Time (s) | Growth factor (vs previous input) |
   | - | - | - |
   | 500    | 0.000040s | NA |
   | 5000   | 0.000066s | 1.65x |
   | 50000  | 0.000681s | 10.3x |
   | 500000 | 0.007084s | 10.4x |
   
   I can submit a PR if a fix is worth pursuing.
   
   ### Component(s)
   
   C++


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to