[
https://issues.apache.org/jira/browse/ARROW-15878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yibo Cai updated ARROW-15878:
-----------------------------
Description:
Escaping a string that contains quotes (inserting an extra quote before each
quote) is the hotspot of the CSV writer [1]. This can probably be improved.
Possible approaches:
- Find the next quote with memchr, then memcpy the quote-free blocks.
- Check for quotes with SIMD, 8 or 16 bytes at a time; memcpy the chunk if it
has none, otherwise take the slow path.
We should make sure the method doesn't degrade performance too much for strings
with many quotes. Performance should also stay similar for short strings, which
are the common case.
[1]
[https://github.com/apache/arrow/blob/master/cpp/src/arrow/csv/writer.cc#L139]
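For illustration, the first approach (memchr to find the next quote, then a
bulk copy of the quote-free block) might look roughly like the sketch below.
This is not the Arrow implementation: EscapeQuotes is a hypothetical helper,
and appending to a std::string is an assumption made to keep the example
self-contained.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <string_view>

// Hypothetical helper (not Arrow code): appends `in` to `*out`, writing an
// extra '"' for every '"' found in the input.
void EscapeQuotes(std::string_view in, std::string* out) {
  const char* p = in.data();
  const char* end = p + in.size();
  while (p < end) {
    // Locate the next quote in the remaining input.
    const void* hit = std::memchr(p, '"', static_cast<std::size_t>(end - p));
    if (hit == nullptr) {
      // No more quotes: copy the rest in one block and stop.
      out->append(p, static_cast<std::size_t>(end - p));
      return;
    }
    const char* q = static_cast<const char*>(hit);
    // Copy the quote-free block together with the quote itself,
    // then add the extra quote that escapes it.
    out->append(p, static_cast<std::size_t>(q - p) + 1);
    out->push_back('"');
    p = q + 1;
  }
}
```

With this sketch, the input `say "hi"` becomes `say ""hi""`. Each quote costs
one memchr call plus one short copy, so strings with many quotes and very short
strings are the cases that would need benchmarking against the existing
byte-by-byte loop.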
was:
Escaping a string with quotes (put an extra quote before a quote) is the
hotspot of csv writer [1]. This can probably be improved, possible approaches:
- Find the next quote with memchr, then memcpy blocks without quotes.
- Check if there are quotes with simd in 8 bytes or 16 bytes, do memcpy if no,
otherwise go slow path.
Should make sure the method doesn't decrease performance too much for strings
with many quotes, or short strings.
[1] https://github.com/apache/arrow/blob/master/cpp/src/arrow/csv/writer.cc#L139
> [C++] Optimize csv writer for string with quotes
> ------------------------------------------------
>
> Key: ARROW-15878
> URL: https://issues.apache.org/jira/browse/ARROW-15878
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Yibo Cai
> Assignee: Yibo Cai
> Priority: Major
>
> Escaping a string that contains quotes (inserting an extra quote before each
> quote) is the hotspot of the CSV writer [1]. This can probably be improved.
> Possible approaches:
> - Find the next quote with memchr, then memcpy the quote-free blocks.
> - Check for quotes with SIMD, 8 or 16 bytes at a time; memcpy the chunk if it
> has none, otherwise take the slow path.
> We should make sure the method doesn't degrade performance too much for strings
> with many quotes. Performance should also stay similar for short strings, which
> are the common case.
> [1]
> [https://github.com/apache/arrow/blob/master/cpp/src/arrow/csv/writer.cc#L139]
--
This message was sent by Atlassian Jira
(v8.20.1#820001)