[ 
https://issues.apache.org/jira/browse/ARROW-15878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503446#comment-17503446
 ] 

Yibo Cai edited comment on ARROW-15878 at 3/9/22, 9:41 AM:
-----------------------------------------------------------

We already count the quotes in each string to calculate the CSV cell length, so 
it may be desirable to run the optimized code only for strings with very few 
quotes, e.g., call the first approach (memchr + memcpy) only if 
{{(len(string) > 8) && (len(string) / #quotes > 8)}}.
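
As a rough sketch of that check: the function name below is hypothetical and not part of the Arrow code base, and the threshold of 8 is simply the untuned value suggested above. The ratio is read as a real-valued quotient and rewritten as a multiplication so that a string with zero quotes does not cause a division by zero.

```cpp
#include <cstddef>

// Hypothetical predicate (not Arrow's API): returns true when a string is
// long enough, and sparse enough in quotes, that a bulk-copy escaping path
// is likely to pay off. The constant 8 is the untuned value from the comment.
inline bool UseFastPath(std::size_t length, std::size_t num_quotes) {
  // Equivalent to (length > 8) && (length / num_quotes > 8) for a real-valued
  // quotient, but safe when num_quotes == 0.
  return length > 8 && length > 8 * num_quotes;
}
```

With this form, a 100-byte string with 12 quotes would take the optimized path, while the same string with 13 quotes would fall back to the existing loop.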


was (Author: yibo):
We already counted quotes for each string to calculate csv cell length, it may 
be desirable to run optimized code only for cell with very few quotes, e.g., 
call approach1 only if {{(len(string) > 8) && (len(string) / #quotes > 8)}}.

> [C++] Optimize csv writer for string with quotes
> ------------------------------------------------
>
>                 Key: ARROW-15878
>                 URL: https://issues.apache.org/jira/browse/ARROW-15878
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Yibo Cai
>            Assignee: Yibo Cai
>            Priority: Major
>         Attachments: wip.patch
>
>
> Escaping a string with quotes (putting an extra quote before each quote) is the 
> hotspot of the CSV writer [1]. This can probably be improved. Possible approaches:
>  - Find the next quote with memchr, then memcpy the blocks without quotes.
>  - Check for quotes with SIMD, 8 or 16 bytes at a time; if there are none, do a 
> memcpy, otherwise take the slow path.
> We should make sure the method doesn't decrease performance too much for strings 
> with many quotes. It should also give similar or better performance for short 
> strings, which are the common case.
> [1] 
> [https://github.com/apache/arrow/blob/master/cpp/src/arrow/csv/writer.cc#L139]
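
The first approach in the description (memchr to find the next quote, then a bulk copy of the quote-free block) could look roughly like the sketch below. AppendEscaped is a hypothetical helper writing into a std::string; it is not the code in writer.cc, which works on its own output buffer, and it says nothing about how the result would benchmark.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <string_view>

// Hypothetical helper (not Arrow's API): appends `in` to `out`, doubling
// every '"' character so the result can be placed inside a quoted CSV cell.
inline void AppendEscaped(std::string_view in, std::string* out) {
  const char* pos = in.data();
  const char* end = pos + in.size();
  while (pos < end) {
    // Locate the next quote in the remaining bytes.
    const void* hit =
        std::memchr(pos, '"', static_cast<std::size_t>(end - pos));
    if (hit == nullptr) {
      // No more quotes: copy the tail in one go.
      out->append(pos, static_cast<std::size_t>(end - pos));
      break;
    }
    const char* quote = static_cast<const char*>(hit);
    // Copy the quote-free block together with the quote itself, then add
    // the extra quote that escapes it.
    out->append(pos, static_cast<std::size_t>(quote - pos) + 1);
    out->push_back('"');
    pos = quote + 1;
  }
}
```

For strings with many quotes this does one memchr call per quote, which is the case the description warns about, so a guard on quote density would probably still be needed.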



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
