[ https://issues.apache.org/jira/browse/ARROW-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16900111#comment-16900111 ]
Wes McKinney commented on ARROW-6131:
-------------------------------------
[~yqGu] In which component of the project is UTF-8 validation affecting you?
Depending on the case, it might be possible to introduce an option to select a
different validation algorithm. I agree that maintaining the performance of the
all-ASCII case, particularly in the context of CSV files, is pretty important.
Even in business settings where the primary language is not English, many of
the CSV files produced for analytics are all-ASCII.
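
To make the trade-off concrete, here is a minimal sketch of how an all-ASCII
fast path can sit in front of a slower full validator, so that all-ASCII input
never reaches the non-ASCII code path. All names below (IsAllAscii,
ValidateUtf8Scalar, ValidateUtf8WithAsciiFastPath) are hypothetical and are not
Arrow's actual API. The scalar validator only stands in for whichever full
algorithm is selected, for example the range-based one. Note that for mostly
non-ASCII input the pre-scan is extra work, which is the cost being weighed here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: true if every byte is below 0x80.
// ORs eight bytes at a time and tests the high bit of each byte once at the end.
inline bool IsAllAscii(const std::uint8_t* data, std::size_t size) {
  constexpr std::uint64_t kHighBits = 0x8080808080808080ULL;
  std::uint64_t acc = 0;
  std::size_t i = 0;
  for (; i + 8 <= size; i += 8) {
    std::uint64_t word;
    std::memcpy(&word, data + i, sizeof(word));  // safe unaligned load
    acc |= word;
  }
  for (; i < size; ++i) acc |= data[i];  // remaining tail bytes
  return (acc & kHighBits) == 0;
}

// Hypothetical scalar validator (stand-in for the selected full algorithm).
// Rejects overlong forms, surrogates, and code points above U+10FFFF.
inline bool ValidateUtf8Scalar(const std::uint8_t* p, std::size_t n) {
  std::size_t i = 0;
  while (i < n) {
    const std::uint8_t b = p[i];
    if (b < 0x80) { ++i; continue; }
    std::size_t len = 0;
    std::uint8_t lo = 0x80, hi = 0xBF;  // allowed range of the second byte
    if (b >= 0xC2 && b <= 0xDF) {
      len = 2;
    } else if (b >= 0xE0 && b <= 0xEF) {
      len = 3;
      if (b == 0xE0) lo = 0xA0;  // no overlong 3-byte forms
      if (b == 0xED) hi = 0x9F;  // no UTF-16 surrogates
    } else if (b >= 0xF0 && b <= 0xF4) {
      len = 4;
      if (b == 0xF0) lo = 0x90;  // no overlong 4-byte forms
      if (b == 0xF4) hi = 0x8F;  // nothing above U+10FFFF
    } else {
      return false;  // stray continuation byte or invalid lead byte
    }
    if (n - i < len) return false;  // truncated sequence
    if (p[i + 1] < lo || p[i + 1] > hi) return false;
    for (std::size_t k = 2; k < len; ++k) {
      if ((p[i + k] & 0xC0) != 0x80) return false;
    }
    i += len;
  }
  return true;
}

// Hypothetical entry point: cheap ASCII check first, full validation otherwise.
inline bool ValidateUtf8WithAsciiFastPath(const std::uint8_t* data,
                                          std::size_t size) {
  assert(data != nullptr || size == 0);
  if (IsAllAscii(data, size)) return true;
  return ValidateUtf8Scalar(data, size);
}
```

An option of the kind mentioned above could then choose which full validator
the last function calls, without touching the ASCII check.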
> [C++] Optimize the Arrow UTF-8-string-validation
> -------------------------------------------------
>
> Key: ARROW-6131
> URL: https://issues.apache.org/jira/browse/ARROW-6131
> Project: Apache Arrow
> Issue Type: Improvement
> Reporter: Yuqi Gu
> Assignee: Yuqi Gu
> Priority: Major
>
> The new algorithm comes from https://github.com/cyb70289/utf8 (MIT license).
> Range-based algorithm:
> 1. Map each byte of the input string to a range table.
> 2. Use the Neon 'tbl' instruction for the table lookup.
> 3. Find the pattern and set the correct table index for each input byte.
> 4. Validate the input string.
> The algorithm speeds up UTF-8 validation by about 1.6x for the LargeNonAscii
> and SmallNonAscii cases, but it slows down the all-ASCII cases
> (where the input data is entirely ASCII).
> The benchmark API is
> {code:java}
> ValidateUTF8
> {code}
> As far as I know, all-ASCII data is unusual on the internet.
> Could you please tell me what the typical use cases for Apache Arrow are?
> Is the data that Arrow needs to validate made up of all-ASCII strings?
> If not, I'd like to submit the patch to accelerate non-ASCII validation.
> As for all-ASCII validation, I would like to propose another SIMD-based
> optimization in a separate JIRA.