[jira] [Updated] (ARROW-6131) [C++] Optimize the Arrow UTF-8-string-validation

Wes McKinney (Jira) Fri, 10 Apr 2020 07:36:28 -0700


     [ https://issues.apache.org/jira/browse/ARROW-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Wes McKinney updated ARROW-6131:
--------------------------------
    Component/s: C++

> [C++]  Optimize the Arrow UTF-8-string-validation
> -------------------------------------------------
>
>                 Key: ARROW-6131
>                 URL: https://issues.apache.org/jira/browse/ARROW-6131
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Yuqi Gu
>            Assignee: Yuqi Gu
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> The new algorithm comes from https://github.com/cyb70289/utf8 (MIT license).
> Range-based algorithm:
>   1. Map each byte of the input string to an index into a range table.
>   2. Use the Neon 'tbl' instruction to perform the table lookups.
>   3. Find the pattern and set the correct table index for each input byte.
>   4. Validate the input string.
> The algorithm would speed up UTF-8 validation by about 1.6x for the
> LargeNonAscii and SmallNonAscii benchmarks, but it would slow down the
> all-ASCII cases (where the input data is entirely ASCII).
> The benchmark API is
> {code:java}
> ValidateUTF8
> {code}
> As far as I know, data that is entirely ASCII is unusual on the internet.
> Could you please tell me what the typical use cases for Apache Arrow are?
> Do the strings that Arrow needs to validate consist entirely of ASCII?
> If not, I'd like to submit the patch to accelerate non-ASCII validation.
> As for all-ASCII validation, I would like to propose a separate SIMD
> optimization in another Jira issue.
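
The range-based steps in the description above can be illustrated with a
scalar sketch. This is not the code from the patch or from the referenced
repository: it processes one byte at a time and uses plain array lookups where
the real implementation would use Neon 'tbl' on 16 bytes at once. The function
name ValidateUtf8RangeSketch and all table names are hypothetical, chosen only
for this illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

namespace {

// Allowed [min, max] byte values for each range index (assumed layout).
// 0: ASCII, 1-3: continuation 80..BF, 4: after E0, 5: after ED,
// 6: after F0, 7: after F4, 8: lead bytes C2..F4, 9-15: always invalid.
constexpr uint8_t kRangeMin[16] = {0x00, 0x80, 0x80, 0x80, 0xA0, 0x80,
                                   0x90, 0x80, 0xC2, 0xFF, 0xFF, 0xFF,
                                   0xFF, 0xFF, 0xFF, 0xFF};
constexpr uint8_t kRangeMax[16] = {0x7F, 0xBF, 0xBF, 0xBF, 0xBF, 0x9F,
                                   0xBF, 0x8F, 0xF4, 0x00, 0x00, 0x00,
                                   0x00, 0x00, 0x00, 0x00};

// Indexed by the high nibble of a byte: number of continuation bytes that
// must follow it, and the range index the byte itself starts from.
constexpr uint8_t kFollowLen[16] = {0, 0, 0, 0, 0, 0, 0, 0,
                                    0, 0, 0, 0, 1, 1, 2, 3};
constexpr uint8_t kFirstRange[16] = {0, 0, 0, 0, 0, 0, 0, 0,
                                     0, 0, 0, 0, 8, 8, 8, 8};

}  // namespace

// Scalar illustration of range-based UTF-8 validation (hypothetical helper).
bool ValidateUtf8RangeSketch(const uint8_t* data, size_t size) {
  // Follow lengths of the bytes one, two and three positions back.
  unsigned len1 = 0, len2 = 0, len3 = 0;
  unsigned prev = 0;
  for (size_t i = 0; i < size; ++i) {
    const unsigned b = data[i];
    // Steps 1-2: map the byte to a range index via table lookup.
    unsigned range = kFirstRange[b >> 4];
    // Step 3: a byte inside a multi-byte sequence gets a continuation index.
    range |= len1;
    range |= (len2 > 1) ? len2 - 1 : 0;
    range |= (len3 > 2) ? len3 - 2 : 0;
    // Special second-byte ranges reject overlongs, surrogates, > U+10FFFF.
    if (prev == 0xE0) range += 2;
    if (prev == 0xED) range += 3;
    if (prev == 0xF0) range += 3;
    if (prev == 0xF4) range += 4;
    // Step 4: the byte must fall inside its selected range.
    if (b < kRangeMin[range] || b > kRangeMax[range]) return false;
    len3 = len2;
    len2 = len1;
    len1 = kFollowLen[b >> 4];
    prev = b;
  }
  // Reject input that ends in the middle of a multi-byte sequence.
  return len1 == 0 && len2 <= 1 && len3 <= 2;
}

// Convenience overload for std::string input.
bool ValidateUtf8RangeSketch(const std::string& s) {
  return ValidateUtf8RangeSketch(
      reinterpret_cast<const uint8_t*>(s.data()), s.size());
}
```

In this sketch every byte, including plain ASCII, goes through the same
lookups and range check, which is consistent with the observation above that
the approach does not help all-ASCII input.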



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
