[jira] [Commented] (ARROW-6131) [C++] Optimize the Arrow UTF-8-string-validation
[ https://issues.apache.org/jira/browse/ARROW-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899742#comment-16899742 ]

Yuqi Gu commented on ARROW-6131:

The original UTF-8 benchmark:
{code:java}
Benchmark                       Time         CPU  Iterations
ValidateTinyAscii               7 ns        7 ns   107435339  1.42978GB/s
ValidateTinyNonAscii           16 ns       16 ns    42655054  639.503MB/s
ValidateSmallAscii             29 ns       29 ns    24516945  4.4671GB/s
ValidateSmallAlmostAscii       91 ns       91 ns     7677848  1.51182GB/s
ValidateSmallNonAscii         175 ns      175 ns     4009837  731.98MB/s
ValidateLargeAscii          18821 ns    18814 ns       37194  4.95077GB/s
ValidateLargeAlmostAscii    64056 ns    64025 ns       10929  1.45533GB/s
ValidateLargeNonAscii      130321 ns   130249 ns        5375  732.909MB/s
{code}

The new algorithm:
{code:java}
Benchmark                       Time         CPU  Iterations
ValidateTinyAscii               6 ns        6 ns   116427650  1.59527GB/s
ValidateTinyNonAscii           17 ns       17 ns    41897276  628.046MB/s
ValidateSmallAscii            117 ns      117 ns     5964896  1113.14MB/s
ValidateSmallAlmostAscii      145 ns      145 ns     4819232  971.76MB/s
ValidateSmallNonAscii         118 ns      118 ns     5947924  1085.68MB/s
ValidateLargeAscii          82297 ns    82247 ns        8511  1.13246GB/s
ValidateLargeAlmostAscii    81145 ns    81138 ns        8627  1.14838GB/s
ValidateLargeNonAscii       81221 ns    81202 ns        8621  1.14805GB/s
{code}

> [C++] Optimize the Arrow UTF-8-string-validation
> ------------------------------------------------
>
> Key: ARROW-6131
> URL: https://issues.apache.org/jira/browse/ARROW-6131
> Project: Apache Arrow
> Issue Type: Improvement
> Reporter: Yuqi Gu
> Assignee: Yuqi Gu
> Priority: Major
>
> The new Algorithm comes from: https://github.com/cyb70289/utf8 (MIT LICENSE)
> Range base algorithm:
> 1. Map each byte of input-string to Range table.
> 2. Leverage the Neon 'tbl' instruction to lookup table.
> 3. Find the pattern and set correct table index for each input byte
> 4. Validate input string.
> The Algorithm would improve utf8-validation ~1.6x Speedup for LargeNonAscii
> and SmallNonAscii. But the algorithm would deteriorate the All-Ascii cases
> (The input data is all ascii string).
> The benchmark API is
> {code:java}
> ValidateUTF8
> {code}
> As far as I know, the data that is all-ascii is unusual on the internet.
> Could you guys please tell me what's the use case scenario for Apache Arrow?
> Is the Arrow's data that need to be validated all-ascii string?
> If not, I'd like to submit the patch to accelerate the NonAscii validation.
> As for All-Ascii validation, I would like to propose another optimization
> solution with SIMD in another jira.

--
This message was sent by Atlassian JIRA (v7.6.14#76016)
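As a quick check on the "~1.6x" figure, the per-benchmark ratio can be computed directly from the Time column of the two runs above. This is only an arithmetic sketch: the dictionary names and the `speedup` helper are made up for illustration, and the values are the wall-clock times in nanoseconds copied from the comment.

```python
# Wall-clock times (ns) copied from the "original" and "new algorithm" runs.
# The names original_ns, new_ns and speedup are illustrative, not Arrow APIs.
original_ns = {
    "ValidateTinyAscii": 7,
    "ValidateTinyNonAscii": 16,
    "ValidateSmallAscii": 29,
    "ValidateSmallAlmostAscii": 91,
    "ValidateSmallNonAscii": 175,
    "ValidateLargeAscii": 18821,
    "ValidateLargeAlmostAscii": 64056,
    "ValidateLargeNonAscii": 130321,
}
new_ns = {
    "ValidateTinyAscii": 6,
    "ValidateTinyNonAscii": 17,
    "ValidateSmallAscii": 117,
    "ValidateSmallAlmostAscii": 145,
    "ValidateSmallNonAscii": 118,
    "ValidateLargeAscii": 82297,
    "ValidateLargeAlmostAscii": 81145,
    "ValidateLargeNonAscii": 81221,
}


def speedup(name: str) -> float:
    """Return old_time / new_time; values above 1.0 mean the new code is faster."""
    return original_ns[name] / new_ns[name]


if __name__ == "__main__":
    for name in original_ns:
        print(f"{name:<26}{speedup(name):6.2f}x")
```

Ratios above 1.0 are improvements and ratios below 1.0 are regressions, so the same table shows both the non-ASCII gain and the all-ASCII slowdown described in the issue.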
[jira] [Created] (ARROW-6131) [C++] Optimize the Arrow UTF-8-string-validation
Yuqi Gu created ARROW-6131:
---------------------------

Summary: [C++] Optimize the Arrow UTF-8-string-validation
Key: ARROW-6131
URL: https://issues.apache.org/jira/browse/ARROW-6131
Project: Apache Arrow
Issue Type: Improvement
Reporter: Yuqi Gu
Assignee: Yuqi Gu

The new Algorithm comes from: https://github.com/cyb70289/utf8 (MIT LICENSE)
Range base algorithm:
1. Map each byte of input-string to Range table.
2. Leverage the Neon 'tbl' instruction to lookup table.
3. Find the pattern and set correct table index for each input byte
4. Validate input string.
The Algorithm would improve utf8-validation ~1.6x Speedup for LargeNonAscii and SmallNonAscii. But the algorithm would deteriorate the All-Ascii cases (The input data is all ascii string).
The benchmark API is
{code:java}
ValidateUTF8
{code}
As far as I know, the data that is all-ascii is unusual on the internet.
Could you guys please tell me what's the use case scenario for Apache Arrow?
Is the Arrow's data that need to be validated all-ascii string?
If not, I'd like to submit the patch to accelerate the NonAscii validation.
As for All-Ascii validation, I would like to propose another optimization solution with SIMD in another jira.
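To make the "range" idea in the steps above concrete, here is a scalar sketch of range-based UTF-8 validation. It is not the NEON code from cyb70289/utf8 and not Arrow's ValidateUTF8: it checks one byte at a time against the well-formed byte ranges defined by the Unicode Standard, with no table lookup or SIMD. The names `validate_utf8_ranges` and `_lead_info` are hypothetical.

```python
from typing import Optional, Tuple


def _lead_info(b: int) -> Optional[Tuple[int, int, int]]:
    """For a non-ASCII lead byte, return (continuation_count, low, high).

    low..high is the allowed range of the FIRST continuation byte; later
    continuation bytes must always be in 0x80..0xBF. Returns None if b
    cannot start a well-formed sequence.
    """
    if 0xC2 <= b <= 0xDF:
        return 1, 0x80, 0xBF
    if b == 0xE0:
        return 2, 0xA0, 0xBF  # excludes overlong 3-byte forms
    if 0xE1 <= b <= 0xEC or 0xEE <= b <= 0xEF:
        return 2, 0x80, 0xBF
    if b == 0xED:
        return 2, 0x80, 0x9F  # excludes UTF-16 surrogates
    if b == 0xF0:
        return 3, 0x90, 0xBF  # excludes overlong 4-byte forms
    if 0xF1 <= b <= 0xF3:
        return 3, 0x80, 0xBF
    if b == 0xF4:
        return 3, 0x80, 0x8F  # excludes code points above U+10FFFF
    return None  # 0x80..0xC1 and 0xF5..0xFF are never valid lead bytes


def validate_utf8_ranges(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8 (scalar reference sketch)."""
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b <= 0x7F:  # ASCII byte, nothing else to check
            i += 1
            continue
        info = _lead_info(b)
        if info is None:
            return False
        count, low, high = info
        if i + count >= n:  # sequence is truncated
            return False
        if not low <= data[i + 1] <= high:
            return False
        for k in range(2, count + 1):
            if not 0x80 <= data[i + k] <= 0xBF:
                return False
        i += count + 1
    return True
```

For example, `validate_utf8_ranges("héllo".encode("utf-8"))` accepts the input, while `validate_utf8_ranges(b"\xc0\x80")` rejects the overlong encoding. A vectorized version would classify many bytes per instruction, which is where the table lookup mentioned in step 2 comes in.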
[jira] [Updated] (ARROW-6130) [Release] Use 0.15.0 as the next release
[ https://issues.apache.org/jira/browse/ARROW-6130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-6130:
---------------------------------
Labels: pull-request-available (was: )

> [Release] Use 0.15.0 as the next release
> ----------------------------------------
>
> Key: ARROW-6130
> URL: https://issues.apache.org/jira/browse/ARROW-6130
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Packaging
> Reporter: Sutou Kouhei
> Assignee: Sutou Kouhei
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.15.0
[jira] [Created] (ARROW-6130) [Release] Use 0.15.0 as the next release
Sutou Kouhei created ARROW-6130:
--------------------------------

Summary: [Release] Use 0.15.0 as the next release
Key: ARROW-6130
URL: https://issues.apache.org/jira/browse/ARROW-6130
Project: Apache Arrow
Issue Type: Improvement
Components: Packaging
Reporter: Sutou Kouhei
Assignee: Sutou Kouhei
Fix For: 0.15.0
[jira] [Updated] (ARROW-6104) [Rust] [DataFusion] Don't allow bare_trait_objects
[ https://issues.apache.org/jira/browse/ARROW-6104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-6104:
---------------------------------
Labels: pull-request-available (was: )

> [Rust] [DataFusion] Don't allow bare_trait_objects
> --------------------------------------------------
>
> Key: ARROW-6104
> URL: https://issues.apache.org/jira/browse/ARROW-6104
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Rust - DataFusion
> Reporter: Andy Grove
> Assignee: Andy Grove
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.15.0
>
> Need to remove "{color:#808080}#![allow(bare_trait_objects)]" from cargo.toml
> and fix compiler warnings
> {color}
[jira] [Updated] (ARROW-6129) Row_groups duplicate Rows
[ https://issues.apache.org/jira/browse/ARROW-6129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

albertoramon updated ARROW-6129:
-------------------------------
Description:
Using Row_Groups to write Parquet, duplicate rows:
Input: CSV 10 Rows
Row_Groups=1 --> Output 10 Rows
Row_Groups=2 --> Output 20 Rows
!tes_output.png!
Is this the expected?
attached code snippet and CSV

was:
Using Row_Groups to write Parquet, duplicate date:
Input: CSV 10 Rows
Row_Groups=1 --> Output 10 Rows
!tes_output.png!
Row_Groups=2 --> Output 20 Rows
Is this the expected?
[^test01.py]

> Row_groups duplicate Rows
> -------------------------
>
> Key: ARROW-6129
> URL: https://issues.apache.org/jira/browse/ARROW-6129
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.14.1
> Reporter: albertoramon
> Priority: Major
> Attachments: tes_output.png, test01.py, top10.csv
>
> Using Row_Groups to write Parquet, duplicate rows:
> Input: CSV 10 Rows
> Row_Groups=1 --> Output 10 Rows
> Row_Groups=2 --> Output 20 Rows
> !tes_output.png!
> Is this the expected?
> attached code snippet and CSV
[jira] [Created] (ARROW-6129) Row_groups duplicate Rows
albertoramon created ARROW-6129:
--------------------------------

Summary: Row_groups duplicate Rows
Key: ARROW-6129
URL: https://issues.apache.org/jira/browse/ARROW-6129
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.14.1
Reporter: albertoramon
Attachments: tes_output.png, test01.py, top10.csv

Using Row_Groups to write Parquet, duplicate date:
Input: CSV 10 Rows
Row_Groups=1 --> Output 10 Rows
!tes_output.png!
Row_Groups=2 --> Output 20 Rows
Is this the expected?
[^test01.py]
[jira] [Commented] (ARROW-6119) [Python] PyArrow import fails on Windows Python 3.7
[ https://issues.apache.org/jira/browse/ARROW-6119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899659#comment-16899659 ]

Paul Suganthan commented on ARROW-6119:
---------------------------------------

pip freeze

altgraph==0.15
appdirs==1.4.3
backcall==0.1.0
cachetools==2.1.0
colorama==0.3.9
cycler==0.10.0
Cython==0.28.4
decorator==4.3.0
google-api-python-client==1.7.4
google-auth==1.5.0
google-auth-httplib2==0.0.3
httplib2==0.11.3
jedi==0.12.1
kiwisolver==1.0.1
macholib==1.9
matplotlib==2.2.2
nose==1.3.7
numpy==1.17.0
packaging==17.1
pandas==0.23.3
parso==0.3.1
pickleshare==0.7.4
pip==19.2.1
prompt-toolkit==2.0.4
psutil==5.4.6
pyarrow==0.14.0
pyasn1==0.4.3
pyasn1-modules==0.2.2
Pygments==2.2.0
pyparsing==2.2.0
pyreadline==2.1
python-dateutil==2.7.3
pytz==2018.5
rsa==3.4.2
setuptools==41.0.1
simplegeneric==0.8.1
six==1.11.0
Tempita==0.5.2
traitlets==4.3.2
uritemplate==3.0.0
virtualenv==16.0.0
wcwidth==0.1.7
wheel==0.33.4
win-unicode-console==0.5

Content of "C:\Python37\lib\site-packages\pyarrow"

__init__.pxd
__init__.py
__pycache__
_csv.cp37-win_amd64.pyd
_csv.cpp
_csv.pyx
_cuda.pxd
_cuda.pyx
_flight.cp37-win_amd64.pyd
_flight.cpp
_flight.pyx
_generated_version.py
_json.cp37-win_amd64.pyd
_json.cpp
_json.pyx
_orc.pxd
_orc.pyx
_parquet.cp37-win_amd64.pyd
_parquet.cpp
_parquet.pxd
_parquet.pyx
_plasma.pyx
array.pxi
arrow.dll
arrow.lib
arrow_flight.dll
arrow_flight.lib
arrow_python.dll
arrow_python.lib
benchmark.pxi
benchmark.py
builder.pxi
cares.dll
compat.py
csv.py
cuda.py
error.pxi
feather.pxi
feather.py
filesystem.py
flight.py
gandiva.cp37-win_amd64.pyd
gandiva.cpp
gandiva.dll
gandiva.lib
gandiva.pyx
hdfs.py
include
includes
io.pxi
io-hdfs.pxi
ipc.pxi
ipc.py
json.py
jvm.py
lib.cp37-win_amd64.pyd
lib.cpp
lib.pxd
lib.pyx
lib_api.h
libcrypto-1_1-x64.dll
libprotobuf.dll
libssl-1_1-x64.dll
memory.pxi
orc.py
pandas_compat.py
pandas-shim.pxi
parquet.dll
parquet.lib
parquet.py
plasma.py
public-api.pxi
scalar.pxi
serialization.pxi
serialization.py
table.pxi
tensorflow
tests
types.pxi
types.py
util.py
zlib.dll

> [Python] PyArrow import fails on Windows Python 3.7
> ---------------------------------------------------
>
> Key: ARROW-6119
> URL: https://issues.apache.org/jira/browse/ARROW-6119
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.14.0
> Environment: Windows, Python 3.7
> Reporter: Paul Suganthan
> Priority: Major
>
> Traceback (most recent call last):
> File "", line 1, in
> File "C:\Python37\lib\site-packages\pyarrow\__init__.py", line 49, in
> from pyarrow.lib import cpu_count, set_cpu_count
> ImportError: DLL load failed: The specified procedure could not be found.
[jira] [Updated] (ARROW-5772) [GLib][Plasma][CUDA] Plasma::Client#refer_object test is failed
[ https://issues.apache.org/jira/browse/ARROW-5772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sutou Kouhei updated ARROW-5772:
-------------------------------
Issue Type: Bug (was: Test)

> [GLib][Plasma][CUDA] Plasma::Client#refer_object test is failed
> ---------------------------------------------------------------
>
> Key: ARROW-5772
> URL: https://issues.apache.org/jira/browse/ARROW-5772
> Project: Apache Arrow
> Issue Type: Bug
> Components: GLib
> Affects Versions: 0.14.0
> Reporter: Sutou Kouhei
> Assignee: Sutou Kouhei
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.0.0
>
> Time Spent: 50m
> Remaining Estimate: 0h
>
> {noformat}
> /home/kou/work/cpp/arrow.kou/c_glib/test/plasma/test-plasma-client.rb:75:in `block (2 levels) in '
> /tmp/local/lib/ruby/gems/2.7.0/gems/gobject-introspection-3.3.6/lib/gobject-introspection/loader.rb:533:in `block in define_method'
> /tmp/local/lib/ruby/gems/2.7.0/gems/gobject-introspection-3.3.6/lib/gobject-introspection/loader.rb:616:in `invoke'
> /tmp/local/lib/ruby/gems/2.7.0/gems/gobject-introspection-3.3.6/lib/gobject-introspection/loader.rb:616:in `invoke'
> Error: test: options: GPU device(TestPlasmaClient::#create):
> Arrow::Error::Io: [plasma][client][refer-object]: IOError: Cuda Driver API call in ../src/arrow/gpu/cuda_context.cc at line 156 failed with code 208: cuIpcOpenMemHandle(, *handle, CU_IPC_MEM_LAZY_ENABLE_PEER_ACCESS)
> In ../src/arrow/gpu/cuda_context.cc, line 341, code: impl_->OpenIpcBuffer(ipc_handle, )
> In ../src/plasma/client.cc, line 586, code: context->OpenIpcBuffer(*object->ipc_handle, _handle->ptr)
> {noformat}