[jira] [Commented] (ARROW-6131) [C++] Optimize the Arrow UTF-8-string-validation

2019-08-04 Thread Yuqi Gu (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899742#comment-16899742
 ] 

Yuqi Gu commented on ARROW-6131:


The original UTF-8 benchmark:

{code:java}
Benchmark                      Time        CPU  Iterations  Throughput
ValidateTinyAscii              7 ns       7 ns   107435339  1.42978GB/s
ValidateTinyNonAscii          16 ns      16 ns    42655054  639.503MB/s
ValidateSmallAscii            29 ns      29 ns    24516945  4.4671GB/s
ValidateSmallAlmostAscii      91 ns      91 ns     7677848  1.51182GB/s
ValidateSmallNonAscii        175 ns     175 ns     4009837  731.98MB/s
ValidateLargeAscii         18821 ns   18814 ns       37194  4.95077GB/s
ValidateLargeAlmostAscii   64056 ns   64025 ns       10929  1.45533GB/s
ValidateLargeNonAscii     130321 ns  130249 ns        5375  732.909MB/s
{code}


The new algorithm:

{code:java}
Benchmark                      Time        CPU  Iterations  Throughput
ValidateTinyAscii              6 ns       6 ns   116427650  1.59527GB/s
ValidateTinyNonAscii          17 ns      17 ns    41897276  628.046MB/s
ValidateSmallAscii           117 ns     117 ns     5964896  1113.14MB/s
ValidateSmallAlmostAscii     145 ns     145 ns     4819232  971.76MB/s
ValidateSmallNonAscii        118 ns     118 ns     5947924  1085.68MB/s
ValidateLargeAscii         82297 ns   82247 ns        8511  1.13246GB/s
ValidateLargeAlmostAscii   81145 ns   81138 ns        8627  1.14838GB/s
ValidateLargeNonAscii      81221 ns   81202 ns        8621  1.14805GB/s
{code}
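
To make the two runs easier to compare, the timings can be turned into per-benchmark speedup ratios. The sketch below is illustrative only: it copies the Time column from the two listings above, and the names `ORIGINAL_NS`, `NEW_NS` and `speedups` are made up for this example rather than taken from the Arrow benchmark code.

```python
# Wall-clock time per iteration in nanoseconds, copied from the two
# benchmark listings above (original validator vs. new algorithm).
ORIGINAL_NS = {
    "ValidateTinyAscii": 7,
    "ValidateTinyNonAscii": 16,
    "ValidateSmallAscii": 29,
    "ValidateSmallAlmostAscii": 91,
    "ValidateSmallNonAscii": 175,
    "ValidateLargeAscii": 18821,
    "ValidateLargeAlmostAscii": 64056,
    "ValidateLargeNonAscii": 130321,
}
NEW_NS = {
    "ValidateTinyAscii": 6,
    "ValidateTinyNonAscii": 17,
    "ValidateSmallAscii": 117,
    "ValidateSmallAlmostAscii": 145,
    "ValidateSmallNonAscii": 118,
    "ValidateLargeAscii": 82297,
    "ValidateLargeAlmostAscii": 81145,
    "ValidateLargeNonAscii": 81221,
}


def speedups(before, after):
    """Return before/after time ratios; values above 1.0 mean the new run is faster."""
    return {name: before[name] / after[name] for name in before}


if __name__ == "__main__":
    for name, ratio in speedups(ORIGINAL_NS, NEW_NS).items():
        print(f"{name:26s}{ratio:5.2f}x")
```

With these numbers ValidateLargeNonAscii comes out at about 1.6x, while ValidateSmallAscii and ValidateLargeAscii fall well below 1.0x.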





> [C++]  Optimize the Arrow UTF-8-string-validation
> -
>
> Key: ARROW-6131
> URL: https://issues.apache.org/jira/browse/ARROW-6131
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Yuqi Gu
>Assignee: Yuqi Gu
>Priority: Major
>
> The new algorithm comes from https://github.com/cyb70289/utf8 (MIT license).
> Range-based algorithm:
>   1. Map each byte of the input string to a range table.
>   2. Use the NEON 'tbl' instruction for the table lookup.
>   3. Find the pattern and set the correct table index for each input byte.
>   4. Validate the input string.
> The algorithm gives a ~1.6x speedup in UTF-8 validation for LargeNonAscii
> and SmallNonAscii, but it degrades the all-ASCII cases (where the input
> data is an all-ASCII string).
> The benchmark API is
> {code:java}
> ValidateUTF8
> {code}
> As far as I know, all-ASCII data is unusual on the internet.
> Could you please tell me what the typical use case for Apache Arrow is?
> Does the data that Arrow needs to validate consist of all-ASCII strings?
> If not, I'd like to submit the patch to accelerate non-ASCII validation.
> As for all-ASCII validation, I would like to propose another SIMD-based
> optimization in a separate JIRA.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6131) [C++] Optimize the Arrow UTF-8-string-validation

2019-08-04 Thread Yuqi Gu (JIRA)
Yuqi Gu created ARROW-6131:
--

 Summary: [C++]  Optimize the Arrow UTF-8-string-validation
 Key: ARROW-6131
 URL: https://issues.apache.org/jira/browse/ARROW-6131
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Yuqi Gu
Assignee: Yuqi Gu


The new algorithm comes from https://github.com/cyb70289/utf8 (MIT license).

Range-based algorithm:
  1. Map each byte of the input string to a range table.
  2. Use the NEON 'tbl' instruction for the table lookup.
  3. Find the pattern and set the correct table index for each input byte.
  4. Validate the input string.
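
The per-byte range check behind these steps can be sketched in scalar form. The code below is not the NEON implementation from the linked repository and uses no table-lookup instruction. It is a minimal illustration, assuming the well-formed byte ranges defined by the Unicode standard, of how a lead byte decides which range each following byte must fall into. The names `follower_ranges` and `validate_utf8` are hypothetical.

```python
def follower_ranges(lead):
    """Return the allowed (low, high) range for each byte following `lead`.

    An empty list means `lead` is a complete ASCII character; None means
    `lead` cannot start a well-formed UTF-8 sequence.
    """
    tail = (0x80, 0xBF)  # ordinary continuation byte
    if lead <= 0x7F:
        return []
    if 0xC2 <= lead <= 0xDF:
        return [tail]
    if lead == 0xE0:
        return [(0xA0, 0xBF), tail]  # excludes overlong 3-byte forms
    if lead == 0xED:
        return [(0x80, 0x9F), tail]  # excludes UTF-16 surrogates
    if 0xE1 <= lead <= 0xEF:
        return [tail, tail]
    if lead == 0xF0:
        return [(0x90, 0xBF), tail, tail]  # excludes overlong 4-byte forms
    if 0xF1 <= lead <= 0xF3:
        return [tail, tail, tail]
    if lead == 0xF4:
        return [(0x80, 0x8F), tail, tail]  # excludes code points above U+10FFFF
    return None  # 0x80..0xC1 and 0xF5..0xFF are never valid lead bytes


def validate_utf8(data):
    """Return True if `data` (a bytes object) is well-formed UTF-8."""
    i = 0
    while i < len(data):
        ranges = follower_ranges(data[i])
        if ranges is None:
            return False
        followers = data[i + 1 : i + 1 + len(ranges)]
        if len(followers) != len(ranges):
            return False  # sequence truncated at end of input
        for byte, (low, high) in zip(followers, ranges):
            if not low <= byte <= high:
                return False
        i += 1 + len(ranges)
    return True
```

A SIMD version, as outlined in the steps above, would perform the range lookup for a whole vector of bytes at once instead of one byte per loop iteration.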

The algorithm gives a ~1.6x speedup in UTF-8 validation for LargeNonAscii and
SmallNonAscii, but it degrades the all-ASCII cases (where the input data is an
all-ASCII string).
The benchmark API is
{code:java}
ValidateUTF8
{code}


As far as I know, all-ASCII data is unusual on the internet.
Could you please tell me what the typical use case for Apache Arrow is?
Does the data that Arrow needs to validate consist of all-ASCII strings?

If not, I'd like to submit the patch to accelerate non-ASCII validation.

As for all-ASCII validation, I would like to propose another SIMD-based
optimization in a separate JIRA.



















[jira] [Updated] (ARROW-6130) [Release] Use 0.15.0 as the next release

2019-08-04 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6130:
--
Labels: pull-request-available  (was: )

> [Release] Use 0.15.0 as the next release
> 
>
> Key: ARROW-6130
> URL: https://issues.apache.org/jira/browse/ARROW-6130
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging
>Reporter: Sutou Kouhei
>Assignee: Sutou Kouhei
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>






[jira] [Created] (ARROW-6130) [Release] Use 0.15.0 as the next release

2019-08-04 Thread Sutou Kouhei (JIRA)
Sutou Kouhei created ARROW-6130:
---

 Summary: [Release] Use 0.15.0 as the next release
 Key: ARROW-6130
 URL: https://issues.apache.org/jira/browse/ARROW-6130
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging
Reporter: Sutou Kouhei
Assignee: Sutou Kouhei
 Fix For: 0.15.0








[jira] [Updated] (ARROW-6104) [Rust] [DataFusion] Don't allow bare_trait_objects

2019-08-04 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6104:
--
Labels: pull-request-available  (was: )

> [Rust] [DataFusion] Don't allow bare_trait_objects
> --
>
> Key: ARROW-6104
> URL: https://issues.apache.org/jira/browse/ARROW-6104
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>
> Need to remove "#![allow(bare_trait_objects)]" from cargo.toml and fix the
> compiler warnings.





[jira] [Updated] (ARROW-6129) Row_groups duplicate Rows

2019-08-04 Thread albertoramon (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

albertoramon updated ARROW-6129:

Description: 
Writing Parquet with Row_Groups duplicates rows:

    Input: CSV 10 Rows

    Row_Groups=1 --> Output 10 Rows

    Row_Groups=2 --> Output 20 Rows

  !tes_output.png!

Is this expected?
Code snippet and CSV attached.

  was:
Using Row_Groups to write Parquet, duplicate date:

Input: CSV 10 Rows

Row_Groups=1 --> Output 10 Rows !tes_output.png!

Row_Groups=2 --> Output 20 Rows

 

Is this the expected?
[^test01.py]


> Row_groups duplicate Rows
> -
>
> Key: ARROW-6129
> URL: https://issues.apache.org/jira/browse/ARROW-6129
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.1
>Reporter: albertoramon
>Priority: Major
> Attachments: tes_output.png, test01.py, top10.csv
>
>
> Writing Parquet with Row_Groups duplicates rows:
>     Input: CSV 10 Rows
>     Row_Groups=1 --> Output 10 Rows
>     Row_Groups=2 --> Output 20 Rows
>   !tes_output.png!
> Is this expected?
> Code snippet and CSV attached.





[jira] [Created] (ARROW-6129) Row_groups duplicate Rows

2019-08-04 Thread albertoramon (JIRA)
albertoramon created ARROW-6129:
---

 Summary: Row_groups duplicate Rows
 Key: ARROW-6129
 URL: https://issues.apache.org/jira/browse/ARROW-6129
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.14.1
Reporter: albertoramon
 Attachments: tes_output.png, test01.py, top10.csv

Writing Parquet with Row_Groups duplicates data:

Input: CSV 10 Rows

Row_Groups=1 --> Output 10 Rows !tes_output.png!

Row_Groups=2 --> Output 20 Rows

Is this expected?
[^test01.py]





[jira] [Commented] (ARROW-6119) [Python] PyArrow import fails on Windows Python 3.7

2019-08-04 Thread Paul Suganthan (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899659#comment-16899659
 ] 

Paul Suganthan commented on ARROW-6119:
---

pip freeze
altgraph==0.15
appdirs==1.4.3
backcall==0.1.0
cachetools==2.1.0
colorama==0.3.9
cycler==0.10.0
Cython==0.28.4
decorator==4.3.0
google-api-python-client==1.7.4
google-auth==1.5.0
google-auth-httplib2==0.0.3
httplib2==0.11.3
jedi==0.12.1
kiwisolver==1.0.1
macholib==1.9
matplotlib==2.2.2
nose==1.3.7
numpy==1.17.0
packaging==17.1
pandas==0.23.3
parso==0.3.1
pickleshare==0.7.4
pip==19.2.1
prompt-toolkit==2.0.4
psutil==5.4.6
pyarrow==0.14.0
pyasn1==0.4.3
pyasn1-modules==0.2.2
Pygments==2.2.0
pyparsing==2.2.0
pyreadline==2.1
python-dateutil==2.7.3
pytz==2018.5
rsa==3.4.2
setuptools==41.0.1
simplegeneric==0.8.1
six==1.11.0
Tempita==0.5.2
traitlets==4.3.2
uritemplate==3.0.0
virtualenv==16.0.0
wcwidth==0.1.7
wheel==0.33.4
win-unicode-console==0.5
 

Content of "C:\Python37\lib\site-packages\pyarrow"
__init__.pxd
__init__.py
__pycache__
_csv.cp37-win_amd64.pyd
_csv.cpp
_csv.pyx
_cuda.pxd
_cuda.pyx
_flight.cp37-win_amd64.pyd
_flight.cpp
_flight.pyx
_generated_version.py
_json.cp37-win_amd64.pyd
_json.cpp
_json.pyx
_orc.pxd
_orc.pyx
_parquet.cp37-win_amd64.pyd
_parquet.cpp
_parquet.pxd
_parquet.pyx
_plasma.pyx
array.pxi
arrow.dll
arrow.lib
arrow_flight.dll
arrow_flight.lib
arrow_python.dll
arrow_python.lib
benchmark.pxi
benchmark.py
builder.pxi
cares.dll
compat.py
csv.py
cuda.py
error.pxi
feather.pxi
feather.py
filesystem.py
flight.py
gandiva.cp37-win_amd64.pyd
gandiva.cpp
gandiva.dll
gandiva.lib
gandiva.pyx
hdfs.py
include
includes
io.pxi
io-hdfs.pxi
ipc.pxi
ipc.py
json.py
jvm.py
lib.cp37-win_amd64.pyd
lib.cpp
lib.pxd
lib.pyx
lib_api.h
libcrypto-1_1-x64.dll
libprotobuf.dll
libssl-1_1-x64.dll
memory.pxi
orc.py
pandas_compat.py
pandas-shim.pxi
parquet.dll
parquet.lib
parquet.py
plasma.py
public-api.pxi
scalar.pxi
serialization.pxi
serialization.py
table.pxi
tensorflow
tests
types.pxi
types.py
util.py
zlib.dll

> [Python] PyArrow import fails on Windows Python 3.7
> ---
>
> Key: ARROW-6119
> URL: https://issues.apache.org/jira/browse/ARROW-6119
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.0
> Environment: Windows, Python 3.7
>Reporter: Paul Suganthan
>Priority: Major
>
> Traceback (most recent call last):
>   File "", line 1, in <module>
>   File "C:\Python37\lib\site-packages\pyarrow\__init__.py", line 49, in <module>
>     from pyarrow.lib import cpu_count, set_cpu_count
> ImportError: DLL load failed: The specified procedure could not be found.





[jira] [Updated] (ARROW-5772) [GLib][Plasma][CUDA] Plasma::Client#refer_object test is failed

2019-08-04 Thread Sutou Kouhei (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sutou Kouhei updated ARROW-5772:

Issue Type: Bug  (was: Test)

> [GLib][Plasma][CUDA] Plasma::Client#refer_object test is failed
> ---
>
> Key: ARROW-5772
> URL: https://issues.apache.org/jira/browse/ARROW-5772
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: GLib
>Affects Versions: 0.14.0
>Reporter: Sutou Kouhei
>Assignee: Sutou Kouhei
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> {noformat}
> /home/kou/work/cpp/arrow.kou/c_glib/test/plasma/test-plasma-client.rb:75:in 
> `block (2 levels) in '
> /tmp/local/lib/ruby/gems/2.7.0/gems/gobject-introspection-3.3.6/lib/gobject-introspection/loader.rb:533:in
>  `block in define_method'
> /tmp/local/lib/ruby/gems/2.7.0/gems/gobject-introspection-3.3.6/lib/gobject-introspection/loader.rb:616:in
>  `invoke'
> /tmp/local/lib/ruby/gems/2.7.0/gems/gobject-introspection-3.3.6/lib/gobject-introspection/loader.rb:616:in
>  `invoke'
> Error: test: options: GPU device(TestPlasmaClient::#create):
>   Arrow::Error::Io: [plasma][client][refer-object]: IOError: Cuda Driver API 
> call in ../src/arrow/gpu/cuda_context.cc at line 156 failed with code 208: 
> cuIpcOpenMemHandle(, *handle, CU_IPC_MEM_LAZY_ENABLE_PEER_ACCESS)
>   In ../src/arrow/gpu/cuda_context.cc, line 341, code: 
> impl_->OpenIpcBuffer(ipc_handle, )
>   In ../src/plasma/client.cc, line 586, code: 
> context->OpenIpcBuffer(*object->ipc_handle, _handle->ptr)
> {noformat}


