[jira] [Comment Edited] (ARROW-6015) [Python] pyarrow wheel: `DLL load failed` when importing on windows
[ https://issues.apache.org/jira/browse/ARROW-6015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920036#comment-16920036 ] Kazuaki Ishizaki edited comment on ARROW-6015 at 8/31/19 5:57 AM:
--

I believe I have identified how to fix this issue. Installing {{Microsoft Visual C++ Redistributable for Visual Studio 2015, 2017 and 2019}} from [https://support.microsoft.com/en-us/help/2977003/the-latest-supported-visual-c-downloads] avoids this error.

I think this problem does not occur with conda; it occurs only with pip. The following are my validation steps. We would appreciate it if someone could double-check them.

{code:java}
// Install Windows 10 Enterprise (no additional applications installed)
> mkdir c:\pyarrow
> cd c:\pyarrow
> bitsadmin /TRANSFER htmlget https://www.python.org/ftp/python/3.7.4/python-3.7.4-embed-amd64.zip c:\pyarrow\python-3.7.4-embed-amd64.zip

// Extract all of python-3.7.4-embed-amd64.zip to c:\pyarrow\python-3.7.4-embed-amd64 from Explorer
> cd python-3.7.4-embed-amd64
> notepad python37._pth
...
#import site   <=== remove # in this line
> type python37._pth
python37.zip
.
# Uncomment to run site.main() automatically
import site

> python get-pip.py
...
Successfully installed pip-19.2.3 setuptools-41.2.0 wheel-0.33.6

C:\pyarrow\python-3.7.4-embed-amd64> python -m pip install pyarrow
Collecting pyarrow
  Downloading https://files.pythonhosted.org/packages/97/7c/0ea4554d64c6ed3d6d4f8da492df287d2496adbab2b35c01433cf1344521/pyarrow-0.14.0-cp37-cp37m-win_amd64.whl (17.4MB)
...
Collecting numpy>=1.14 (from pyarrow)
  Downloading https://files.pythonhosted.org/packages/cb/41/05fbf6944b098eb9d53e8a29a9dbfa20a7448f3254fb71499746a29a1b2d/numpy-1.17.1-cp37-cp37m-win_amd64.whl (12.8MB)
...
Collecting six>=1.0.0 (from pyarrow)
  Downloading https://files.pythonhosted.org/packages/73/fb/00a976f728d0d1fecfe898238ce23f502a721c0ac0ecfedb80e0d88c64e9/six-1.12.0-py2.py3-none-any.whl
Installing collected packages: numpy, six, pyarrow
  WARNING: The script f2py.exe is installed in 'C:\pyarrow\python-3.7.4-embed-amd64\Scripts' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
  WARNING: The script plasma_store.exe is installed in 'C:\pyarrow\python-3.7.4-embed-amd64\Scripts' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
Successfully installed numpy-1.17.1 pyarrow-0.14.0 six-1.12.0

> python -c "import pyarrow"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\pyarrow\python-3.7.4-embed-amd64\lib\site-packages\pyarrow\__init__.py", line 49, in <module>
    from pyarrow.lib import cpu_count, set_cpu_count
ImportError: DLL load failed: The specified module could not be found.

> python -m pip freeze
numpy==1.17.1
pyarrow==0.14.0
six==1.12.0

> dir Lib\site-packages\pyarrow
 Volume in drive C is OS
 Volume Serial Number is 1234-5678

 Directory of C:\pyarrow\python-3.7.4-embed-amd64\Lib\site-packages\pyarrow

08/31/2019  05:42 AM    <DIR>          .
08/31/2019  05:42 AM    <DIR>          ..
08/31/2019  05:42 AM            47,658 array.pxi
08/31/2019  05:42 AM         5,748,736 arrow.dll
08/31/2019  05:42 AM         1,653,120 arrow.lib
08/31/2019  05:42 AM         1,795,072 arrow_flight.dll
08/31/2019  05:42 AM           121,062 arrow_flight.lib
08/31/2019  05:42 AM           910,848 arrow_python.dll
08/31/2019  05:42 AM           119,994 arrow_python.lib
08/31/2019  05:42 AM               869 benchmark.pxi
08/31/2019  05:42 AM               895 benchmark.py
08/31/2019  05:42 AM             2,774 builder.pxi
08/31/2019  05:42 AM            81,920 cares.dll
08/31/2019  05:42 AM             3,691 compat.py
08/31/2019  05:42 AM               911 csv.py
08/31/2019  05:42 AM             1,126 cuda.py
08/31/2019  05:42 AM             3,161 error.pxi
08/31/2019  05:42 AM             4,026 feather.pxi
08/31/2019  05:42 AM             7,291 feather.py
08/31/2019  05:42 AM            12,472 filesystem.py
08/31/2019  05:42 AM             1,286 flight.py
08/31/2019  05:42 AM           186,880 gandiva.cp37-win_amd64.pyd
08/31/2019  05:42 AM           791,664 gandiva.cpp
08/31/2019  05:42 AM        22,094,848 gandiva.dll
08/31/2019  05:42 AM           305,626 gandiva.lib
08/31/2019  05:42 AM            16,553 gandiva.pyx
08/31/2019  05:42 AM             7,032 hdfs.py
08/31/2019  05:42 AM    <DIR>          include
08/31/2019  05:42 AM    <DIR>          includes
08/31/2019  05:42 AM            13,995 io-hdfs.pxi
08/31/2019  05:42 AM            48,879 io.pxi
08/31/2019  05:42 AM            15,981 ipc.pxi
08/31/2019  05:42 AM             6,178 ipc.py
08/31/2019  05:42 AM               897 json.py
08/31/2019  05:42 AM             8,623 jvm.py
08/31/2019  05:42 AM         1,553,408 lib.cp37-win_amd64.pyd
08/31/2019  05:42 AM         6,756,155 lib.cpp
08/31/2019  05:42 AM            10,652 lib.pxd
08/31/2019  05:42 AM             3,570 lib.pyx
08/31/2019  05:42 AM         3,243,008 libcrypto-1_1-x64.dll
08/31/2019  05:42 AM         2,613,248 libprotobuf.dll
08/31/2019  05:42 AM           650,240 libssl-1_1-x64.dll
08/31/2019  05:42 AM            13,435 lib_api.h
08/31/2019  05:42 AM             4,724 memory.pxi
08/31/2019  05:42 AM             4,912 orc.py
08/31/2019  05:42 AM             5,789 pandas-shim.pxi
08/31/2019  05:42 AM            33,456 pandas_compat.py
08/31/2019
{code}
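The fix proposed in the comment above is installing the VC++ 2015-2019 redistributable. A minimal Python sketch of a pre-flight check for the runtime libraries that the redistributable provides; the DLL names listed are illustrative assumptions, not an exhaustive list of what the pyarrow wheel links against:

```python
import ctypes.util
import sys

# Core libraries shipped by the VC++ 2015-2019 redistributable
# (assumed names for illustration).
VC_RUNTIME_DLLS = ["msvcp140", "vcruntime140"]

def missing_vc_runtime_dlls(names=VC_RUNTIME_DLLS):
    """Return the runtime DLL names that cannot be located on this system."""
    return [name for name in names if ctypes.util.find_library(name) is None]

if sys.platform == "win32":
    missing = missing_vc_runtime_dlls()
    if missing:
        print("Missing VC++ runtime DLLs:", missing)
        print("Install the Microsoft Visual C++ Redistributable and retry.")
```

On a machine without the redistributable, `import pyarrow` fails with the `DLL load failed` error seen in the transcript; a check like this makes the root cause visible before the import.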
[jira] [Resolved] (ARROW-6099) [JAVA] Has the ability to not using slf4j logging framework
[ https://issues.apache.org/jira/browse/ARROW-6099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield resolved ARROW-6099.
    Resolution: Won't Fix

Closing for now; more discussion on the mailing list might be warranted, and we can reopen.

> [JAVA] Has the ability to not using slf4j logging framework
> -----------------------------------------------------------
>
>                 Key: ARROW-6099
>                 URL: https://issues.apache.org/jira/browse/ARROW-6099
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Java
>    Affects Versions: 0.14.1
>            Reporter: Haowei Yu
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Currently, the Java library calls the slf4j API directly, with no abstraction layer. As a result, users must install slf4j even if it is not used at all.
>
> It would be best to change the slf4j dependency scope to "provided" and log only when an slf4j jar file is available at runtime.

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
[jira] [Commented] (ARROW-6099) [JAVA] Has the ability to not using slf4j logging framework
[ https://issues.apache.org/jira/browse/ARROW-6099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920034#comment-16920034 ] Micah Kornfield commented on ARROW-6099:

See the discussion on the PR; [~jacq...@dremio.com] vetoed the patch.

> [JAVA] Has the ability to not using slf4j logging framework
> -----------------------------------------------------------
>
>                 Key: ARROW-6099
>                 URL: https://issues.apache.org/jira/browse/ARROW-6099
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Java
>    Affects Versions: 0.14.1
>            Reporter: Haowei Yu
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Currently, the Java library calls the slf4j API directly, with no abstraction layer. As a result, users must install slf4j even if it is not used at all.
>
> It would be best to change the slf4j dependency scope to "provided" and log only when an slf4j jar file is available at runtime.

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
[jira] [Resolved] (ARROW-6247) [Java] Provide a common interface for float4 and float8 vectors
[ https://issues.apache.org/jira/browse/ARROW-6247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield resolved ARROW-6247.
    Fix Version/s: 0.15.0
       Resolution: Fixed

Issue resolved by pull request 5132
[https://github.com/apache/arrow/pull/5132]

> [Java] Provide a common interface for float4 and float8 vectors
> ---------------------------------------------------------------
>
>                 Key: ARROW-6247
>                 URL: https://issues.apache.org/jira/browse/ARROW-6247
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Java
>            Reporter: Liya Fan
>            Assignee: Liya Fan
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.15.0
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> We want to provide an interface for floating point vectors (float4 & float8). This interface will make many operations on a vector convenient. With it, client code is greatly simplified, with many branches/switches removed.
>
> The design is similar to BaseIntVector (the interface for all integer vectors). We provide 3 methods for setting & getting floating point values:
> setWithPossibleTruncate
> setSafeWithPossibleTruncate
> getValueAsDouble

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
[jira] [Resolved] (ARROW-6031) [Java] Support iterating a vector by ArrowBufPointer
[ https://issues.apache.org/jira/browse/ARROW-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield resolved ARROW-6031.
    Fix Version/s: 0.15.0
       Resolution: Fixed

Issue resolved by pull request 4950
[https://github.com/apache/arrow/pull/4950]

> [Java] Support iterating a vector by ArrowBufPointer
> ----------------------------------------------------
>
>                 Key: ARROW-6031
>                 URL: https://issues.apache.org/jira/browse/ARROW-6031
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Java
>            Reporter: Liya Fan
>            Assignee: Liya Fan
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.15.0
>          Time Spent: 4h 50m
>  Remaining Estimate: 0h
>
> Provide the functionality to traverse a vector (fixed-width vector & variable-width vector) by an iterator. This is convenient for scenarios where vector elements are accessed in sequence.

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
[jira] [Resolved] (ARROW-6397) [C++][CI] Fix S3 minio failure
[ https://issues.apache.org/jira/browse/ARROW-6397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sutou Kouhei resolved ARROW-6397.
    Fix Version/s: 0.15.0
       Resolution: Fixed

Issue resolved by pull request 5238
[https://github.com/apache/arrow/pull/5238]

> [C++][CI] Fix S3 minio failure
> ------------------------------
>
>                 Key: ARROW-6397
>                 URL: https://issues.apache.org/jira/browse/ARROW-6397
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++, Continuous Integration
>            Reporter: Francois Saint-Jacques
>            Assignee: Francois Saint-Jacques
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.15.0
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> See [https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/27065941/job/gwjmr2hudm7693ef]

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
[jira] [Resolved] (ARROW-4095) [C++] Implement optimizations for dictionary unification where dictionaries are prefixes of the unified dictionary
[ https://issues.apache.org/jira/browse/ARROW-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield resolved ARROW-4095.
    Resolution: Fixed

Issue resolved by pull request 5230
[https://github.com/apache/arrow/pull/5230]

> [C++] Implement optimizations for dictionary unification where dictionaries are prefixes of the unified dictionary
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-4095
>                 URL: https://issues.apache.org/jira/browse/ARROW-4095
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Wes McKinney
>            Assignee: Antoine Pitrou
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.15.0
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> In the event that the unified dictionary contains other dictionaries as prefixes (e.g. as the result of delta dictionaries), we can avoid memory allocation and index transposition.
>
> See discussion at https://github.com/apache/arrow/pull/3165#discussion_r243020982

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
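The optimization in this issue hinges on recognizing when an existing dictionary is a prefix of the unified dictionary, because then its indices are already valid and no transposition buffer is needed. A minimal Python sketch of that check (function name and list representation are illustrative, not the C++ implementation):

```python
def is_prefix(dictionary, unified):
    """True if `dictionary` equals the first len(dictionary) entries of `unified`."""
    return len(dictionary) <= len(unified) and unified[:len(dictionary)] == dictionary

# When the check passes, an index i into `dictionary` refers to the same value
# as index i into `unified`, so indices can be reused without remapping.
```

For example, a delta dictionary ["a", "b"] is a prefix of the unified dictionary ["a", "b", "c"], so its encoded arrays need no index rewriting.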
[jira] [Updated] (ARROW-6220) [Java] Add API to avro adapter to limit number of rows returned at a time.
[ https://issues.apache.org/jira/browse/ARROW-6220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6220:
    Labels: avro pull-request-available (was: avro)

> [Java] Add API to avro adapter to limit number of rows returned at a time.
> --------------------------------------------------------------------------
>
>                 Key: ARROW-6220
>                 URL: https://issues.apache.org/jira/browse/ARROW-6220
>             Project: Apache Arrow
>          Issue Type: Sub-task
>          Components: Java
>            Reporter: Micah Kornfield
>            Assignee: Ji Liu
>            Priority: Major
>              Labels: avro, pull-request-available
>
> We can either let clients iterate or, ideally, provide an iterator interface. This is important for large Avro data and was also discussed as something readers/adapters should have.

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
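The API requested above bounds how many rows are materialized per call. The iterator pattern it describes can be sketched in a few lines of Python (this illustrates the pattern only; it is not the Arrow Java adapter's API):

```python
def iter_batches(records, batch_size):
    """Yield lists of at most batch_size records from any iterable,
    so callers never hold the whole dataset in memory at once."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch
```

A consumer then drives the read loop itself, e.g. `for batch in iter_batches(reader, 1024): process(batch)`, which is the shape the issue asks the avro adapter to expose.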
[jira] [Commented] (ARROW-6119) [Python] PyArrow wheel import fails on Windows Python 3.7
[ https://issues.apache.org/jira/browse/ARROW-6119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920012#comment-16920012 ] Kazuaki Ishizaki commented on ARROW-6119:

In my environment, I can reproduce this error using 0.13.0 with embeddable Python on a freshly installed Windows 10 (i.e., with no applications installed). Does anyone see the failure in 0.13.0? On the other hand, I can successfully import pyarrow 0.14.1 in Miniconda via {{conda install}}.

> [Python] PyArrow wheel import fails on Windows Python 3.7
> ---------------------------------------------------------
>
>                 Key: ARROW-6119
>                 URL: https://issues.apache.org/jira/browse/ARROW-6119
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.14.0
>         Environment: Windows, Python 3.7
>            Reporter: Paul Suganthan
>            Priority: Major
>              Labels: wheel
>             Fix For: 0.15.0
>
> Traceback (most recent call last):
>   File "<string>", line 1, in <module>
>   File "C:\Python37\lib\site-packages\pyarrow\__init__.py", line 49, in <module>
>     from pyarrow.lib import cpu_count, set_cpu_count
> ImportError: DLL load failed: The specified procedure could not be found.

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
[jira] [Updated] (ARROW-6402) [C++] Suppress sign-compare warning with g++ 9.2.1
[ https://issues.apache.org/jira/browse/ARROW-6402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6402:
    Labels: pull-request-available (was: )

> [C++] Suppress sign-compare warning with g++ 9.2.1
> --------------------------------------------------
>
>                 Key: ARROW-6402
>                 URL: https://issues.apache.org/jira/browse/ARROW-6402
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Sutou Kouhei
>            Assignee: Sutou Kouhei
>            Priority: Major
>              Labels: pull-request-available
>
> {noformat}
> ../src/arrow/array/builder_union.cc: In constructor 'arrow::BasicUnionBuilder::BasicUnionBuilder(arrow::MemoryPool*, arrow::UnionMode::type, const std::vector >&, const std::shared_ptr&)':
> ../src/arrow/util/logging.h:86:55: error: comparison of integer expressions of different signedness: 'std::vector >::size_type' {aka 'long unsigned int'} and 'signed char' [-Werror=sign-compare]
>    86 | #define ARROW_CHECK_LT(val1, val2) ARROW_CHECK((val1) < (val2))
>       |                                                ~~~^~~~
> ../src/arrow/util/macros.h:43:52: note: in definition of macro 'ARROW_PREDICT_TRUE'
>    43 | #define ARROW_PREDICT_TRUE(x) (__builtin_expect(!!(x), 1))
>       |                                                    ^
> ../src/arrow/util/logging.h:86:36: note: in expansion of macro 'ARROW_CHECK'
>    86 | #define ARROW_CHECK_LT(val1, val2) ARROW_CHECK((val1) < (val2))
>       |                                    ^~~
> ../src/arrow/util/logging.h:135:19: note: in expansion of macro 'ARROW_CHECK_LT'
>   135 | #define DCHECK_LT ARROW_CHECK_LT
>       |                   ^~
> ../src/arrow/array/builder_union.cc:79:3: note: in expansion of macro 'DCHECK_LT'
>    79 |   DCHECK_LT(type_id_to_children_.size(), std::numeric_limits::max());
>       |   ^
> {noformat}

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
[jira] [Comment Edited] (ARROW-6119) [Python] PyArrow wheel import fails on Windows Python 3.7
[ https://issues.apache.org/jira/browse/ARROW-6119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919906#comment-16919906 ] Kazuaki Ishizaki edited comment on ARROW-6119 at 8/31/19 4:08 AM:
--

I can reproduce this error using 0.14.0 and 0.14.1 through pip with embeddable Python on a freshly installed Windows 10 (i.e., with no applications installed). I will try it with conda tomorrow.

was (Author: kiszk):
I can reproduce this error using 0.14.0 and 0.14.1 with embeddable Python on a freshly installed Windows 10 (i.e., with no applications installed). I will try it with conda tomorrow.

> [Python] PyArrow wheel import fails on Windows Python 3.7
> ---------------------------------------------------------
>
>                 Key: ARROW-6119
>                 URL: https://issues.apache.org/jira/browse/ARROW-6119
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.14.0
>         Environment: Windows, Python 3.7
>            Reporter: Paul Suganthan
>            Priority: Major
>              Labels: wheel
>             Fix For: 0.15.0
>
> Traceback (most recent call last):
>   File "<string>", line 1, in <module>
>   File "C:\Python37\lib\site-packages\pyarrow\__init__.py", line 49, in <module>
>     from pyarrow.lib import cpu_count, set_cpu_count
> ImportError: DLL load failed: The specified procedure could not be found.

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
[jira] [Created] (ARROW-6402) [C++] Suppress sign-compare warning with g++ 9.2.1
Sutou Kouhei created ARROW-6402:
---

             Summary: [C++] Suppress sign-compare warning with g++ 9.2.1
                 Key: ARROW-6402
                 URL: https://issues.apache.org/jira/browse/ARROW-6402
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Sutou Kouhei
            Assignee: Sutou Kouhei

{noformat}
../src/arrow/array/builder_union.cc: In constructor 'arrow::BasicUnionBuilder::BasicUnionBuilder(arrow::MemoryPool*, arrow::UnionMode::type, const std::vector >&, const std::shared_ptr&)':
../src/arrow/util/logging.h:86:55: error: comparison of integer expressions of different signedness: 'std::vector >::size_type' {aka 'long unsigned int'} and 'signed char' [-Werror=sign-compare]
   86 | #define ARROW_CHECK_LT(val1, val2) ARROW_CHECK((val1) < (val2))
      |                                                ~~~^~~~
../src/arrow/util/macros.h:43:52: note: in definition of macro 'ARROW_PREDICT_TRUE'
   43 | #define ARROW_PREDICT_TRUE(x) (__builtin_expect(!!(x), 1))
      |                                                    ^
../src/arrow/util/logging.h:86:36: note: in expansion of macro 'ARROW_CHECK'
   86 | #define ARROW_CHECK_LT(val1, val2) ARROW_CHECK((val1) < (val2))
      |                                    ^~~
../src/arrow/util/logging.h:135:19: note: in expansion of macro 'ARROW_CHECK_LT'
  135 | #define DCHECK_LT ARROW_CHECK_LT
      |                   ^~
../src/arrow/array/builder_union.cc:79:3: note: in expansion of macro 'DCHECK_LT'
   79 |   DCHECK_LT(type_id_to_children_.size(), std::numeric_limits::max());
      |   ^
{noformat}

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
[jira] [Resolved] (ARROW-6265) [Java] Avro adapter implement Array/Map/Fixed type
[ https://issues.apache.org/jira/browse/ARROW-6265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield resolved ARROW-6265.
    Fix Version/s: 0.15.0
       Resolution: Fixed

Issue resolved by pull request 5099
[https://github.com/apache/arrow/pull/5099]

> [Java] Avro adapter implement Array/Map/Fixed type
> --------------------------------------------------
>
>                 Key: ARROW-6265
>                 URL: https://issues.apache.org/jira/browse/ARROW-6265
>             Project: Apache Arrow
>          Issue Type: Sub-task
>          Components: Java
>            Reporter: Ji Liu
>            Assignee: Ji Liu
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 0.15.0
>          Time Spent: 12h 20m
>  Remaining Estimate: 0h
>
> Support Array/Map/Fixed type in avro adapter.

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
[jira] [Resolved] (ARROW-2769) [C++][Python] Deprecate and rename add_metadata methods
[ https://issues.apache.org/jira/browse/ARROW-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sutou Kouhei resolved ARROW-2769.
    Resolution: Fixed

Issue resolved by pull request 5232
[https://github.com/apache/arrow/pull/5232]

> [C++][Python] Deprecate and rename add_metadata methods
> -------------------------------------------------------
>
>                 Key: ARROW-2769
>                 URL: https://issues.apache.org/jira/browse/ARROW-2769
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Python
>            Reporter: Krisztian Szucs
>            Assignee: Krisztian Szucs
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 0.15.0
>          Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> Deprecate and replace `pyarrow.Field.add_metadata` (and other similarly named methods) with replace_metadata, set_metadata, or with_metadata. Knowing Spark's immutable API, I would have chosen with_metadata, but I guess this is probably not what the average Python user would expect as naming.

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
[jira] [Updated] (ARROW-2769) [C++][Python] Deprecate and rename add_metadata methods
[ https://issues.apache.org/jira/browse/ARROW-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sutou Kouhei updated ARROW-2769:
    Component/s: C++

> [C++][Python] Deprecate and rename add_metadata methods
> -------------------------------------------------------
>
>                 Key: ARROW-2769
>                 URL: https://issues.apache.org/jira/browse/ARROW-2769
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Python
>            Reporter: Krisztian Szucs
>            Assignee: Krisztian Szucs
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 0.15.0
>          Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> Deprecate and replace `pyarrow.Field.add_metadata` (and other similarly named methods) with replace_metadata, set_metadata, or with_metadata. Knowing Spark's immutable API, I would have chosen with_metadata, but I guess this is probably not what the average Python user would expect as naming.

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
[jira] [Updated] (ARROW-2769) [C++][Python] Deprecate and rename add_metadata methods
[ https://issues.apache.org/jira/browse/ARROW-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sutou Kouhei updated ARROW-2769:
    Summary: [C++][Python] Deprecate and rename add_metadata methods (was: [Python] Deprecate and rename add_metadata methods)

> [C++][Python] Deprecate and rename add_metadata methods
> -------------------------------------------------------
>
>                 Key: ARROW-2769
>                 URL: https://issues.apache.org/jira/browse/ARROW-2769
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Krisztian Szucs
>            Assignee: Krisztian Szucs
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 0.15.0
>          Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> Deprecate and replace `pyarrow.Field.add_metadata` (and other similarly named methods) with replace_metadata, set_metadata, or with_metadata. Knowing Spark's immutable API, I would have chosen with_metadata, but I guess this is probably not what the average Python user would expect as naming.

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
[jira] [Resolved] (ARROW-6094) [Format][Flight] Add GetFlightSchema to Flight RPC
[ https://issues.apache.org/jira/browse/ARROW-6094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield resolved ARROW-6094.
    Resolution: Fixed

Issue resolved by pull request 4980
[https://github.com/apache/arrow/pull/4980]

> [Format][Flight] Add GetFlightSchema to Flight RPC
> --------------------------------------------------
>
>                 Key: ARROW-6094
>                 URL: https://issues.apache.org/jira/browse/ARROW-6094
>             Project: Apache Arrow
>          Issue Type: Task
>          Components: C++, FlightRPC, Java, Python
>            Reporter: Ryan Murray
>            Assignee: Ryan Murray
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.15.0
>          Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> Implement GetFlightSchema as per
> https://docs.google.com/document/d/1zLdFYikk3owbKpHvJrARLMlmYpi-Ef6OJy7H90MqViA/edit?usp=sharing
> and
> https://lists.apache.org/thread.html/3539984493cf3d4d439bef25c150fa9e09e0b43ce0afb6be378d41df@%3Cdev.arrow.apache.org%3E

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
[jira] [Assigned] (ARROW-4668) [C++] Support GCP BigQuery Storage API
[ https://issues.apache.org/jira/browse/ARROW-4668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield reassigned ARROW-4668:
    Assignee: (was: Micah Kornfield)

> [C++] Support GCP BigQuery Storage API
> --------------------------------------
>
>                 Key: ARROW-4668
>                 URL: https://issues.apache.org/jira/browse/ARROW-4668
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Micah Kornfield
>            Priority: Major
>              Labels: filesystem
>             Fix For: 1.0.0
>
> Docs: [https://cloud.google.com/bigquery/docs/reference/storage/]
>
> Need to investigate the best way to do this; maybe just see if we can build our client on GCP (once a protobuf definition is published to [https://github.com/googleapis/googleapis/tree/master/google])?
>
> This will serve as a parent issue, and sub-issues will be added for subtasks if necessary.

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
[jira] [Commented] (ARROW-4668) [C++] Support GCP BigQuery Storage API
[ https://issues.apache.org/jira/browse/ARROW-4668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1691#comment-1691 ] Micah Kornfield commented on ARROW-4668:

Wes is correct. I'll also add that either this (or even a higher-level wrapper around BQ) or Flight would make a good test case for the DataSet APIs, to make sure they are generic enough. I won't be getting to this anytime soon, so I'm going to unassign it from myself. I have some sample code on my work computer that I will also try to share, to show how the API can be accessed in a simple scenario.

> [C++] Support GCP BigQuery Storage API
> --------------------------------------
>
>                 Key: ARROW-4668
>                 URL: https://issues.apache.org/jira/browse/ARROW-4668
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Micah Kornfield
>            Assignee: Micah Kornfield
>            Priority: Major
>              Labels: filesystem
>             Fix For: 1.0.0
>
> Docs: [https://cloud.google.com/bigquery/docs/reference/storage/]
>
> Need to investigate the best way to do this; maybe just see if we can build our client on GCP (once a protobuf definition is published to [https://github.com/googleapis/googleapis/tree/master/google])?
>
> This will serve as a parent issue, and sub-issues will be added for subtasks if necessary.

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
[jira] [Updated] (ARROW-6401) [Java] Implement dictionary-encoded subfields for Struct type
[ https://issues.apache.org/jira/browse/ARROW-6401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6401:
    Labels: pull-request-available (was: )

> [Java] Implement dictionary-encoded subfields for Struct type
> -------------------------------------------------------------
>
>                 Key: ARROW-6401
>                 URL: https://issues.apache.org/jira/browse/ARROW-6401
>             Project: Apache Arrow
>          Issue Type: Sub-task
>          Components: Java
>            Reporter: Ji Liu
>            Assignee: Ji Liu
>            Priority: Major
>              Labels: pull-request-available
>
> Implement dictionary-encoded subfields for Struct type. Each child vector will have a dictionary; the dictionary vector is struct type and holds all dictionaries.

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
[jira] [Created] (ARROW-6401) [Java] Implement dictionary-encoded subfields for Struct type
Ji Liu created ARROW-6401:
---

             Summary: [Java] Implement dictionary-encoded subfields for Struct type
                 Key: ARROW-6401
                 URL: https://issues.apache.org/jira/browse/ARROW-6401
             Project: Apache Arrow
          Issue Type: Sub-task
          Components: Java
            Reporter: Ji Liu
            Assignee: Ji Liu

Implement dictionary-encoded subfields for Struct type. Each child vector will have a dictionary; the dictionary vector is struct type and holds all dictionaries.

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
[jira] [Resolved] (ARROW-6078) [Java] Implement dictionary-encoded subfields for List type
[ https://issues.apache.org/jira/browse/ARROW-6078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield resolved ARROW-6078.
    Fix Version/s: 0.15.0
       Resolution: Fixed

Issue resolved by pull request 4972
[https://github.com/apache/arrow/pull/4972]

> [Java] Implement dictionary-encoded subfields for List type
> -----------------------------------------------------------
>
>                 Key: ARROW-6078
>                 URL: https://issues.apache.org/jira/browse/ARROW-6078
>             Project: Apache Arrow
>          Issue Type: Sub-task
>          Components: Java
>            Reporter: Ji Liu
>            Assignee: Ji Liu
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 0.15.0
>          Time Spent: 10.5h
>  Remaining Estimate: 0h
>
> For example, an int-type List vector (valueCount = 5) with data like:
> 10, 20
> 10, 20
> 30, 40, 50
> 30, 40, 50
> 10, 20
> could be encoded to:
> 0, 1
> 0, 1
> 2, 3, 4
> 2, 3, 4
> 0, 1
> with a list-type dictionary
> 10, 20, 30, 40, 50
> or
> 10,
> 20,
> 30,
> 40,
> 50

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
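The encoding described in ARROW-6078 replaces each list element with its index in a shared dictionary of distinct values. A pure-Python sketch that reproduces the issue's worked example (this illustrates the transformation only, not the Java implementation):

```python
def dictionary_encode_lists(lists):
    """Encode the elements of each sub-list against one shared dictionary,
    returning (encoded_lists, dictionary)."""
    dictionary, index_of, encoded = [], {}, []
    for sub in lists:
        row = []
        for value in sub:
            if value not in index_of:          # first time we see this value:
                index_of[value] = len(dictionary)  # assign the next index
                dictionary.append(value)
            row.append(index_of[value])
        encoded.append(row)
    return encoded, dictionary

data = [[10, 20], [10, 20], [30, 40, 50], [30, 40, 50], [10, 20]]
encoded, dictionary = dictionary_encode_lists(data)
# encoded    -> [[0, 1], [0, 1], [2, 3, 4], [2, 3, 4], [0, 1]]
# dictionary -> [10, 20, 30, 40, 50]
```

The result matches the issue's example: the repeated sub-lists compress to small index lists, and the dictionary holds each distinct value once.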
[jira] [Commented] (ARROW-5922) [Python] Unable to connect to HDFS from a worker/data node on a Kerberized cluster using pyarrow' hdfs API
[ https://issues.apache.org/jira/browse/ARROW-5922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919936#comment-16919936 ] Saurabh Bajaj commented on ARROW-5922: -- Try setting the environment variable ARROW_LIBHDFS_DIR to the explicit location of libhdfs.so at the worker nodes. That's what worked for me. > [Python] Unable to connect to HDFS from a worker/data node on a Kerberized > cluster using pyarrow' hdfs API > -- > > Key: ARROW-5922 > URL: https://issues.apache.org/jira/browse/ARROW-5922 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.14.0 > Environment: Unix >Reporter: Saurabh Bajaj >Priority: Major > Fix For: 0.14.0 > > > Here's what I'm trying: > {{```}} > {{import pyarrow as pa }} > {{conf = \{"hadoop.security.authentication": "kerberos"} }} > {{fs = pa.hdfs.connect(kerb_ticket="/tmp/krb5cc_4", extra_conf=conf)}} > {{```}} > However, when I submit this job to the cluster using {{Dask-YARN}}, I get the > following error: > ``` > {{File "test/run.py", line 3 fs = > pa.hdfs.connect(kerb_ticket="/tmp/krb5cc_4", extra_conf=conf) File > "/opt/hadoop/data/10/hadoop/yarn/local/usercache/hdfsf6/appcache/application_1560931326013_183242/container_e47_1560931326013_183242_01_03/environment/lib/python3.7/site-packages/pyarrow/hdfs.py", > line 211, in connect File > "/opt/hadoop/data/10/hadoop/yarn/local/usercache/hdfsf6/appcache/application_1560931326013_183242/container_e47_1560931326013_183242_01_03/environment/lib/python3.7/site-packages/pyarrow/hdfs.py", > line 38, in __init__ File "pyarrow/io-hdfs.pxi", line 105, in > pyarrow.lib.HadoopFileSystem._connect File "pyarrow/error.pxi", line 83, in > pyarrow.lib.check_status pyarrow.lib.ArrowIOError: HDFS connection failed}} > {{```}} > I also tried setting {{host (to a name node)}} and {{port (=8020)}}, however > I run into the same error. Since the error is not descriptive, I'm not sure > which setting needs to be altered. Any clues anyone? 
-- This message was sent by Atlassian Jira (v8.3.2#803003)
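The workaround suggested in the comment above can be sketched as a small helper (the directory passed in the usage example is a placeholder; the correct location of libhdfs.so is environment-specific and must be set on each worker node before connecting):

```python
import os

def configure_libhdfs(libhdfs_dir):
    """Point pyarrow at an explicit libhdfs.so location.

    Must run before pa.hdfs.connect() so pyarrow picks up the
    ARROW_LIBHDFS_DIR environment variable instead of relying on
    its default library search.
    """
    os.environ["ARROW_LIBHDFS_DIR"] = libhdfs_dir
    return os.environ["ARROW_LIBHDFS_DIR"]
```

Usage (hypothetical path): `configure_libhdfs("/usr/hdp/current/hadoop-client/lib/native")` before calling `pa.hdfs.connect(...)`.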
[jira] [Commented] (ARROW-6119) [Python] PyArrow wheel import fails on Windows Python 3.7
[ https://issues.apache.org/jira/browse/ARROW-6119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919906#comment-16919906 ] Kazuaki Ishizaki commented on ARROW-6119: - I can reproduce this error using 0.14.0 and 0.14.1 with embeddable Python on a freshly installed Windows 10 (i.e. with no additional applications installed). I will try it with conda tomorrow. > [Python] PyArrow wheel import fails on Windows Python 3.7 > - > > Key: ARROW-6119 > URL: https://issues.apache.org/jira/browse/ARROW-6119 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.14.0 > Environment: Windows, Python 3.7 >Reporter: Paul Suganthan >Priority: Major > Labels: wheel > Fix For: 0.15.0 > > > Traceback (most recent call last): > File "", line 1, in > File "C:\Python37\lib\site-packages\pyarrow\__init__.py", line 49, in > > from pyarrow.lib import cpu_count, set_cpu_count > ImportError: DLL load failed: The specified procedure could not be found. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (ARROW-6400) Arrow Java Library Build Error
Tanveer created ARROW-6400: -- Summary: Arrow Java Library Build Error Key: ARROW-6400 URL: https://issues.apache.org/jira/browse/ARROW-6400 Project: Apache Arrow Issue Type: Bug Components: Java Affects Versions: 0.14.1 Reporter: Tanveer Attachments: Screenshot from 2019-08-30 23-16-25.png, Screenshot from 2019-08-30 23-44-34.png The Arrow Java library fails to build on both the 'master' and 'maint-0.14.x' branches. Please see the attachments. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Assigned] (ARROW-6310) [C++] Write 64-bit integers as strings in JSON integration test files
[ https://issues.apache.org/jira/browse/ARROW-6310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Kietzman reassigned ARROW-6310: Assignee: Benjamin Kietzman > [C++] Write 64-bit integers as strings in JSON integration test files > - > > Key: ARROW-6310 > URL: https://issues.apache.org/jira/browse/ARROW-6310 > Project: Apache Arrow > Issue Type: Sub-task > Components: C++ >Reporter: Wes McKinney >Assignee: Benjamin Kietzman >Priority: Major > Fix For: 0.15.0 > > > C++ side of ARROW-1875 -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Updated] (ARROW-6392) [Python][Flight] list_actions Server RPC is not tested in test_flight.py, nor is return value validated
[ https://issues.apache.org/jira/browse/ARROW-6392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6392: -- Labels: pull-request-available (was: ) > [Python][Flight] list_actions Server RPC is not tested in test_flight.py, nor > is return value validated > --- > > Key: ARROW-6392 > URL: https://issues.apache.org/jira/browse/ARROW-6392 > Project: Apache Arrow > Issue Type: Bug > Components: FlightRPC, Python >Reporter: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > > This server method is implemented and part of the Python server vtable, but > it is not tested. If you mistakenly return a "string" action type, it will > pass silently. We might want to constrain the output to be ActionType or a > tuple -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (ARROW-6015) [Python] pyarrow wheel: `DLL load failed` when importing on windows
[ https://issues.apache.org/jira/browse/ARROW-6015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919876#comment-16919876 ] Kazuaki Ishizaki commented on ARROW-6015: - I see. Thank you for your quick response. It looks more complex than I thought. Have we already identified which libraries are missing when this failure occurs, or is that still unknown? > [Python] pyarrow wheel: `DLL load failed` when importing on windows > > > Key: ARROW-6015 > URL: https://issues.apache.org/jira/browse/ARROW-6015 > Project: Apache Arrow > Issue Type: Bug > Components: Packaging, Python >Affects Versions: 0.14.1 >Reporter: Ruslan Kuprieiev >Priority: Major > Labels: wheel > Fix For: 0.15.0 > > > When installing pyarrow 0.14.1 on windows 10 x64 with python 3.7, you get: > >>> import pyarrow > Traceback (most recent call last): > File "", line 1, in > File "C:\Python37\lib\site-packages\pyarrow\__init__.py", line 49, in > > from pyarrow.lib import cpu_count, set_cpu_count > ImportError: DLL load failed: The specified module could not be found. > On 0.14.0 everything works fine. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (ARROW-6015) [Python] pyarrow wheel: `DLL load failed` when importing on windows
[ https://issues.apache.org/jira/browse/ARROW-6015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919847#comment-16919847 ] Antoine Pitrou commented on ARROW-6015: --- This kind of issue depends on which DLLs are already installed on your system. So if the wheel is missing some libraries (e.g. compression libraries such as zstd or brotli) but you already have them on your system, the wheel will work fine for you. This is also what makes it difficult to ensure that Windows wheels are correctly generated... > [Python] pyarrow wheel: `DLL load failed` when importing on windows > > > Key: ARROW-6015 > URL: https://issues.apache.org/jira/browse/ARROW-6015 > Project: Apache Arrow > Issue Type: Bug > Components: Packaging, Python >Affects Versions: 0.14.1 >Reporter: Ruslan Kuprieiev >Priority: Major > Labels: wheel > Fix For: 0.15.0 > > > When installing pyarrow 0.14.1 on windows 10 x64 with python 3.7, you get: > >>> import pyarrow > Traceback (most recent call last): > File "", line 1, in > File "C:\Python37\lib\site-packages\pyarrow\__init__.py", line 49, in > > from pyarrow.lib import cpu_count, set_cpu_count > ImportError: DLL load failed: The specified module could not be found. > On 0.14.0 everything works fine. -- This message was sent by Atlassian Jira (v8.3.2#803003)
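One rough way to investigate which shared libraries a system can already resolve, along the lines of the comment above, is to probe the loader search path (a best-effort sketch; the library names passed in are illustrative, and exact DLL search behavior differs per platform):

```python
import ctypes.util

def missing_libraries(names):
    """Return the subset of library names the system loader cannot find.

    ctypes.util.find_library searches the platform's loader paths
    (on Windows this includes PATH, which is where a wheel's DLL
    dependencies must be resolvable).
    """
    return [n for n in names if ctypes.util.find_library(n) is None]
```

For example, `missing_libraries(["zstd", "brotlidec"])` would hint at which compression libraries are absent from the system.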
[jira] [Commented] (ARROW-6015) [Python] pyarrow wheel: `DLL load failed` when importing on windows
[ https://issues.apache.org/jira/browse/ARROW-6015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919843#comment-16919843 ] Kazuaki Ishizaki commented on ARROW-6015: - I cannot reproduce this issue on my Windows 10 environment using two Python installations (conda and the embeddable distribution) with [this whl|https://github.com/ursa-labs/crossbow/releases/download/build-669-appveyor-wheel-win-cp37m/pyarrow-0.14.1-cp37-cp37m-win_amd64.whl]. Am I missing something needed to reproduce this failure?
{code:java}
$ wget https://www.python.org/ftp/python/3.7.4/python-3.7.4-embed-amd64.zip
$ unzip python-3.7.4-embed-amd64.zip
$ cd python-3.7.4-embed-amd64
$ wget https://bootstrap.pypa.io/get-pip.py
$ python get-pip.py
$ wget pyarrow-0.14.1-cp37-cp37m-win_amd64.whl
$ python -m pip install pyarrow-0.14.1-cp37-cp37m-win_amd64.whl
...
Successfully installed numpy-1.17.1 pyarrow-0.14.1 six-1.12.0
$ python
Python 3.7.4 (tags/v3.7.4:e09359112e, Jul 8 2019, 20:34:20) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow
>>> print(pyarrow.cpu_count())
4
>>>
{code}
{code:java}
$ activate arrow-dev
$ wget pyarrow-0.14.1-cp37-cp37m-win_amd64.whl
$ pip install pyarrow-0.14.1-cp37-cp37m-win_amd64.whl
...
Installing collected packages: pyarrow
Successfully installed pyarrow-0.14.1
> python
Python 3.7.3 | packaged by conda-forge | (default, Jul 1 2019, 22:01:29) [MSC v.1900 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow
>>> print(pyarrow.cpu_count())
4
>>>
{code}
> [Python] pyarrow wheel: `DLL load failed` when importing on windows > > > Key: ARROW-6015 > URL: https://issues.apache.org/jira/browse/ARROW-6015 > Project: Apache Arrow > Issue Type: Bug > Components: Packaging, Python >Affects Versions: 0.14.1 >Reporter: Ruslan Kuprieiev >Priority: Major > Labels: wheel > Fix For: 0.15.0 > > > When installing pyarrow 0.14.1 on windows 10 x64 with python 3.7, you get: > >>> import pyarrow > Traceback (most recent call last): > File "", line 1, in > File "C:\Python37\lib\site-packages\pyarrow\__init__.py", line 49, in > > from pyarrow.lib import cpu_count, set_cpu_count > ImportError: DLL load failed: The specified module could not be found. > On 0.14.0 everything works fine. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (ARROW-6399) [C++] More extensive attributes usage could improve debugging
Benjamin Kietzman created ARROW-6399: Summary: [C++] More extensive attributes usage could improve debugging Key: ARROW-6399 URL: https://issues.apache.org/jira/browse/ARROW-6399 Project: Apache Arrow Issue Type: Improvement Reporter: Benjamin Kietzman Wrapping raw or smart pointer parameters and other declarations with {{gsl::not_null}} will assert that they are not null; the check is dropped in release builds. Status is tagged with ARROW_MUST_USE_RESULT, which emits warnings when compiling with clang if a Status might be ignored; Result<> should probably be tagged with this too. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Assigned] (ARROW-5300) [C++] 0.13 FAILED to build with option -DARROW_NO_DEFAULT_MEMORY_POOL
[ https://issues.apache.org/jira/browse/ARROW-5300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques reassigned ARROW-5300: - Assignee: Francois Saint-Jacques > [C++] 0.13 FAILED to build with option -DARROW_NO_DEFAULT_MEMORY_POOL > - > > Key: ARROW-5300 > URL: https://issues.apache.org/jira/browse/ARROW-5300 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.13.0 >Reporter: Weihua Jiang >Assignee: Francois Saint-Jacques >Priority: Major > Fix For: 0.15.0 > > > I tried to upgrade Apache Arrow to 0.13. But, when building Apache Arrow 0.13 > with option {{-DARROW_NO_DEFAULT_MEMORY_POOL}}, I got a lot of failures. > It seems 0.13 assumes the default memory pool is always available. > > My cmake command is: > |{{cmake .. -DCMAKE_BUILD_TYPE=Release -DARROW_BUILD_TESTS=off > -DARROW_USE_GLOG=off -DARROW_WITH_LZ4=off -DARROW_WITH_ZSTD=off > -DARROW_WITH_SNAPPY=off -DARROW_WITH_BROTLI=off -DARROW_WITH_ZLIB=off > -DARROW_JEMALLOC=off -DARROW_CXXFLAGS=-DARROW_NO_DEFAULT_MEMORY_POOL}}| > I tried to fix the compilation by adding some missing constructors. However, > it seems this issue is bigger than I expected. It seems all the builders and > appenders have this issue, as many classes don't even have an associated > memory pool. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Updated] (ARROW-5300) [C++] 0.13 FAILED to build with option -DARROW_NO_DEFAULT_MEMORY_POOL
[ https://issues.apache.org/jira/browse/ARROW-5300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-5300: -- Labels: pull-request-available (was: ) > [C++] 0.13 FAILED to build with option -DARROW_NO_DEFAULT_MEMORY_POOL > - > > Key: ARROW-5300 > URL: https://issues.apache.org/jira/browse/ARROW-5300 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.13.0 >Reporter: Weihua Jiang >Assignee: Francois Saint-Jacques >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > > I tried to upgrade Apache Arrow to 0.13. But, when building Apache Arrow 0.13 > with option {{-DARROW_NO_DEFAULT_MEMORY_POOL}}, I got a lot of failures. > It seems 0.13 assumes the default memory pool is always available. > > My cmake command is: > |{{cmake .. -DCMAKE_BUILD_TYPE=Release -DARROW_BUILD_TESTS=off > -DARROW_USE_GLOG=off -DARROW_WITH_LZ4=off -DARROW_WITH_ZSTD=off > -DARROW_WITH_SNAPPY=off -DARROW_WITH_BROTLI=off -DARROW_WITH_ZLIB=off > -DARROW_JEMALLOC=off -DARROW_CXXFLAGS=-DARROW_NO_DEFAULT_MEMORY_POOL}}| > I tried to fix the compilation by adding some missing constructors. However, > it seems this issue is bigger than I expected. It seems all the builders and > appenders have this issue, as many classes don't even have an associated > memory pool. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Updated] (ARROW-3762) [C++] Parquet arrow::Table reads error when overflowing capacity of BinaryArray
[ https://issues.apache.org/jira/browse/ARROW-3762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Kietzman updated ARROW-3762: - Description: # When reading a parquet file with binary data > 2 GiB, we get an ArrowIOError due to it not creating chunked arrays. Reading each row group individually and then concatenating the tables works, however.
{code:java}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

x = pa.array(list('1' * 2**30))
demo = 'demo.parquet'

def scenario():
    t = pa.Table.from_arrays([x], ['x'])
    writer = pq.ParquetWriter(demo, t.schema)
    for i in range(2):
        writer.write_table(t)
    writer.close()
    pf = pq.ParquetFile(demo)
    # pyarrow.lib.ArrowIOError: Arrow error: Invalid: BinaryArray cannot contain more than 2147483646 bytes, have 2147483647
    t2 = pf.read()
    # Works, but note, there are 32 row groups, not 2 as suggested by:
    # https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
    tables = [pf.read_row_group(i) for i in range(pf.num_row_groups)]
    t3 = pa.concat_tables(tables)

scenario()
{code}
was: When reading a parquet file with binary data > 2 GiB, we get an ArrowIOError due to it not creating chunked arrays. Reading each row group individually and then concatenating the tables works, however.
{code:java}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

x = pa.array(list('1' * 2**30))
demo = 'demo.parquet'

def scenario():
    t = pa.Table.from_arrays([x], ['x'])
    writer = pq.ParquetWriter(demo, t.schema)
    for i in range(2):
        writer.write_table(t)
    writer.close()
    pf = pq.ParquetFile(demo)
    # pyarrow.lib.ArrowIOError: Arrow error: Invalid: BinaryArray cannot contain more than 2147483646 bytes, have 2147483647
    t2 = pf.read()
    # Works, but note, there are 32 row groups, not 2 as suggested by:
    # https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
    tables = [pf.read_row_group(i) for i in range(pf.num_row_groups)]
    t3 = pa.concat_tables(tables)

scenario()
{code}
> [C++] Parquet arrow::Table reads error when overflowing capacity of > BinaryArray > --- > > Key: ARROW-3762 > URL: https://issues.apache.org/jira/browse/ARROW-3762 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Reporter: Chris Ellison >Assignee: Benjamin Kietzman >Priority: Major > Labels: parquet, pull-request-available > Fix For: 0.14.0, 0.15.0 > > Time Spent: 8h 10m > Remaining Estimate: 0h > > # When reading a parquet file with binary data > 2 GiB, we get an > ArrowIOError due to it not creating chunked arrays. Reading each row group > individually and then concatenating the tables works, however.
> {code:java}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> x = pa.array(list('1' * 2**30))
> demo = 'demo.parquet'
>
> def scenario():
>     t = pa.Table.from_arrays([x], ['x'])
>     writer = pq.ParquetWriter(demo, t.schema)
>     for i in range(2):
>         writer.write_table(t)
>     writer.close()
>     pf = pq.ParquetFile(demo)
>     # pyarrow.lib.ArrowIOError: Arrow error: Invalid: BinaryArray cannot contain more than 2147483646 bytes, have 2147483647
>     t2 = pf.read()
>     # Works, but note, there are 32 row groups, not 2 as suggested by:
>     # https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
>     tables = [pf.read_row_group(i) for i in range(pf.num_row_groups)]
>     t3 = pa.concat_tables(tables)
>
> scenario()
> {code}
-- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (ARROW-6396) [C++] Add CompareOptions to Compare kernels
[ https://issues.apache.org/jira/browse/ARROW-6396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919757#comment-16919757 ] Wes McKinney commented on ARROW-6396: - FWIW I wasn't familiar with the "Kleene" terminology > [C++] Add CompareOptions to Compare kernels > --- > > Key: ARROW-6396 > URL: https://issues.apache.org/jira/browse/ARROW-6396 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Francois Saint-Jacques >Priority: Major > > This would add an enum ResolveNull \{ KLEENE_LOGIC, NULL_PROPAGATE } to > define the behavior of merging with AND/OR operators on boolean. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (ARROW-3571) [Wiki] Release management guide does not explain how to set up Crossbow or where to find instructions
[ https://issues.apache.org/jira/browse/ARROW-3571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919756#comment-16919756 ] Wes McKinney commented on ARROW-3571: - I'm looking at the release management guide and it says "Setup crossbow as described in its README" So I think we can merge the Sphinx port of the README and then update the wiki page > [Wiki] Release management guide does not explain how to set up Crossbow or > where to find instructions > - > > Key: ARROW-3571 > URL: https://issues.apache.org/jira/browse/ARROW-3571 > Project: Apache Arrow > Issue Type: Improvement > Components: Wiki >Reporter: Wes McKinney >Assignee: Krisztian Szucs >Priority: Major > > If you follow the guide, at one point it says "Launch a Crossbow build" but > provides no link to the setup instructions for this -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (ARROW-6312) [C++] Declare required Libs.private in arrow.pc package config
[ https://issues.apache.org/jira/browse/ARROW-6312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919753#comment-16919753 ] Wes McKinney commented on ARROW-6312: - Would you like to update your PR to make a change for 0.15.0? > [C++] Declare required Libs.private in arrow.pc package config > -- > > Key: ARROW-6312 > URL: https://issues.apache.org/jira/browse/ARROW-6312 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.14.1 >Reporter: Michael Maguire >Assignee: Michael Maguire >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 20m > Remaining Estimate: 0h > > The current arrow.pc package config file produced is deficient and doesn't > properly declare the static library prerequisites that must be linked in > order to *statically* link libarrow.a > Currently it just has: > ``` > Libs: -L${libdir} -larrow > ``` > But in cases where you enable, e.g., snappy, brotli or zlib support in arrow, > our toolchains need to see an arrow.pc file something more like: > ``` > Libs: -L${libdir} -larrow > Libs.private: -lsnappy -lboost_system -lz -llz4 -lbrotlidec -lbrotlienc > -lbrotlicommon -lzstd > ``` > If not, we get linkage errors. I'm told the convention is that if the .a has > an UNDEF, the Requires.private plus the Libs.private should resolve all the > undefs. See the Libs.private info in [https://linux.die.net/man/1/pkg-config] > > Note, however, as Sutou Kouhei pointed out in > [https://github.com/apache/arrow/pull/5123#issuecomment-522771452,] the > additional Libs.private need to be dynamically generated based on whether > functionality like snappy, brotli or zlib is enabled. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (ARROW-6390) [Python][Flight] Add Python documentation / tutorial for Flight
[ https://issues.apache.org/jira/browse/ARROW-6390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919745#comment-16919745 ] Wes McKinney commented on ARROW-6390: - I'll try to put together a documentation skeleton for using Flight from Python > [Python][Flight] Add Python documentation / tutorial for Flight > --- > > Key: ARROW-6390 > URL: https://issues.apache.org/jira/browse/ARROW-6390 > Project: Apache Arrow > Issue Type: Improvement > Components: FlightRPC, Python >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Fix For: 0.15.0 > > > There is no Sphinx documentation for using Flight from Python. I have found > that writing documentation is an effective way to uncover usability problems > -- I would suggest we write comprehensive documentation for using Flight from > Python as a way to refine the public Python API -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Assigned] (ARROW-6390) [Python][Flight] Add Python documentation / tutorial for Flight
[ https://issues.apache.org/jira/browse/ARROW-6390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-6390: --- Assignee: Wes McKinney > [Python][Flight] Add Python documentation / tutorial for Flight > --- > > Key: ARROW-6390 > URL: https://issues.apache.org/jira/browse/ARROW-6390 > Project: Apache Arrow > Issue Type: Improvement > Components: FlightRPC, Python >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Fix For: 0.15.0 > > > There is no Sphinx documentation for using Flight from Python. I have found > that writing documentation is an effective way to uncover usability problems > -- I would suggest we write comprehensive documentation for using Flight from > Python as a way to refine the public Python API -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (ARROW-5995) [Python] pyarrow: hdfs: support file checksum
[ https://issues.apache.org/jira/browse/ARROW-5995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919742#comment-16919742 ] Wes McKinney commented on ARROW-5995: - Can you invoke {{hdfs dfs -checksum}} using a system call to obtain the value? It would only work if the {{hdfs}} CLI tool is configured correctly to access your cluster > [Python] pyarrow: hdfs: support file checksum > - > > Key: ARROW-5995 > URL: https://issues.apache.org/jira/browse/ARROW-5995 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Ruslan Kuprieiev >Priority: Minor > > I was not able to find how to retrieve checksum (`getFileChecksum` or `hadoop > fs/dfs -checksum`) for a file on hdfs. Judging by how it is implemented in > hadoop CLI [1], looks like we will also need to implement it manually in > pyarrow. Please correct me if I'm missing something. Is this feature > desirable? Or was there a good reason why it wasn't implemented already? > [1] > [https://github.com/hanborq/hadoop/blob/hadoop-hdh3u2.1/src/hdfs/org/apache/hadoop/hdfs/DFSClient.java#L719] -- This message was sent by Atlassian Jira (v8.3.2#803003)
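Shelling out to the CLI as suggested in the comment above could look roughly like the sketch below (assumptions: a correctly configured `hdfs` CLI on PATH, and tab-separated `-checksum` output of the form `<path> <algorithm> <hex bytes>`; `hdfs_checksum` and `parse_checksum_output` are hypothetical helper names):

```python
import subprocess

def hdfs_checksum(path, hdfs_cli="hdfs"):
    """Fetch a file's checksum by invoking `hdfs dfs -checksum <path>`.

    Only works where the hdfs CLI is installed and configured for the
    target cluster, per the caveat in the comment.
    """
    out = subprocess.run([hdfs_cli, "dfs", "-checksum", path],
                         capture_output=True, text=True, check=True)
    return parse_checksum_output(out.stdout)

def parse_checksum_output(line):
    # Split the tab-separated output line into its three fields.
    path, algorithm, digest = line.strip().split("\t")
    return {"path": path, "algorithm": algorithm, "checksum": digest}
```

This sidesteps implementing getFileChecksum in pyarrow itself, at the cost of a subprocess call per file.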
[jira] [Updated] (ARROW-6398) [C++] consolidate ScanOptions and ScanContext
[ https://issues.apache.org/jira/browse/ARROW-6398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6398: -- Labels: dataset pull-request-available (was: dataset) > [C++] consolidate ScanOptions and ScanContext > - > > Key: ARROW-6398 > URL: https://issues.apache.org/jira/browse/ARROW-6398 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Benjamin Kietzman >Assignee: Benjamin Kietzman >Priority: Minor > Labels: dataset, pull-request-available > > Currently ScanOptions has two distinct responsibilities: it contains the data > selector (and eventually projection schema) for the current scan and it > serves as the base class for format specific scan options. > In addition, we have ScanContext which holds the memory pool for the current > scan. > I think these classes should be rearranged as follows: ScanOptions will be > removed and FileScanOptions will be the abstract base class for format > specific scan options. ScanContext will be a concrete struct and contain the > data selector, projection schema, a vector of FileScanOptions, and any other > shared scan state. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (ARROW-5922) [Python] Unable to connect to HDFS from a worker/data node on a Kerberized cluster using pyarrow' hdfs API
[ https://issues.apache.org/jira/browse/ARROW-5922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919722#comment-16919722 ] Ben Schreck commented on ARROW-5922: I am getting the same error as you. I noticed that the error for me is in Java: under the hood, pyarrow tries to load the HDFS Java class and can't find it. I can't figure out how to fix it though... > [Python] Unable to connect to HDFS from a worker/data node on a Kerberized > cluster using pyarrow' hdfs API > -- > > Key: ARROW-5922 > URL: https://issues.apache.org/jira/browse/ARROW-5922 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.14.0 > Environment: Unix >Reporter: Saurabh Bajaj >Priority: Major > Fix For: 0.14.0 > > > Here's what I'm trying: > {{```}} > {{import pyarrow as pa }} > {{conf = \{"hadoop.security.authentication": "kerberos"} }} > {{fs = pa.hdfs.connect(kerb_ticket="/tmp/krb5cc_4", extra_conf=conf)}} > {{```}} > However, when I submit this job to the cluster using {{Dask-YARN}}, I get the > following error: > ``` > {{File "test/run.py", line 3 fs = > pa.hdfs.connect(kerb_ticket="/tmp/krb5cc_4", extra_conf=conf) File > "/opt/hadoop/data/10/hadoop/yarn/local/usercache/hdfsf6/appcache/application_1560931326013_183242/container_e47_1560931326013_183242_01_03/environment/lib/python3.7/site-packages/pyarrow/hdfs.py", > line 211, in connect File > "/opt/hadoop/data/10/hadoop/yarn/local/usercache/hdfsf6/appcache/application_1560931326013_183242/container_e47_1560931326013_183242_01_03/environment/lib/python3.7/site-packages/pyarrow/hdfs.py", > line 38, in __init__ File "pyarrow/io-hdfs.pxi", line 105, in > pyarrow.lib.HadoopFileSystem._connect File "pyarrow/error.pxi", line 83, in > pyarrow.lib.check_status pyarrow.lib.ArrowIOError: HDFS connection failed}} > {{```}} > I also tried setting {{host (to a name node)}} and {{port (=8020)}}, however > I run into the same error. 
Since the error is not descriptive, I'm not sure > which setting needs to be altered. Any clues anyone? -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Resolved] (ARROW-6231) [Python] Consider assigning default column names when reading CSV file and header_rows=0
[ https://issues.apache.org/jira/browse/ARROW-6231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques resolved ARROW-6231. --- Resolution: Fixed Issue resolved by pull request 5206 [https://github.com/apache/arrow/pull/5206] > [Python] Consider assigning default column names when reading CSV file and > header_rows=0 > > > Key: ARROW-6231 > URL: https://issues.apache.org/jira/browse/ARROW-6231 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Antoine Pitrou >Priority: Major > Labels: csv, pull-request-available > Fix For: 0.15.0 > > Time Spent: 4.5h > Remaining Estimate: 0h > > This is a slight usability rough edge. Assigning default names (like "f0, f1, > ...") would probably be better since then at least you can see how many > columns there are and what is in them. > {code} > In [10]: parse_options = csv.ParseOptions(delimiter='|', header_rows=0) > > > In [11]: %time table = csv.read_csv('Performance_2016Q4.txt', > parse_options=parse_options) > > --- > ArrowInvalid Traceback (most recent call last) > in > ~/miniconda/envs/pyarrow-14-1/lib/python3.7/site-packages/pyarrow/_csv.pyx in > pyarrow._csv.read_csv() > ~/miniconda/envs/pyarrow-14-1/lib/python3.7/site-packages/pyarrow/error.pxi > in pyarrow.lib.check_status() > ArrowInvalid: header_rows == 0 needs explicit column names > {code} > In pandas integers are used, so some kind of default string would have to be > defined > {code} > In [18]: df = pd.read_csv('Performance_2016Q4.txt', sep='|', header=None, > low_memory=False) > > In [19]: df.columns > > > Out[19]: > Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, > 16, > 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30], >dtype='int64') > {code} -- This message was sent by Atlassian Jira (v8.3.2#803003)
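The default-naming scheme discussed in the issue above ("f0, f1, ...") is trivial to generate on the caller's side as a stopgap, for example:

```python
def default_column_names(num_columns):
    """Generate "f0", "f1", ... placeholder names for a headerless CSV.

    Mirrors the default-name scheme suggested in the issue; callers could
    pass these as explicit column names when reading a CSV with no header.
    """
    return ["f{}".format(i) for i in range(num_columns)]
```

The generated list could then be supplied wherever the CSV reader expects explicit column names.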
[jira] [Closed] (ARROW-6380) Method pyarrow.parquet.read_table has memory spikes from version 0.14
[ https://issues.apache.org/jira/browse/ARROW-6380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney closed ARROW-6380. --- > Method pyarrow.parquet.read_table has memory spikes from version 0.14 > - > > Key: ARROW-6380 > URL: https://issues.apache.org/jira/browse/ARROW-6380 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.14.0, 0.14.1 > Environment: ubuntu 18, 16GB ram, 4 cpus >Reporter: Renan Alves Fonseca >Priority: Major > > Method pyarrow.parquet.read_table is very slow and causes RAM spikes from > version 0.14.0. > Reading a 40MB parquet file takes less than 1 second in versions 0.11, 0.12 > and 0.13, whereas it takes from 6 to 30 seconds in versions 0.14.x. > This performance impact is easily measured. However, there is another > problem that I could only detect on the htop screen. While opening a 40MB > parquet file, the process occupies almost 16GB for some milliseconds. The pyarrow > table will result in around 300MB in the python process (measured using > memory-profiler). This does not happen in versions 0.13 and previous ones. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (ARROW-6398) [C++] consolidate ScanOptions and ScanContext
Benjamin Kietzman created ARROW-6398: Summary: [C++] consolidate ScanOptions and ScanContext Key: ARROW-6398 URL: https://issues.apache.org/jira/browse/ARROW-6398 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Benjamin Kietzman Assignee: Benjamin Kietzman Currently ScanOptions has two distinct responsibilities: it contains the data selector (and eventually projection schema) for the current scan and it serves as the base class for format specific scan options. In addition, we have ScanContext which holds the memory pool for the current scan. I think these classes should be rearranged as follows: ScanOptions will be removed and FileScanOptions will be the abstract base class for format specific scan options. ScanContext will be a concrete struct and contain the data selector, projection schema, a vector of FileScanOptions, and any other shared scan state. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Updated] (ARROW-6397) [C++][CI] Fix S3 minio failure
[ https://issues.apache.org/jira/browse/ARROW-6397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6397: -- Labels: pull-request-available (was: ) > [C++][CI] Fix S3 minio failure > -- > > Key: ARROW-6397 > URL: https://issues.apache.org/jira/browse/ARROW-6397 > Project: Apache Arrow > Issue Type: New Feature > Components: C++, Continuous Integration >Reporter: Francois Saint-Jacques >Assignee: Francois Saint-Jacques >Priority: Major > Labels: pull-request-available > > See > [https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/27065941/job/gwjmr2hudm7693ef] -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (ARROW-6397) [C++][CI] Fix S3 minio failure
[ https://issues.apache.org/jira/browse/ARROW-6397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919623#comment-16919623 ] Francois Saint-Jacques commented on ARROW-6397: --- I think the non-empty directory is not affecting the test. The bind error is the real issue. > [C++][CI] Fix S3 minio failure > -- > > Key: ARROW-6397 > URL: https://issues.apache.org/jira/browse/ARROW-6397 > Project: Apache Arrow > Issue Type: New Feature > Components: C++, Continuous Integration >Reporter: Francois Saint-Jacques >Assignee: Francois Saint-Jacques >Priority: Major > > See > [https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/27065941/job/gwjmr2hudm7693ef] -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (ARROW-5618) [C++] [Parquet] Using deprecated Int96 storage for timestamps triggers integer overflow in some cases
[ https://issues.apache.org/jira/browse/ARROW-5618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919611#comment-16919611 ] TP Boudreau commented on ARROW-5618: (Sorry if this is a duplicate comment -- first attempt doesn't seem to have posted.) This issue fell through the cracks on my end -- I'll look into it this weekend. > [C++] [Parquet] Using deprecated Int96 storage for timestamps triggers > integer overflow in some cases > - > > Key: ARROW-5618 > URL: https://issues.apache.org/jira/browse/ARROW-5618 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: TP Boudreau >Assignee: TP Boudreau >Priority: Minor > Labels: parquet, pull-request-available > Time Spent: 1h 10m > Remaining Estimate: 0h > > When storing Arrow timestamps in Parquet files using the Int96 storage > format, certain combinations of array lengths and validity bitmasks cause an > integer overflow error on read. It's not immediately clear whether the > Arrow/Parquet writer is storing zeroes when it should be storing positive > values or the reader is attempting to calculate a nanoseconds value > inappropriately from zeroed inputs (perhaps missing the null bit flag). Also > not immediately clear why only certain length columns seem to be affected. > Probably the quickest way to reproduce this undefined behavior is to alter > the existing unit test UseDeprecatedInt96 (in file > .../arrow/cpp/src/parquet/arrow/arrow-reader-writer-test.cc) by quadrupling > its column lengths (repeating the same values), followed by 'make unittest' > using clang-7 with sanitizers enabled. (Here's a patch applicable to current > master that changes the test as described: [1]; I used the following cmake > command to build my environment: [2].) You should get a log something like > [3]. If requested, I'll see if I can put together a stand-alone minimal test > case that induces the behavior. 
> The quick-hack at [4] will prevent integer overflows, but this is only > included to confirm the proximate cause of the bug: the Julian days field of > the Int96 appears to be zero, when a strictly positive number is expected. > I've assigned the issue to myself and I'll start looking into the root cause > of this. > [1] https://gist.github.com/tpboudreau/b6610c13cbfede4d6b171da681d1f94e > [2] https://gist.github.com/tpboudreau/59178ca8cb50a935aab7477805aa32b9 > [3] https://gist.github.com/tpboudreau/0c2d0a18960c1aa04c838fa5c2ac7d2d > [4] https://gist.github.com/tpboudreau/0993beb5c8c1488028e76fb2ca179b7f -- This message was sent by Atlassian Jira (v8.3.2#803003)
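To see why a zeroed Julian-day field produces an overflow, here is a Python sketch of the conventional Int96-to-nanoseconds conversion (the constant and formula are the standard ones for Int96 timestamps, not copied from the Arrow source):

```python
JULIAN_EPOCH_OFFSET = 2440588       # Julian day number of 1970-01-01
NANOS_PER_DAY = 86400 * 10**9
INT64_MAX = 2**63 - 1

def int96_to_nanos(julian_day, nanos_of_day):
    # Standard conversion: days since the Unix epoch times nanos per day,
    # plus the nanoseconds elapsed within the day.
    return (julian_day - JULIAN_EPOCH_OFFSET) * NANOS_PER_DAY + nanos_of_day

# A zeroed Julian-day field, as observed in the bug, yields a magnitude of
# roughly 2.1e20 -- far outside the int64 range (about 9.2e18), so the
# equivalent C++ multiplication is signed overflow, i.e. undefined behavior.
value = int96_to_nanos(0, 0)
```

Python's arbitrary-precision integers make the out-of-range value visible; in C++ the same arithmetic on int64_t is what the sanitizer flags.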
[jira] [Assigned] (ARROW-6397) [C++][CI] Fix S3 minio failure
[ https://issues.apache.org/jira/browse/ARROW-6397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques reassigned ARROW-6397: - Assignee: Francois Saint-Jacques > [C++][CI] Fix S3 minio failure > -- > > Key: ARROW-6397 > URL: https://issues.apache.org/jira/browse/ARROW-6397 > Project: Apache Arrow > Issue Type: New Feature > Components: C++, Continuous Integration >Reporter: Francois Saint-Jacques >Assignee: Francois Saint-Jacques >Priority: Major > > See > [https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/27065941/job/gwjmr2hudm7693ef] -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Updated] (ARROW-6368) [C++] Add RecordBatch projection functionality
[ https://issues.apache.org/jira/browse/ARROW-6368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6368: -- Labels: dataset pull-request-available (was: dataset) > [C++] Add RecordBatch projection functionality > -- > > Key: ARROW-6368 > URL: https://issues.apache.org/jira/browse/ARROW-6368 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Benjamin Kietzman >Assignee: Benjamin Kietzman >Priority: Minor > Labels: dataset, pull-request-available > > define classes RecordBatchProjector (which projects from one schema to > another, augmenting with null/constant columns where necessary) and a subtype > of RecordBatchIterator which projects each batch yielded by a wrapped > iterator. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Updated] (ARROW-6380) Method pyarrow.parquet.read_table has memory spikes from version 0.14
[ https://issues.apache.org/jira/browse/ARROW-6380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-6380: --- Fix Version/s: (was: 0.13.0) > Method pyarrow.parquet.read_table has memory spikes from version 0.14 > - > > Key: ARROW-6380 > URL: https://issues.apache.org/jira/browse/ARROW-6380 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.14.0, 0.14.1 > Environment: ubuntu 18, 16GB ram, 4 cpus >Reporter: Renan Alves Fonseca >Priority: Major > > Method pyarrow.parquet.read_table is very slow and causes RAM spikes from > version 0.14.0 > Reading a 40MB parquet file takes less than 1 second in versions 0.11, 0.12 > and 0.13, whereas it takes from 6 to 30 seconds in versions 0.14.x > This impact on performance is easily measured. However, there is another > problem that I could only detect on the htop screen. While opening a 40MB > parquet, the process occupies almost 16GB for some milliseconds. The pyarrow > table will result in around 300MB in the python process (registered using > memory-profiler). This does not happen in versions 0.13 and previous ones. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Resolved] (ARROW-6380) Method pyarrow.parquet.read_table has memory spikes from version 0.14
[ https://issues.apache.org/jira/browse/ARROW-6380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson resolved ARROW-6380. Resolution: Duplicate Thanks. This was fixed in ARROW-6060. Please reopen if you find this is still a problem in the code on master in the apache/arrow repository. > Method pyarrow.parquet.read_table has memory spikes from version 0.14 > - > > Key: ARROW-6380 > URL: https://issues.apache.org/jira/browse/ARROW-6380 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.14.0, 0.14.1 > Environment: ubuntu 18, 16GB ram, 4 cpus >Reporter: Renan Alves Fonseca >Priority: Major > Fix For: 0.13.0 > > > Method pyarrow.parquet.read_table is very slow and causes RAM spikes from > version 0.14.0 > Reading a 40MB parquet file takes less than 1 second in versions 0.11, 0.12 > and 0.13, whereas it takes from 6 to 30 seconds in versions 0.14.x > This impact on performance is easily measured. However, there is another > problem that I could only detect on the htop screen. While opening a 40MB > parquet, the process occupies almost 16GB for some milliseconds. The pyarrow > table will result in around 300MB in the python process (registered using > memory-profiler). This does not happen in versions 0.13 and previous ones. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Resolved] (ARROW-6387) [Archery] Errors with make
[ https://issues.apache.org/jira/browse/ARROW-6387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques resolved ARROW-6387. --- Fix Version/s: 0.15.0 Resolution: Fixed Issue resolved by pull request 5234 [https://github.com/apache/arrow/pull/5234] > [Archery] Errors with make > -- > > Key: ARROW-6387 > URL: https://issues.apache.org/jira/browse/ARROW-6387 > Project: Apache Arrow > Issue Type: Bug > Components: Archery >Reporter: Omer Ozarslan >Assignee: Omer Ozarslan >Priority: Minor > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > {{archery --debug benchmark run}} gives error on Debian 10, CMake 3.13.4, GNU > make 4.2.1: > {code:java} > (.venv) omer@omer ~/src/ext/arrow/cpp/build master ● archery --debug > benchmark run > > DEBUG:archery:Running benchmark WORKSPACE > > > DEBUG:archery:Executing `['/usr/bin/cmake', '-GMake', > '-DCMAKE_EXPORT_COMPILE_COMMANDS=ON', '-DCMAKE_BUILD_TYPE=release', > '-DBUILD_WARNING_LEVEL=production', '-DARROW_BUILD_TESTS=ON', > '-DARROW_BUILD_BENCHMARKS=ON', '-DARROW_PYTHON=OFF', '-DARROW_PARQUET=OFF', > '-DARROW_GANDIVA=OFF', '-DARROW_PLASMA=OFF', '-DARROW_FLIGHT=OFF', > '/home/omer/src/ext/arrow/cpp']` > CMake Error: Could not create named generator Make > > > > Generators > > Unix Makefiles = Generates standard UNIX makefiles. > > > Ninja= Generates build.ninja files. > > > Watcom WMake = Generates Watcom WMake makefiles. > > > CodeBlocks - Ninja = Generates CodeBlocks project files. > > > CodeBlocks - Unix Makefiles = Generates CodeBlocks project files. > > > CodeLite - Ninja = Generates CodeLite project files. > > CodeLite - Unix Makefiles= Generates CodeLite project files. > > Sublime Text 2 - Ninja = Generates Sublime Text 2 project files. > > Sublime Text 2 - Unix Makefiles >= Generates Sublime Text 2 project files. > > Kate - Ninja = Generates Kate project files. > > Kate - Unix Makefiles= Generates Kate project files. 
> Eclipse CDT4 - Ninja = Generates Eclipse CDT 4.0 project files. > Eclipse CDT4 - Unix Makefiles= Generates Eclipse CDT 4.0 project files. > Traceback (most recent call last): > [[[cropped]]]{code} > After trivial fix: > {code:java} > diff --git a/dev/archery/archery/utils/cmake.py > b/dev/archery/archery/utils/cmake.py > index 38aedab2d..3150ea9a6 100644 > --- a/dev/archery/archery/utils/cmake.py > +++ b/dev/archery/archery/utils/cmake.py > @@ -34,7 +34,7 @@ class CMake(Command): > in the search path. > """ > found_ninja = which("ninja") > -return "Ninja" if found_ninja else "Make" > +return "Ninja" if found_ninja else "Unix Makefiles"{code} > I get another error: > {code:java} > [[[cropped]] > -- Generating done > -- Build files have been written to: /tmp/arrow-bench-48x_yleb/WORKSPACE/build > DEBUG:archery:Executing `[None]` > Traceback (most recent call last): > File "/home/omer/src/ext/arrow/.venv/bin/archery", line
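The one-line patch above can be paraphrased as a standalone function; `found_ninja` stands in for the result of `which("ninja")` (a path, or `None` when ninja is absent):

```python
def default_generator(found_ninja):
    # CMake has no generator literally named "Make"; the valid makefile
    # generator name is "Unix Makefiles", which is what the patch uses
    # as the fallback when ninja is not on the PATH.
    return "Ninja" if found_ninja else "Unix Makefiles"
```

This fixes only the first traceback; the second error (`Executing [None]`) is a separate problem in how archery locates the build tool itself.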
[jira] [Assigned] (ARROW-6387) [Archery] Errors with make
[ https://issues.apache.org/jira/browse/ARROW-6387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques reassigned ARROW-6387: - Assignee: Omer Ozarslan > [Archery] Errors with make > -- > > Key: ARROW-6387 > URL: https://issues.apache.org/jira/browse/ARROW-6387 > Project: Apache Arrow > Issue Type: Bug >Reporter: Omer Ozarslan >Assignee: Omer Ozarslan >Priority: Minor > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > {{archery --debug benchmark run}} gives error on Debian 10, CMake 3.13.4, GNU > make 4.2.1: > {code:java} > (.venv) omer@omer ~/src/ext/arrow/cpp/build master ● archery --debug > benchmark run > > DEBUG:archery:Running benchmark WORKSPACE > > > DEBUG:archery:Executing `['/usr/bin/cmake', '-GMake', > '-DCMAKE_EXPORT_COMPILE_COMMANDS=ON', '-DCMAKE_BUILD_TYPE=release', > '-DBUILD_WARNING_LEVEL=production', '-DARROW_BUILD_TESTS=ON', > '-DARROW_BUILD_BENCHMARKS=ON', '-DARROW_PYTHON=OFF', '-DARROW_PARQUET=OFF', > '-DARROW_GANDIVA=OFF', '-DARROW_PLASMA=OFF', '-DARROW_FLIGHT=OFF', > '/home/omer/src/ext/arrow/cpp']` > CMake Error: Could not create named generator Make > > > > Generators > > Unix Makefiles = Generates standard UNIX makefiles. > > > Ninja= Generates build.ninja files. > > > Watcom WMake = Generates Watcom WMake makefiles. > > > CodeBlocks - Ninja = Generates CodeBlocks project files. > > > CodeBlocks - Unix Makefiles = Generates CodeBlocks project files. > > > CodeLite - Ninja = Generates CodeLite project files. > > CodeLite - Unix Makefiles= Generates CodeLite project files. > > Sublime Text 2 - Ninja = Generates Sublime Text 2 project files. > > Sublime Text 2 - Unix Makefiles >= Generates Sublime Text 2 project files. > > Kate - Ninja = Generates Kate project files. > > Kate - Unix Makefiles= Generates Kate project files. > Eclipse CDT4 - Ninja = Generates Eclipse CDT 4.0 project files. > Eclipse CDT4 - Unix Makefiles= Generates Eclipse CDT 4.0 project files. 
> Traceback (most recent call last): > [[[cropped]]]{code} > After trivial fix: > {code:java} > diff --git a/dev/archery/archery/utils/cmake.py > b/dev/archery/archery/utils/cmake.py > index 38aedab2d..3150ea9a6 100644 > --- a/dev/archery/archery/utils/cmake.py > +++ b/dev/archery/archery/utils/cmake.py > @@ -34,7 +34,7 @@ class CMake(Command): > in the search path. > """ > found_ninja = which("ninja") > -return "Ninja" if found_ninja else "Make" > +return "Ninja" if found_ninja else "Unix Makefiles"{code} > I get another error: > {code:java} > [[[cropped]] > -- Generating done > -- Build files have been written to: /tmp/arrow-bench-48x_yleb/WORKSPACE/build > DEBUG:archery:Executing `[None]` > Traceback (most recent call last): > File "/home/omer/src/ext/arrow/.venv/bin/archery", line 11, in > load_entry_point('archery', 'console_scripts', 'archery')() > File > "/home/omer/src/ext/arrow/.venv/lib/python3.7/site-packages/click/core.py",
[jira] [Updated] (ARROW-6387) [Archery] Errors with make
[ https://issues.apache.org/jira/browse/ARROW-6387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques updated ARROW-6387: -- Component/s: Archery > [Archery] Errors with make > -- > > Key: ARROW-6387 > URL: https://issues.apache.org/jira/browse/ARROW-6387 > Project: Apache Arrow > Issue Type: Bug > Components: Archery >Reporter: Omer Ozarslan >Assignee: Omer Ozarslan >Priority: Minor > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > {{archery --debug benchmark run}} gives error on Debian 10, CMake 3.13.4, GNU > make 4.2.1: > {code:java} > (.venv) omer@omer ~/src/ext/arrow/cpp/build master ● archery --debug > benchmark run > > DEBUG:archery:Running benchmark WORKSPACE > > > DEBUG:archery:Executing `['/usr/bin/cmake', '-GMake', > '-DCMAKE_EXPORT_COMPILE_COMMANDS=ON', '-DCMAKE_BUILD_TYPE=release', > '-DBUILD_WARNING_LEVEL=production', '-DARROW_BUILD_TESTS=ON', > '-DARROW_BUILD_BENCHMARKS=ON', '-DARROW_PYTHON=OFF', '-DARROW_PARQUET=OFF', > '-DARROW_GANDIVA=OFF', '-DARROW_PLASMA=OFF', '-DARROW_FLIGHT=OFF', > '/home/omer/src/ext/arrow/cpp']` > CMake Error: Could not create named generator Make > > > > Generators > > Unix Makefiles = Generates standard UNIX makefiles. > > > Ninja= Generates build.ninja files. > > > Watcom WMake = Generates Watcom WMake makefiles. > > > CodeBlocks - Ninja = Generates CodeBlocks project files. > > > CodeBlocks - Unix Makefiles = Generates CodeBlocks project files. > > > CodeLite - Ninja = Generates CodeLite project files. > > CodeLite - Unix Makefiles= Generates CodeLite project files. > > Sublime Text 2 - Ninja = Generates Sublime Text 2 project files. > > Sublime Text 2 - Unix Makefiles >= Generates Sublime Text 2 project files. > > Kate - Ninja = Generates Kate project files. > > Kate - Unix Makefiles= Generates Kate project files. > Eclipse CDT4 - Ninja = Generates Eclipse CDT 4.0 project files. > Eclipse CDT4 - Unix Makefiles= Generates Eclipse CDT 4.0 project files. 
> Traceback (most recent call last): > [[[cropped]]]{code} > After trivial fix: > {code:java} > diff --git a/dev/archery/archery/utils/cmake.py > b/dev/archery/archery/utils/cmake.py > index 38aedab2d..3150ea9a6 100644 > --- a/dev/archery/archery/utils/cmake.py > +++ b/dev/archery/archery/utils/cmake.py > @@ -34,7 +34,7 @@ class CMake(Command): > in the search path. > """ > found_ninja = which("ninja") > -return "Ninja" if found_ninja else "Make" > +return "Ninja" if found_ninja else "Unix Makefiles"{code} > I get another error: > {code:java} > [[[cropped]] > -- Generating done > -- Build files have been written to: /tmp/arrow-bench-48x_yleb/WORKSPACE/build > DEBUG:archery:Executing `[None]` > Traceback (most recent call last): > File "/home/omer/src/ext/arrow/.venv/bin/archery", line 11, in > load_entry_point('archery', 'console_scripts', 'archery')() > File >
[jira] [Created] (ARROW-6397) [C++][CI] Fix S3 minio failure
Francois Saint-Jacques created ARROW-6397: - Summary: [C++][CI] Fix S3 minio failure Key: ARROW-6397 URL: https://issues.apache.org/jira/browse/ARROW-6397 Project: Apache Arrow Issue Type: New Feature Components: C++, Continuous Integration Reporter: Francois Saint-Jacques See [https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/27065941/job/gwjmr2hudm7693ef] -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Updated] (ARROW-6341) [Python] Implement low-level bindings for Dataset
[ https://issues.apache.org/jira/browse/ARROW-6341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6341: -- Labels: dataset pull-request-available (was: dataset) > [Python] Implement low-level bindings for Dataset > - > > Key: ARROW-6341 > URL: https://issues.apache.org/jira/browse/ARROW-6341 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Reporter: Francois Saint-Jacques >Assignee: Krisztian Szucs >Priority: Major > Labels: dataset, pull-request-available > > The following classes should be accessible from Python: > * class DataSource > * class DataFragment > * function DiscoverySource > * class ScanContext, ScanOptions, ScanTask > * class Dataset > * class ScannerBuilder > * class Scanner > The end result is reading a directory of parquet files as a single stream. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Updated] (ARROW-6341) [Python] Implement low-level bindings for Dataset
[ https://issues.apache.org/jira/browse/ARROW-6341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs updated ARROW-6341: --- Summary: [Python] Implement low-level bindings for Dataset (was: [Python] Implements low-level bindings to Dataset classes:) > [Python] Implement low-level bindings for Dataset > - > > Key: ARROW-6341 > URL: https://issues.apache.org/jira/browse/ARROW-6341 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Reporter: Francois Saint-Jacques >Assignee: Krisztian Szucs >Priority: Major > Labels: dataset > > The following classes should be accessible from Python: > * class DataSource > * class DataFragment > * function DiscoverySource > * class ScanContext, ScanOptions, ScanTask > * class Dataset > * class ScannerBuilder > * class Scanner > The end result is reading a directory of parquet files as a single stream. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Updated] (ARROW-6344) [C++][Gandiva] substring does not handle multibyte characters
[ https://issues.apache.org/jira/browse/ARROW-6344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6344: -- Labels: pull-request-available (was: ) > [C++][Gandiva] substring does not handle multibyte characters > - > > Key: ARROW-6344 > URL: https://issues.apache.org/jira/browse/ARROW-6344 > Project: Apache Arrow > Issue Type: Bug >Reporter: Prudhvi Porandla >Assignee: Prudhvi Porandla >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Updated] (ARROW-6396) [C++] Add CompareOptions to Compare kernels
[ https://issues.apache.org/jira/browse/ARROW-6396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques updated ARROW-6396: -- Description: This would add an enum ResolveNull \{ KLEENE_LOGIC, NULL_PROPAGATE } to define the behavior of merging with AND/OR operators on boolean. (was: This would add an enum ResolveNull \{ KLEENE_LOGIC, NULL_PROPAGATE }.) > [C++] Add CompareOptions to Compare kernels > --- > > Key: ARROW-6396 > URL: https://issues.apache.org/jira/browse/ARROW-6396 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Francois Saint-Jacques >Priority: Major > > This would add an enum ResolveNull \{ KLEENE_LOGIC, NULL_PROPAGATE } to > define the behavior of merging with AND/OR operators on boolean. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (ARROW-6396) [C++] Add CompareOptions to Compare kernels
Francois Saint-Jacques created ARROW-6396: - Summary: [C++] Add CompareOptions to Compare kernels Key: ARROW-6396 URL: https://issues.apache.org/jira/browse/ARROW-6396 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Francois Saint-Jacques This would add an enum ResolveNull \{ KLEENE_LOGIC, NULL_PROPAGATE }. -- This message was sent by Atlassian Jira (v8.3.2#803003)
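A Python sketch of the two null-handling policies the enum would select between (my reading of the proposal, not Arrow code): under Kleene logic a dominant operand decides the result even when the other side is null, while under null propagation any null input yields null.

```python
def and_kleene(a, b):
    # Kleene three-valued AND: False dominates null, because
    # False AND x is False for any x.
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return a and b

def and_propagate(a, b):
    # Null-propagating AND: a null input always produces null.
    if a is None or b is None:
        return None
    return a and b
```

The two policies differ only on inputs like `(False, None)`, which is exactly why the behavior needs to be a selectable option.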
[jira] [Commented] (ARROW-3571) [Wiki] Release management guide does not explain how to set up Crossbow or where to find instructions
[ https://issues.apache.org/jira/browse/ARROW-3571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919388#comment-16919388 ] Krisztian Szucs commented on ARROW-3571: Yeah, I've moved the crossbow README to sphinx. Do you mean the whole release management guide? > [Wiki] Release management guide does not explain how to set up Crossbow or > where to find instructions > - > > Key: ARROW-3571 > URL: https://issues.apache.org/jira/browse/ARROW-3571 > Project: Apache Arrow > Issue Type: Improvement > Components: Wiki >Reporter: Wes McKinney >Assignee: Krisztian Szucs >Priority: Major > > If you follow the guide, at one point it says "Launch a Crossbow build" but > provides no link to the setup instructions for this -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (ARROW-6395) [pyarrow] Bug when using bool arrays with stride greater than 1
[ https://issues.apache.org/jira/browse/ARROW-6395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919377#comment-16919377 ] Igor Yastrebov commented on ARROW-6395: --- [~jorisvandenbossche] is this solved by [ARROW-6325|https://issues.apache.org/jira/browse/ARROW-6325]? > [pyarrow] Bug when using bool arrays with stride greater than 1 > --- > > Key: ARROW-6395 > URL: https://issues.apache.org/jira/browse/ARROW-6395 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 0.14.0 >Reporter: Philip Felton >Priority: Major > > Here's code to reproduce it: > {code:python} > >>> import numpy as np > >>> import pyarrow as pa > >>> pa.__version__ > '0.14.0' > >>> xs = np.array([True, False, False, True, True, False, True, True, True, > >>> False, False, False, False, False, True, False, True, True, True, True, > >>> True]) > >>> xs_sliced = xs[0::2] > >>> xs_sliced > array([ True, False, True, True, True, False, False, True, True, > True, True]) > >>> pa_xs = pa.array(xs_sliced, pa.bool_()) > >>> pa_xs > > [ > true, > false, > false, > false, > false, > false, > false, > false, > false, > false, > false > ]{code} -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (ARROW-6395) [pyarrow] Bug when using bool arrays with stride greater than 1
Philip Felton created ARROW-6395: Summary: [pyarrow] Bug when using bool arrays with stride greater than 1 Key: ARROW-6395 URL: https://issues.apache.org/jira/browse/ARROW-6395 Project: Apache Arrow Issue Type: Bug Affects Versions: 0.14.0 Reporter: Philip Felton Here's code to reproduce it: {code:python} >>> import numpy as np >>> import pyarrow as pa >>> pa.__version__ '0.14.0' >>> xs = np.array([True, False, False, True, True, False, True, True, True, >>> False, False, False, False, False, True, False, True, True, True, True, >>> True]) >>> xs_sliced = xs[0::2] >>> xs_sliced array([ True, False, True, True, True, False, False, True, True, True, True]) >>> pa_xs = pa.array(xs_sliced, pa.bool_()) >>> pa_xs [ true, false, false, false, false, false, false, false, false, false, false ]{code} -- This message was sent by Atlassian Jira (v8.3.2#803003)
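The repro suggests pyarrow mishandles the non-contiguous buffer produced by the stride-2 slice. A possible workaround (an assumption on my part, not verified against pyarrow 0.14.0) is to copy the slice into a contiguous array before handing it to pyarrow:

```python
import numpy as np

xs = np.array([True, False, False, True, True, False, True, True, True,
               False, False, False, False, False, True, False, True, True,
               True, True, True])
xs_sliced = xs[0::2]                         # stride-2 view, not C-contiguous
xs_contig = np.ascontiguousarray(xs_sliced)  # contiguous copy, same values

# Hypothetical workaround for the bug above:
# pa_xs = pa.array(xs_contig, pa.bool_())
```

`ascontiguousarray` is a no-op on already-contiguous input, so it is cheap to apply defensively.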
[jira] [Commented] (ARROW-6380) Method pyarrow.parquet.read_table has memory spikes from version 0.14
[ https://issues.apache.org/jira/browse/ARROW-6380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919370#comment-16919370 ] Igor Yastrebov commented on ARROW-6380: --- Is it a duplicate of [ARROW-6059|https://issues.apache.org/jira/browse/ARROW-6059]? > Method pyarrow.parquet.read_table has memory spikes from version 0.14 > - > > Key: ARROW-6380 > URL: https://issues.apache.org/jira/browse/ARROW-6380 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.14.0, 0.14.1 > Environment: ubuntu 18, 16GB ram, 4 cpus >Reporter: Renan Alves Fonseca >Priority: Major > Fix For: 0.13.0 > > > Method pyarrow.parquet.read_table is very slow and causes RAM spikes from > version 0.14.0 > Reading a 40MB parquet file takes less than 1 second in versions 0.11, 0.12 > and 0.13, whereas it takes from 6 to 30 seconds in versions 0.14.x > This impact on performance is easily measured. However, there is another > problem that I could only detect on the htop screen. While opening a 40MB > parquet, the process occupies almost 16GB for some milliseconds. The pyarrow > table will result in around 300MB in the python process (registered using > memory-profiler). This does not happen in versions 0.13 and previous ones. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Resolved] (ARROW-6144) [C++][Gandiva] Implement random function in Gandiva
[ https://issues.apache.org/jira/browse/ARROW-6144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Praveen Kumar Desabandu resolved ARROW-6144. Fix Version/s: 0.15.0 Resolution: Fixed Issue resolved by pull request 5022 [https://github.com/apache/arrow/pull/5022] > [C++][Gandiva] Implement random function in Gandiva > --- > > Key: ARROW-6144 > URL: https://issues.apache.org/jira/browse/ARROW-6144 > Project: Apache Arrow > Issue Type: Task > Components: C++ - Gandiva >Reporter: Prudhvi Porandla >Assignee: Prudhvi Porandla >Priority: Minor > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 2h 20m > Remaining Estimate: 0h > > Implement random(), random(int seed) functions. The values are sampled from a > uniform distribution (0, 1). The random values for each row of a column are > generated from the same generator, which is initialised at (function) build time. -- This message was sent by Atlassian Jira (v8.3.2#803003)
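The described semantics -- one generator initialized at build time, with every row drawing from that same generator -- can be sketched in Python (illustrative only; Gandiva's implementation is C++/LLVM):

```python
import random

def make_random(seed=None):
    # One generator per function instance, created at "build" time.
    rng = random.Random(seed)
    # Each row evaluation draws the next sample from this same generator.
    return rng.random

next_value = make_random(seed=42)
samples = [next_value() for _ in range(5)]  # one draw per "row"
```

Seeding makes the per-row sequence reproducible: two function instances built with the same seed produce identical streams.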
[jira] [Updated] (ARROW-6394) [Java] Support conversions between delta vector and partial sum vector
[ https://issues.apache.org/jira/browse/ARROW-6394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6394: -- Labels: pull-request-available (was: ) > [Java] Support conversions between delta vector and partial sum vector > -- > > Key: ARROW-6394 > URL: https://issues.apache.org/jira/browse/ARROW-6394 > Project: Apache Arrow > Issue Type: New Feature > Components: Java >Reporter: Liya Fan >Assignee: Liya Fan >Priority: Major > Labels: pull-request-available > > What is a delta vector/partial sum vector? > Given an integer vector a with length n, its partial sum vector is another > integer vector b with length n + 1, with values defined as: > b(0) = initial sum > b(i) = a(0) + a(1) + ... + a(i - 1), i = 1, 2, ..., n > Given an integer vector a with length n + 1, its delta vector is another > integer vector b with length n, with values defined as: > b(i) = a(i) - a(i - 1), i = 0, 1, ..., n - 1 > In this issue, we provide utilities to convert between a delta vector and a partial sum > vector. It is interesting to note that the two operations correspond to > discrete integration and differentiation. > These conversions have wide applications. For example, > 1. The run-length vector proposed by Micah is based on the partial sum > vector, while the deduplication functionality is based on the delta vector. This > issue provides conversions between them. > 2. The current VarCharVector/VarBinaryVector implementations are based on > partial sum vector. We can transform them to delta vectors before IPC, to > reduce network traffic. > 3. Converting to delta can be considered as a way for data compression. To > further reduce the data volume, the operation can be applied more than once. -- This message was sent by Atlassian Jira (v8.3.2#803003)
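The definitions above translate directly into Python (a sketch of the two conversions, not the Java utilities themselves):

```python
def to_partial_sum(a, initial_sum=0):
    # b(0) = initial sum; b(i) = a(0) + ... + a(i - 1), so the result
    # has length n + 1 for an input of length n.
    b = [initial_sum]
    for x in a:
        b.append(b[-1] + x)
    return b

def to_delta(b):
    # b(i) = a(i) - a(i - 1): adjacent differences, length n for an
    # input of length n + 1.
    return [b[i] - b[i - 1] for i in range(1, len(b))]
```

The two operations are inverses (up to the initial sum), which is what makes round-tripping for IPC or compression safe.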
[jira] [Commented] (ARROW-5995) [Python] pyarrow: hdfs: support file checksum
[ https://issues.apache.org/jira/browse/ARROW-5995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919311#comment-16919311 ] Ruslan Kuprieiev commented on ARROW-5995: - Btw, [~wesmckinn] [~npr] , what are your thoughts on this? > [Python] pyarrow: hdfs: support file checksum > - > > Key: ARROW-5995 > URL: https://issues.apache.org/jira/browse/ARROW-5995 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Ruslan Kuprieiev >Priority: Minor > > I was not able to find how to retrieve checksum (`getFileChecksum` or `hadoop > fs/dfs -checksum`) for a file on hdfs. Judging by how it is implemented in > hadoop CLI [1], looks like we will also need to implement it manually in > pyarrow. Please correct me if I'm missing something. Is this feature > desirable? Or was there a good reason why it wasn't implemented already? > [1] > [https://github.com/hanborq/hadoop/blob/hadoop-hdh3u2.1/src/hdfs/org/apache/hadoop/hdfs/DFSClient.java#L719] -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (ARROW-5995) [Python] pyarrow: hdfs: support file checksum
[ https://issues.apache.org/jira/browse/ARROW-5995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919309#comment-16919309 ] Ruslan Kuprieiev commented on ARROW-5995: - You are right, such a hackish approach would probably not pass the reviews. But it might be a good temporary pure-python workaround if parsing those metafiles is comparatively simple, so we don't have to mess around with our own C library, for which we would have to ship wheels (which is a hassle). And having that workaround, we could submit and patiently wait for proper patches to get merged into libhdfs and pyarrow. If the workaround is hard to implement, then we could skip it and keep using hadoop CLI as we do right now, focusing on proper patches to libhdfs and pyarrow. What do you think? :) > [Python] pyarrow: hdfs: support file checksum > - > > Key: ARROW-5995 > URL: https://issues.apache.org/jira/browse/ARROW-5995 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Ruslan Kuprieiev >Priority: Minor > > I was not able to find how to retrieve checksum (`getFileChecksum` or `hadoop > fs/dfs -checksum`) for a file on hdfs. Judging by how it is implemented in > hadoop CLI [1], looks like we will also need to implement it manually in > pyarrow. Please correct me if I'm missing something. Is this feature > desirable? Or was there a good reason why it wasn't implemented already? > [1] > [https://github.com/hanborq/hadoop/blob/hadoop-hdh3u2.1/src/hdfs/org/apache/hadoop/hdfs/DFSClient.java#L719] -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Updated] (ARROW-6394) [Java] Support conversions between delta vector and partial sum vector
[ https://issues.apache.org/jira/browse/ARROW-6394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Liya Fan updated ARROW-6394:
----------------------------
    Description:

What is a delta vector/partial sum vector?

Given an integer vector a with length n, its partial sum vector is another integer vector b with length n + 1, with values defined as:

b(0) = initial sum
b(i) = a(0) + a(1) + ... + a(i - 1), i = 1, 2, ..., n

Given an integer vector a with length n + 1, its delta vector is another integer vector b with length n, with values defined as:

b(i) = a(i + 1) - a(i), i = 0, 1, ..., n - 1

In this issue, we provide utilities to convert between a delta vector and a partial sum vector. It is interesting to note that the two operations correspond to discrete integration and differentiation.

These conversions have wide applications. For example:
1. The run-length vector proposed by Micah is based on the partial sum vector, while the deduplication functionality is based on the delta vector. This issue provides conversions between them.
2. The current VarCharVector/VarBinaryVector implementations are based on a partial sum vector (the offsets buffer). We can transform it to a delta vector before IPC, to reduce network traffic.
3. Converting to deltas can be considered a form of data compression; the operation can be applied more than once to further reduce the data volume.

> [Java] Support conversions between delta vector and partial sum vector
> ----------------------------------------------------------------------
>
>                 Key: ARROW-6394
>                 URL: https://issues.apache.org/jira/browse/ARROW-6394
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Java
>            Reporter: Liya Fan
>            Assignee: Liya Fan
>            Priority: Major
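The two conversions described in the issue can be sketched briefly; the issue itself targets the Java library, so the Python function names below are illustrative, not Arrow APIs:

```python
def to_partial_sum(a, initial=0):
    """Partial sum vector of a: length n + 1, with b[0] = initial
    and b[i] = initial + a[0] + ... + a[i - 1] (discrete integration)."""
    b = [initial]
    for x in a:
        b.append(b[-1] + x)
    return b


def to_delta(a):
    """Delta vector of a: length n, with b[i] = a[i + 1] - a[i]
    (discrete differentiation; inverse of to_partial_sum)."""
    return [a[i + 1] - a[i] for i in range(len(a) - 1)]


# Round trip: the delta of a partial sum vector recovers the original.
lengths = [3, 1, 4]                # e.g. string lengths in a VarCharVector
offsets = to_partial_sum(lengths)  # the offsets buffer: [0, 3, 4, 8]
assert to_delta(offsets) == lengths
```

The round trip illustrates point 2 above: variable-width vectors store offsets (a partial sum of the value lengths), and sending the deltas instead transmits the same information in one fewer element and, typically, smaller magnitudes.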
[jira] [Created] (ARROW-6394) [Java] Support conversions between delta vector and partial sum vector
Liya Fan created ARROW-6394:
----------------------------

             Summary: [Java] Support conversions between delta vector and partial sum vector
                 Key: ARROW-6394
                 URL: https://issues.apache.org/jira/browse/ARROW-6394
             Project: Apache Arrow
          Issue Type: New Feature
          Components: Java
            Reporter: Liya Fan
            Assignee: Liya Fan

What is a delta vector/partial sum vector?

Given an integer vector a with length n, its partial sum vector is another integer vector b with length n + 1, with values defined as:

b(0) = initial sum
b(i) = a(0) + a(1) + ... + a(i - 1), i = 1, 2, ..., n

Given an integer vector a with length n + 1, its delta vector is another integer vector b with length n, with values defined as:

b(i) = a(i + 1) - a(i), i = 0, 1, ..., n - 1

In this issue, we provide utilities to convert between a delta vector and a partial sum vector. It is interesting to note that the two operations correspond to discrete integration and differentiation.

These conversions have wide applications. For example:
1. The run-length vector proposed by Micah is based on the partial sum vector, while the deduplication functionality is based on the delta vector. This issue provides conversions between them.
2. The current VarCharVector/VarBinaryVector implementations are based on a partial sum vector (the offsets buffer). We can transform it to a delta vector before IPC, to reduce network traffic.
3. Converting to deltas can be considered a form of data compression; the operation can be applied more than once to further reduce the data volume.