[jira] [Commented] (ARROW-596) [Python] Add convenience function to convert pandas.DataFrame to pyarrow.Buffer containing a file or stream representation
[ https://issues.apache.org/jira/browse/ARROW-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15907023#comment-15907023 ] Antoine Pitrou commented on ARROW-596: -- Cython allows you to implement the buffer protocol: see https://cython.readthedocs.io/en/latest/src/userguide/buffer.html . I've never used it but it looks similar to what you would do in C. Note that pyarrow.Buffer needs to be a fixed-size buffer for that operation to make sense. If not, then __getbuffer__ should lock the buffer size until __releasebuffer__ is called. > [Python] Add convenience function to convert pandas.DataFrame to > pyarrow.Buffer containing a file or stream representation > -- > > Key: ARROW-596 > URL: https://issues.apache.org/jira/browse/ARROW-596 > Project: Apache Arrow > Issue Type: New Feature > Components: Python > Reporter: Wes McKinney > -- This message was sent by Atlassian JIRA (v6.3.15#6346)
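The size-locking contract described in the comment can be illustrated with the stdlib alone. This sketch uses a bytearray (a built-in buffer exporter) rather than pyarrow.Buffer, so it only models the behaviour a Cython __getbuffer__/__releasebuffer__ pair would need to enforce:

```python
# Illustration only: a bytearray stands in for a resizable buffer.
ba = bytearray(b"arrow")
mv = memoryview(ba)  # __getbuffer__ runs here; the size is now locked

try:
    ba.append(0)  # resizing while a buffer is exported is refused
except BufferError as exc:
    print("resize refused:", exc)

mv.release()  # __releasebuffer__; the size lock is dropped
ba.append(0)  # now allowed again
```

The same rule is what a Cython implementation on a non-fixed-size pyarrow.Buffer would have to provide by hand.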
[jira] [Created] (ARROW-2544) [CI] Run C++ tests with two jobs on Travis-CI
Antoine Pitrou created ARROW-2544: - Summary: [CI] Run C++ tests with two jobs on Travis-CI Key: ARROW-2544 URL: https://issues.apache.org/jira/browse/ARROW-2544 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration Reporter: Antoine Pitrou Assignee: Omer Katz See https://github.com/apache/arrow/pull/1899 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2545) [Python] Arrow fails linking against statically-compiled Python
Antoine Pitrou created ARROW-2545: - Summary: [Python] Arrow fails linking against statically-compiled Python Key: ARROW-2545 URL: https://issues.apache.org/jira/browse/ARROW-2545 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.9.0 Reporter: Antoine Pitrou See https://issues.apache.org/jira/browse/ARROW-1661?focusedCommentId=16462745&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16462745 : to link statically against {{libpythonXX.a}}, you need to add in some system libraries such as {{libutil}}. Otherwise some symbols end up unresolved. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2546) [CI] Intermittent npm failures
Antoine Pitrou created ARROW-2546: - Summary: [CI] Intermittent npm failures Key: ARROW-2546 URL: https://issues.apache.org/jira/browse/ARROW-2546 Project: Apache Arrow Issue Type: Bug Components: Continuous Integration, JavaScript Reporter: Antoine Pitrou See for example https://travis-ci.org/apache/arrow/jobs/375891278 . {code} npm WARN deprecated gulp-util@3.0.8: gulp-util is deprecated - replace it, following the guidelines at https://medium.com/gulpjs/gulp-util-ca3b1f9f9ac5 npm WARN deprecated standard-format@1.6.10: standard-format is deprecated in favor of a built-in autofixer in 'standard'. Usage: standard --fix npm WARN deprecated minimatch@2.0.10: Please update to minimatch 3.0.2 or higher to avoid a RegExp DoS issue npm WARN tar ENOENT: no such file or directory, open '/home/travis/build/apache/arrow/js/node_modules/.staging/google-closure-compiler-2d7bab98/contrib/externs/maps/google_maps_api_v3_23.js' npm WARN ajv-keywords@3.2.0 requires a peer of ajv@^6.0.0 but none is installed. You must install peer dependencies yourself. npm WARN optional SKIPPING OPTIONAL DEPENDENCY: fsevents@1.2.3 (node_modules/fsevents): npm WARN enoent SKIPPING OPTIONAL DEPENDENCY: ENOENT: no such file or directory, rename '/home/travis/build/apache/arrow/js/node_modules/.staging/fsevents-5f35bbaf/node_modules/abbrev' -> '/home/travis/build/apache/arrow/js/node_modules/.staging/abbrev-e214f964' npm ERR! code EINTEGRITY npm ERR! sha512-bqB1yS6o9TNA9ZC/MJxM0FZzPnZdtHj0xWK/IZ5khzVqdpGul/R/EIiHRgFXlwTD7PSIaYVnGKq1QgMCu2mnqw== integrity checksum failed when using sha512: wanted sha512-bqB1yS6o9TNA9ZC/MJxM0FZzPnZdtHj0xWK/IZ5khzVqdpGul/R/EIiHRgFXlwTD7PSIaYVnGKq1QgMCu2mnqw== but got sha512-kgTmj+eAwkxGNzcVy5l66pJ3Exmxgj4IdQQ5fK53JTbfThLZFQybsk64V8pq2MMKXcqkkU6/0gGHXKbURv065w==. (4688848 bytes) npm ERR! A complete log of this run can be found in: npm ERR! /home/travis/.npm/_logs/2018-05-07T13_34_45_558Z-debug.log {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2563) [Rust] Poor caching in Travis-CI
Antoine Pitrou created ARROW-2563: - Summary: [Rust] Poor caching in Travis-CI Key: ARROW-2563 URL: https://issues.apache.org/jira/browse/ARROW-2563 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration, Rust Reporter: Antoine Pitrou Since the Rust project isn't at the repo root, Travis-CI won't cache compiled artifacts by default. This leads to long CI times as all packages get recompiled (see https://docs.travis-ci.com/user/caching/#Rust-Cargo-cache for what gets cached). In https://travis-ci.org/pitrou/arrow/jobs/376859806 I tried the following: {code} export CARGO_TARGET_DIR=$TRAVIS_BUILD_DIR/target {code} and after a first run, the build time went down to 2 minutes (from 15-18 minutes). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2562) [C++] Upload coverage data to codecov.io
Antoine Pitrou created ARROW-2562: - Summary: [C++] Upload coverage data to codecov.io Key: ARROW-2562 URL: https://issues.apache.org/jira/browse/ARROW-2562 Project: Apache Arrow Issue Type: Task Components: C++ Reporter: Antoine Pitrou Assignee: Antoine Pitrou ARROW-27 (upload coverage data to coveralls.io) has failed to move forward. We can try codecov.io, another free code coverage hosting service, instead. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2561) [C++] Crash in cuda-test shutdown with coverage enabled
Antoine Pitrou created ARROW-2561: - Summary: [C++] Crash in cuda-test shutdown with coverage enabled Key: ARROW-2561 URL: https://issues.apache.org/jira/browse/ARROW-2561 Project: Apache Arrow Issue Type: Bug Components: C++, GPU Affects Versions: 0.9.0 Reporter: Antoine Pitrou If I enable both CUDA and code coverage (using {{-DARROW_GENERATE_COVERAGE=on}}), {{cuda-test}} sometimes crashes at shutdown with the following message: {code} *** Error in `./build-test/debug/cuda-test': corrupted size vs. prev_size: 0x01612bb0 *** === Backtrace: = /lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7fc3d61e47e5] /lib/x86_64-linux-gnu/libc.so.6(+0x7e9dc)[0x7fc3d61eb9dc] /lib/x86_64-linux-gnu/libc.so.6(+0x81cde)[0x7fc3d61eecde] /lib/x86_64-linux-gnu/libc.so.6(__libc_malloc+0x54)[0x7fc3d61f1184] /home/antoine/arrow/cpp/build-test/debug/libarrow.so.10(+0x9350f3)[0x7fc3d5a510f3] /lib/x86_64-linux-gnu/libc.so.6(__cxa_finalize+0x9a)[0x7fc3d61a736a] /home/antoine/arrow/cpp/build-test/debug/libarrow.so.10(+0x3415e3)[0x7fc3d545d5e3] {code} (the CUDA tests themselves pass) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2566) [CI] Add codecov.io badge to README
Antoine Pitrou created ARROW-2566: - Summary: [CI] Add codecov.io badge to README Key: ARROW-2566 URL: https://issues.apache.org/jira/browse/ARROW-2566 Project: Apache Arrow Issue Type: Task Components: Continuous Integration Affects Versions: 0.9.0 Reporter: Antoine Pitrou Assignee: Antoine Pitrou -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2568) [Python] Expose thread pool size setting to Python, and deprecate "nthreads"
Antoine Pitrou created ARROW-2568: - Summary: [Python] Expose thread pool size setting to Python, and deprecate "nthreads" Key: ARROW-2568 URL: https://issues.apache.org/jira/browse/ARROW-2568 Project: Apache Arrow Issue Type: Improvement Components: Python Affects Versions: 0.9.0 Reporter: Antoine Pitrou Now that we have a global thread pool, we should: * use it in places where we currently require an explicit number of threads (with an additional {{use_threads}} argument to enable parallelism) * deprecate the now pointless {{nthreads}} argument * expose the thread pool capacity setting in Python -- This message was sent by Atlassian JIRA (v7.6.3#76005)
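A rough Python sketch of the proposed shape, using only the stdlib; the names set_cpu_count and parallel_map below are hypothetical stand-ins for illustration, not the actual pyarrow API:

```python
import concurrent.futures

_pool = None
_capacity = 4  # hypothetical default capacity

def set_cpu_count(n):
    """Resize the process-global thread pool (illustrative only)."""
    global _pool, _capacity
    _capacity = n
    _pool = None  # lazily rebuilt with the new capacity

def _get_pool():
    global _pool
    if _pool is None:
        _pool = concurrent.futures.ThreadPoolExecutor(max_workers=_capacity)
    return _pool

def parallel_map(func, items, use_threads=True):
    """use_threads toggles parallelism; no per-call nthreads argument."""
    if not use_threads:
        return [func(x) for x in items]
    return list(_get_pool().map(func, items))
```

Callers only choose whether to use threads; how many threads run is a single process-wide setting, which is the point of deprecating {{nthreads}}.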
[jira] [Created] (ARROW-2574) [CI] Collect and publish Python coverage
Antoine Pitrou created ARROW-2574: - Summary: [CI] Collect and publish Python coverage Key: ARROW-2574 URL: https://issues.apache.org/jira/browse/ARROW-2574 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration, Python Affects Versions: 0.9.0 Reporter: Antoine Pitrou Assignee: Antoine Pitrou Now that our Travis-CI setup is able to collect and publish C++ and Rust coverage, we should do the same for Python and Cython modules in pyarrow. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2588) [Plasma] Random unique ids always use the same seed
Antoine Pitrou created ARROW-2588: - Summary: [Plasma] Random unique ids always use the same seed Key: ARROW-2588 URL: https://issues.apache.org/jira/browse/ARROW-2588 Project: Apache Arrow Issue Type: Bug Components: Plasma (C++) Reporter: Antoine Pitrou Following GitHub PR #2039 (resolution to ARROW-2578), the random generator for random object ids is now using a constant default seed, meaning all processes will generate the same sequence of random ids: {code:java} $ python -c "from pyarrow import plasma; print(plasma.ObjectID.from_random())" ObjectID(d022e7d520f8e938a14e188c47308cfef5fff7f7) $ python -c "from pyarrow import plasma; print(plasma.ObjectID.from_random())" ObjectID(d022e7d520f8e938a14e188c47308cfef5fff7f7) {code} As a sidenote, the plasma test suite should ideally test for this. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
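The failure mode, and the obvious fix of seeding from OS entropy, can be demonstrated with the stdlib alone; the two random.Random instances below stand in for two separate plasma client processes, and random_object_id is a hypothetical stand-in for the 20-byte object id generator:

```python
import os
import random

def random_object_id(rng):
    """Hypothetical stand-in for plasma's 20-byte random object id."""
    return bytes(rng.randrange(256) for _ in range(20)).hex()

# The bug: a constant default seed makes every "process" emit the same ids
a, b = random.Random(0), random.Random(0)
assert random_object_id(a) == random_object_id(b)

# The fix: seed each generator from the OS entropy pool
c = random.Random(int.from_bytes(os.urandom(16), "big"))
d = random.Random(int.from_bytes(os.urandom(16), "big"))
assert random_object_id(c) != random_object_id(d)
```

The second pair of assertions is exactly what a plasma regression test for this could check.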
[jira] [Created] (ARROW-2589) [Python] test_parquet.py regression with Pandas 0.23.0
Antoine Pitrou created ARROW-2589: - Summary: [Python] test_parquet.py regression with Pandas 0.23.0 Key: ARROW-2589 URL: https://issues.apache.org/jira/browse/ARROW-2589 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.9.0 Reporter: Antoine Pitrou Assignee: Antoine Pitrou See e.g. https://travis-ci.org/apache/arrow/jobs/379652352#L3124. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2642) [Python] Fail building parquet binding on Windows
Antoine Pitrou created ARROW-2642: - Summary: [Python] Fail building parquet binding on Windows Key: ARROW-2642 URL: https://issues.apache.org/jira/browse/ARROW-2642 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.9.0 Reporter: Antoine Pitrou For some reason I get the following error. I'm not sure why Thrift is needed here: {code} -- Found the Parquet library: C:/Miniconda3/envs/arrow/Library/lib/parquet.lib -- THRIFT_HOME: -- Thrift compiler/libraries NOT found: (THRIFT_INCLUDE_DIR-NOTFOUND, THRIFT_ST ATIC_LIB-NOTFOUND). Looked in system search paths. -- Boost version: 1.66.0 -- Found the following Boost libraries: -- regex Added static library dependency boost_regex: C:/Miniconda3/envs/arrow/Library/li b/libboost_regex.lib Added static library dependency parquet: C:/Miniconda3/envs/arrow/Library/lib/pa rquet_static.lib CMake Error at C:/t/arrow/cpp/cmake_modules/BuildUtils.cmake:88 (message): No static or shared library provided for thrift Call Stack (most recent call first): CMakeLists.txt:376 (ADD_THIRDPARTY_LIB) {code} The {{thrift-cpp}} package from conda-forge is installed. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2643) [C++] Travis-CI build failure with cpp toolchain enabled
Antoine Pitrou created ARROW-2643: - Summary: [C++] Travis-CI build failure with cpp toolchain enabled Key: ARROW-2643 URL: https://issues.apache.org/jira/browse/ARROW-2643 Project: Apache Arrow Issue Type: Bug Components: C++, Continuous Integration Affects Versions: 0.9.0 Reporter: Antoine Pitrou This is a new failure, perhaps triggered by a conda-forge package update. See example at https://travis-ci.org/apache/arrow/jobs/385002355#L2235 {code} /usr/bin/ld: /home/travis/build/apache/arrow/cpp-toolchain/lib/libz.a(deflate.o): relocation R_X86_64_32S against `zcalloc' can not be used when making a shared object; recompile with -fPIC {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2660) [Python] Experiment with zero-copy pickling
Antoine Pitrou created ARROW-2660: - Summary: [Python] Experiment with zero-copy pickling Key: ARROW-2660 URL: https://issues.apache.org/jira/browse/ARROW-2660 Project: Apache Arrow Issue Type: Wish Components: Python Affects Versions: 0.9.0 Reporter: Antoine Pitrou PEP 574 has an implementation ready and a PyPI-available backport (at [https://pypi.org/project/pickle5/] ). Adding experimental support for it would allow for zero-copy pickling of Arrow arrays, columns, etc. I think it mainly involves implementing {{__reduce_ex__}} on the {{Buffer}} class, as described in [https://www.python.org/dev/peps/pep-0574/#producer-api] In addition, the consumer API added by PEP 574 could be used in Arrow's serialization layer, to avoid or minimize copies when serializing foreign objects. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
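The producer-side {{__reduce_ex__}} pattern from PEP 574 can be sketched with the stdlib (protocol 5 landed in Python 3.8, so no pickle5 backport is needed there); ZeroCopyArray is a hypothetical container, not pyarrow code:

```python
import pickle

class ZeroCopyArray:
    """Hypothetical container whose payload travels out-of-band (sketch)."""
    def __init__(self, data):
        self.data = bytearray(data)

    def __reduce_ex__(self, protocol):
        if protocol >= 5:
            # Expose the memory as a PickleBuffer instead of copying it
            # into the pickle stream.
            return (ZeroCopyArray, (pickle.PickleBuffer(self.data),))
        return (ZeroCopyArray, (bytes(self.data),))

arr = ZeroCopyArray(b"\x01\x02\x03")
bufs = []  # buffer_callback collects the out-of-band buffers
payload = pickle.dumps(arr, protocol=5, buffer_callback=bufs.append)
restored = pickle.loads(payload, buffers=bufs)
```

With buffer_callback set, the pickle stream carries only metadata; the consumer decides how (or whether) to copy each buffer, which is the "consumer API" half mentioned above.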
[jira] [Created] (ARROW-2644) [Python] parquet binding fails building on AppVeyor
Antoine Pitrou created ARROW-2644: - Summary: [Python] parquet binding fails building on AppVeyor Key: ARROW-2644 URL: https://issues.apache.org/jira/browse/ARROW-2644 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.9.0 Reporter: Antoine Pitrou Assignee: Antoine Pitrou This is a new issue (perhaps due to a new Cython version). See e.g. https://ci.appveyor.com/project/pitrou/arrow/build/1.0.509/job/dxdqcdk30kmiy6pd#L4291 Excerpt: {code} -- Running cmake --build for pyarrow C:\Program Files (x86)\CMake\bin\cmake.exe --build . --config release [1/8] cmd.exe /C "cd /D C:\projects\arrow\python\build\temp.win-amd64-3.6\Release && C:\Miniconda36-x64\envs\arrow\python.exe -m cython --cplus --working C:/projects/arrow/python --output-file C:/projects/arrow/python/build/temp.win-amd64-3.6/Release/_parquet.cpp C:/projects/arrow/python/pyarrow/_parquet.pyx" [2/8] cmd.exe /c [3/8] cmd.exe /C "cd /D C:\projects\arrow\python\build\temp.win-amd64-3.6\Release && C:\Miniconda36-x64\envs\arrow\python.exe -m cython --cplus --working C:/projects/arrow/python --output-file C:/projects/arrow/python/build/temp.win-amd64-3.6/Release/lib.cpp C:/projects/arrow/python/pyarrow/lib.pyx" [4/8] cmd.exe /c [5/8] C:\PROGRA~2\MIB055~1\2017\COMMUN~1\VC\Tools\MSVC\1414~1.264\bin\Hostx64\x64\cl.exe /TP -DARROW_EXPORTING -D_CRT_SECURE_NO_WARNINGS -D_parquet_EXPORTS -IC:\Miniconda36-x64\envs\arrow\lib\site-packages\numpy\core\include -IC:\Miniconda36-x64\envs\arrow\include -I..\..\..\src -IC:\Miniconda36-x64\envs\arrow\Library\include /bigobj /W3 /wd4800 /DWIN32 /D_WINDOWS /GR /EHsc /D_SILENCE_TR1_NAMESPACE_DEPRECATION_WARNING /WX /wd4190 /wd4293 /wd4800 /MD /O2 /Ob2 /DNDEBUG /showIncludes /FoCMakeFiles\_parquet.dir\_parquet.cpp.obj /FdCMakeFiles\_parquet.dir\ /FS -c _parquet.cpp FAILED: CMakeFiles/_parquet.dir/_parquet.cpp.obj C:\PROGRA~2\MIB055~1\2017\COMMUN~1\VC\Tools\MSVC\1414~1.264\bin\Hostx64\x64\cl.exe /TP -DARROW_EXPORTING -D_CRT_SECURE_NO_WARNINGS -D_parquet_EXPORTS 
-IC:\Miniconda36-x64\envs\arrow\lib\site-packages\numpy\core\include -IC:\Miniconda36-x64\envs\arrow\include -I..\..\..\src -IC:\Miniconda36-x64\envs\arrow\Library\include /bigobj /W3 /wd4800 /DWIN32 /D_WINDOWS /GR /EHsc /D_SILENCE_TR1_NAMESPACE_DEPRECATION_WARNING /WX /wd4190 /wd4293 /wd4800 /MD /O2 /Ob2 /DNDEBUG /showIncludes /FoCMakeFiles\_parquet.dir\_parquet.cpp.obj /FdCMakeFiles\_parquet.dir\ /FS -c _parquet.cpp Microsoft (R) C/C++ Optimizing Compiler Version 19.14.26428.1 for x64 Copyright (C) Microsoft Corporation. All rights reserved. _parquet.cpp(6790): error C2220: warning treated as error - no 'object' file generated _parquet.cpp(6790): warning C4244: 'argument': conversion from 'int64_t' to 'long', possible loss of data [6/8] C:\PROGRA~2\MIB055~1\2017\COMMUN~1\VC\Tools\MSVC\1414~1.264\bin\Hostx64\x64\cl.exe /TP -DARROW_EXPORTING -D_CRT_SECURE_NO_WARNINGS -Dlib_EXPORTS -IC:\Miniconda36-x64\envs\arrow\lib\site-packages\numpy\core\include -IC:\Miniconda36-x64\envs\arrow\include -I..\..\..\src -IC:\Miniconda36-x64\envs\arrow\Library\include /bigobj /W3 /wd4800 /DWIN32 /D_WINDOWS /GR /EHsc /D_SILENCE_TR1_NAMESPACE_DEPRECATION_WARNING /WX /wd4190 /wd4293 /wd4800 /MD /O2 /Ob2 /DNDEBUG /showIncludes /FoCMakeFiles\lib.dir\lib.cpp.obj /FdCMakeFiles\lib.dir\ /FS -c lib.cpp Microsoft (R) C/C++ Optimizing Compiler Version 19.14.26428.1 for x64 Copyright (C) Microsoft Corporation. All rights reserved. ninja: build stopped: subcommand failed. error: command 'C:\\Program Files (x86)\\CMake\\bin\\cmake.exe' failed with exit status 1 (arrow) C:\projects\arrow\python>set lastexitcode=1 {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2641) [C++] Investigate spurious memset() calls
Antoine Pitrou created ARROW-2641: - Summary: [C++] Investigate spurious memset() calls Key: ARROW-2641 URL: https://issues.apache.org/jira/browse/ARROW-2641 Project: Apache Arrow Issue Type: Improvement Components: C++ Affects Versions: 0.9.0 Reporter: Antoine Pitrou {{builder.cc}} has TODO statements of the form: {code:c++} // TODO(emkornfield) valgrind complains without this memset(data_->mutable_data(), 0, static_cast<size_t>(nbytes)); {code} Ideally we shouldn't have to zero-initialize a data buffer before writing to it. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2653) [C++] Refactor hash table support
Antoine Pitrou created ARROW-2653: - Summary: [C++] Refactor hash table support Key: ARROW-2653 URL: https://issues.apache.org/jira/browse/ARROW-2653 Project: Apache Arrow Issue Type: Task Components: C++ Affects Versions: 0.9.0 Reporter: Antoine Pitrou Currently our hash table support is scattered in several places: * {{compute/kernels/hash.cc}} * {{util/hash.h}} and {{util/hash.cc}} * {{builder.cc}} (in the DictionaryBuilder implementation) Perhaps we should have something like a type-parametered hash table class (perhaps backed by non-owned memory) with several primitives: * decide allocation size for a given number of items * lookup an item * insert an item * decide whether resizing is needed * resize to a new memory area * ... -- This message was sent by Atlassian JIRA (v7.6.3#76005)
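To make the proposal concrete, here is a minimal open-addressing table in Python exercising the primitives listed above (allocation sizing, lookup, insert, resize decision, resize). It is an illustration of the shape only, not the eventual C++ design:

```python
class OpenHashTable:
    """Sketch of the primitives a unified hash table class could expose."""
    LOAD_FACTOR = 0.75

    @staticmethod
    def allocation_size(n_items):
        # Decide allocation size for a given number of items:
        # smallest power of two keeping the load factor satisfied.
        size = 8
        while size * OpenHashTable.LOAD_FACTOR < n_items:
            size *= 2
        return size

    def __init__(self, capacity=8):
        self.slots = [None] * capacity  # power-of-two sized
        self.count = 0

    def _probe(self, key):
        # Linear probing; terminates because load factor stays < 1.
        i = hash(key) & (len(self.slots) - 1)
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) & (len(self.slots) - 1)
        return i

    def lookup(self, key):
        entry = self.slots[self._probe(key)]
        return entry[1] if entry else None

    def needs_resize(self):
        return self.count + 1 > len(self.slots) * self.LOAD_FACTOR

    def insert(self, key, value):
        if self.needs_resize():
            self._resize(len(self.slots) * 2)
        i = self._probe(key)
        if self.slots[i] is None:
            self.count += 1
        self.slots[i] = (key, value)

    def _resize(self, new_capacity):
        # Resize to a new memory area by rehashing live entries.
        old = [e for e in self.slots if e is not None]
        self.slots = [None] * new_capacity
        self.count = 0
        for k, v in old:
            self.insert(k, v)
```

A C++ version backed by non-owned memory would keep the same interface but take its slot storage from a caller-provided buffer.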
[jira] [Created] (ARROW-2740) [Python] Add address property to Buffer
Antoine Pitrou created ARROW-2740: - Summary: [Python] Add address property to Buffer Key: ARROW-2740 URL: https://issues.apache.org/jira/browse/ARROW-2740 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Antoine Pitrou Assignee: Antoine Pitrou This would allow getting the start address of the buffer's data. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
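For reference, the stdlib can already surface such a start address for any buffer exporter; a sketch of what the proposed {{address}} property would return, using ctypes rather than pyarrow:

```python
import ctypes

data = bytearray(b"arrow")
# Wrap the bytearray's memory without copying, then take its address.
c_view = (ctypes.c_char * len(data)).from_buffer(data)
address = ctypes.addressof(c_view)

# Reading back from the raw address yields the same bytes.
assert ctypes.string_at(address, len(data)) == b"arrow"
```

Exposing the address directly on Buffer avoids this ctypes detour when handing memory to foreign native code.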
[jira] [Created] (ARROW-2785) [C++] Crash in json-integration-test
Antoine Pitrou created ARROW-2785: - Summary: [C++] Crash in json-integration-test Key: ARROW-2785 URL: https://issues.apache.org/jira/browse/ARROW-2785 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Antoine Pitrou This is probably something I keep getting wrong when creating a new environment, but after creating a Python 3.7 conda environment and installing the tool chain, I get the following crash (apparently boost-related): {code} $ ./build-test/debug/json-integration-test [==] Running 2 tests from 1 test case. [--] Global test environment set-up. [--] 2 tests from TestJSONIntegration [ RUN ] TestJSONIntegration.ConvertAndValidate *** Error in `./build-test/debug/json-integration-test': munmap_chunk(): invalid pointer: 0x7ffc22542578 *** === Backtrace: = /lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7f4762f257e5] /lib/x86_64-linux-gnu/libc.so.6(cfree+0x1a8)[0x7f4762f32698] /home/antoine/miniconda3/envs/pyarrow37/lib/libstdc++.so.6(_ZNSsD1Ev+0x15)[0x7f476384cca5] ./build-test/debug/json-integration-test(_ZN5boost10filesystem4pathD1Ev+0x18)[0x694f4a] ./build-test/debug/json-integration-test[0x69205a] ./build-test/debug/json-integration-test(_ZN5arrow3ipc19TestJSONIntegration7mkstempEv+0x2c)[0x69599e] ./build-test/debug/json-integration-test(_ZN5arrow3ipc43TestJSONIntegration_ConvertAndValidate_Test8TestBodyEv+0x3b)[0x69210f] ./build-test/debug/json-integration-test(_ZN7testing8internal38HandleSehExceptionsInMethodIfSupportedINS_4TestEvEET0_PT_MS4_FS3_vEPKc+0x65)[0x8759da] ./build-test/debug/json-integration-test(_ZN7testing8internal35HandleExceptionsInMethodIfSupportedINS_4TestEvEET0_PT_MS4_FS3_vEPKc+0x5a)[0x86f65d] ./build-test/debug/json-integration-test(_ZN7testing4Test3RunEv+0xd5)[0x853697] ./build-test/debug/json-integration-test(_ZN7testing8TestInfo3RunEv+0x105)[0x853fef] ./build-test/debug/json-integration-test(_ZN7testing8TestCase3RunEv+0xf4)[0x8546f8] 
./build-test/debug/json-integration-test(_ZN7testing8internal12UnitTestImpl11RunAllTestsEv+0x2ac)[0x85b666] ./build-test/debug/json-integration-test(_ZN7testing8internal38HandleSehExceptionsInMethodIfSupportedINS0_12UnitTestImplEbEET0_PT_MS4_FS3_vEPKc+0x65)[0x876eb7] ./build-test/debug/json-integration-test(_ZN7testing8internal35HandleExceptionsInMethodIfSupportedINS0_12UnitTestImplEbEET0_PT_MS4_FS3_vEPKc+0x5a)[0x870327] ./build-test/debug/json-integration-test(_ZN7testing8UnitTest3RunEv+0xc6)[0x85a128] ./build-test/debug/json-integration-test(_Z13RUN_ALL_TESTSv+0x11)[0x6945e6] ./build-test/debug/json-integration-test(main+0xfb)[0x693a2b] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7f4762ece830] ./build-test/debug/json-integration-test(_start+0x29)[0x68b4a9] {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2534) [C++] libarrow.so leaks zlib symbols
Antoine Pitrou created ARROW-2534: - Summary: [C++] libarrow.so leaks zlib symbols Key: ARROW-2534 URL: https://issues.apache.org/jira/browse/ARROW-2534 Project: Apache Arrow Issue Type: Bug Components: C++ Affects Versions: 0.9.0 Reporter: Antoine Pitrou I get the following here: {code:bash} $ nm -D -C /home/antoine/miniconda3/envs/pyarrow/lib/libarrow.so.0.0.0 | \grep ' T ' | \grep -v arrow 0025bc8c T adler32_z 0025c4c9 T crc32_z 002ad638 T _fini 00078ab8 T _init {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2514) [Python] Inferring / converting nested Numpy array is very slow
Antoine Pitrou created ARROW-2514: - Summary: [Python] Inferring / converting nested Numpy array is very slow Key: ARROW-2514 URL: https://issues.apache.org/jira/browse/ARROW-2514 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.9.0 Reporter: Antoine Pitrou Converting a nested Numpy array walks over the Numpy data as Python objects, even if the dtype is not "object". This makes it pointlessly slow compared to the non-nested case, and even the nested Python list case: {code:python} >>> %%timeit data = list(range(1)) ...:pa.array(data) ...: 746 µs ± 8.36 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) >>> %%timeit data = np.arange(1) ...:pa.array(data) ...: 81.1 µs ± 57.7 ns per loop (mean ± std. dev. of 7 runs, 1 loops each) >>> %%timeit data = [np.arange(1)] ...:pa.array(data) ...: 3.39 ms ± 6.27 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2532) [C++] Add chunked builder classes
Antoine Pitrou created ARROW-2532: - Summary: [C++] Add chunked builder classes Key: ARROW-2532 URL: https://issues.apache.org/jira/browse/ARROW-2532 Project: Apache Arrow Issue Type: Improvement Components: C++ Affects Versions: 0.9.0 Reporter: Antoine Pitrou I think it would be useful to have chunked builders for list, string and binary types. A chunked builder would produce a chunked array as output, circumventing the 32-bit offset limit of those types. There's some special-casing scattered around our Numpy conversion routines right now. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
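A sketch of the chunking logic, with a deliberately tiny limit so the rollover is visible; the class name, the byte-counting strategy and the list-of-lists output are all illustrative, not the eventual C++ API:

```python
class ChunkedStringBuilder:
    """Sketch: start a new chunk before 32-bit offsets would overflow."""
    OFFSET_LIMIT = 2**31 - 1  # max offset a 32-bit string/binary array allows

    def __init__(self, limit=OFFSET_LIMIT):
        self.limit = limit
        self.chunks = [[]]
        self.current_bytes = 0

    def append(self, s):
        b = s.encode("utf-8")
        if self.current_bytes + len(b) > self.limit and self.chunks[-1]:
            # Close the current chunk and reset the offset counter.
            self.chunks.append([])
            self.current_bytes = 0
        self.chunks[-1].append(b)
        self.current_bytes += len(b)

    def finish(self):
        # Stands in for producing a ChunkedArray.
        return self.chunks
```

Each chunk's cumulative byte count stays under the limit, so every chunk remains representable with 32-bit offsets.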
[jira] [Created] (ARROW-2522) [C++] Version shared library files
Antoine Pitrou created ARROW-2522: - Summary: [C++] Version shared library files Key: ARROW-2522 URL: https://issues.apache.org/jira/browse/ARROW-2522 Project: Apache Arrow Issue Type: Improvement Components: C++ Affects Versions: 0.9.0 Reporter: Antoine Pitrou We should version installed shared library files (SO under Unix, DLL under Windows) to disambiguate incompatible ABI versions. CMake provides support for that: http://pusling.com/blog/?p=352 https://cmake.org/cmake/help/v3.11/prop_tgt/SOVERSION.html -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2033) pa.array() doesn't work with iterators
Antoine Pitrou created ARROW-2033: - Summary: pa.array() doesn't work with iterators Key: ARROW-2033 URL: https://issues.apache.org/jira/browse/ARROW-2033 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.9.0 Reporter: Antoine Pitrou pa.array handles iterables fine, but not iterators if size isn't passed: {code:java} >>> arr = pa.array(range(5)) >>> arr [ 0, 1, 2, 3, 4 ] >>> arr = pa.array(iter(range(5))) >>> arr [ NA, NA, NA, NA, NA ] {code} This is because InferArrowSize() first exhausts the iterator. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
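One possible fix is to materialize one-shot iterators before size inference, so the data can still be traversed afterwards; a Python sketch (infer_size_safe is a hypothetical helper, not pyarrow code):

```python
import collections.abc

def infer_size_safe(obj, size=None):
    """Hypothetical helper: make one-shot iterators safe for size inference."""
    if size is not None:
        return obj, size
    if isinstance(obj, collections.abc.Iterator):
        # An iterator can only be walked once, so materialize it before
        # anything like InferArrowSize() needs to traverse the data again.
        obj = list(obj)
    return obj, len(obj)
```

Lists, ranges and other re-iterable sequences pass through untouched; only genuine iterators pay the materialization cost.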
[jira] [Created] (ARROW-2054) Compilation warnings
Antoine Pitrou created ARROW-2054: - Summary: Compilation warnings Key: ARROW-2054 URL: https://issues.apache.org/jira/browse/ARROW-2054 Project: Apache Arrow Issue Type: Task Components: C++ Affects Versions: 0.8.0 Reporter: Antoine Pitrou I suppose this may vary depending on the compiler, but I get the following warnings with gcc 4.9: {code} /home/antoine/arrow/cpp/src/plasma/fling.cc: In function ‘int send_fd(int, int)’: /home/antoine/arrow/cpp/src/plasma/fling.cc:46:50: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing] *reinterpret_cast(CMSG_DATA(header)) = fd; ^ /home/antoine/arrow/cpp/src/arrow/python/io.cc: In member function ‘virtual arrow::Status arrow::py::PyReadableFile::Read(int64_t, std::shared_ptr*)’: /home/antoine/arrow/cpp/src/arrow/python/io.cc:153:60: warning: ‘bytes_obj’ may be used uninitialized in this function [-Wmaybe-uninitialized] Py_DECREF(bytes_obj); ^ /home/antoine/arrow/cpp/src/arrow/python/io.cc: In member function ‘virtual arrow::Status arrow::py::PyReadableFile::Read(int64_t, int64_t*, void*)’: /home/antoine/arrow/cpp/src/arrow/python/io.cc:141:60: warning: ‘bytes_obj’ may be used uninitialized in this function [-Wmaybe-uninitialized] Py_DECREF(bytes_obj); ^ /home/antoine/arrow/cpp/src/arrow/python/io.cc: In member function ‘virtual arrow::Status arrow::py::PyReadableFile::GetSize(int64_t*)’: /home/antoine/arrow/cpp/src/arrow/python/io.cc:187:20: warning: ‘file_size’ may be used uninitialized in this function [-Wmaybe-uninitialized] *size = file_size; ^ /home/antoine/arrow/cpp/src/arrow/python/io.cc:46:65: warning: ‘current_position’ may be used uninitialized in this function [-Wmaybe-uninitialized] const_cast (argspec), args...); ^ /home/antoine/arrow/cpp/src/arrow/python/io.cc:175:11: note: ‘current_position’ was declared here int64_t current_position; ^ /home/antoine/arrow/cpp/src/arrow/ipc/json-internal.cc: In function ‘arrow::Status arrow::ipc::internal::json::GetField(const Value&, 
const arrow::ipc::DictionaryMemo*, std::shared_ptr*)’: /home/antoine/arrow/cpp/src/arrow/ipc/json-internal.cc:876:81: warning: ‘dictionary_id’ may be used uninitialized in this function [-Wmaybe-uninitialized] RETURN_NOT_OK(dictionary_memo->GetDictionary(dictionary_id, )); ^ /home/antoine/arrow/cpp/src/arrow/ipc/json-internal.cc: In function ‘arrow::Status arrow::ipc::internal::json::ReadSchema(const Value&, arrow::MemoryPool*, std::shared_ptr*)’: /home/antoine/arrow/cpp/src/arrow/ipc/json-internal.cc:1354:80: warning: ‘dictionary_id’ may be used uninitialized in this function [-Wmaybe-uninitialized] RETURN_NOT_OK(dictionary_memo->AddDictionary(dictionary_id, dictionary)); ^ /home/antoine/arrow/cpp/src/arrow/ipc/json-internal.cc:1349:13: note: ‘dictionary_id’ was declared here int64_t dictionary_id; ^ In file included from /home/antoine/arrow/cpp/src/arrow/api.h:25:0, from /home/antoine/arrow/cpp/src/arrow/python/builtin_convert.cc:29: /home/antoine/arrow/cpp/src/arrow/builder.h: In member function ‘arrow::Status arrow::py::TimestampConverter::AppendItem(const arrow::py::OwnedRef&)’: /home/antoine/arrow/cpp/src/arrow/builder.h:284:5: warning: ‘t’ may be used uninitialized in this function [-Wmaybe-uninitialized] raw_data_[length_++] = val; ^ /home/antoine/arrow/cpp/src/arrow/python/builtin_convert.cc:576:13: note: ‘t’ was declared here int64_t t; ^ {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2073) [Python] Create StructArray from sequence of tuples given a known data type
Antoine Pitrou created ARROW-2073: - Summary: [Python] Create StructArray from sequence of tuples given a known data type Key: ARROW-2073 URL: https://issues.apache.org/jira/browse/ARROW-2073 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Antoine Pitrou Assignee: Antoine Pitrou Following ARROW-1705, we should support calling {{pa.array}} with a sequence of tuples, presuming a struct type is passed for the {{type}} parameter. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
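The conversion amounts to pivoting row tuples into per-field columns according to the declared struct type; a stdlib sketch with a hypothetical helper name:

```python
def tuples_to_columns(rows, field_names):
    """Sketch: map a sequence of tuples onto named struct fields (columnar)."""
    columns = {name: [] for name in field_names}
    for row in rows:
        if len(row) != len(field_names):
            raise ValueError("tuple length does not match struct type")
        for name, value in zip(field_names, row):
            columns[name].append(value)
    return columns
```

Each resulting column would then be converted with the corresponding field's type, as a StructArray builder does child by child.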
[jira] [Created] (ARROW-2074) [Python] Allow type inference for struct arrays
Antoine Pitrou created ARROW-2074: - Summary: [Python] Allow type inference for struct arrays Key: ARROW-2074 URL: https://issues.apache.org/jira/browse/ARROW-2074 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Antoine Pitrou Assignee: Antoine Pitrou Support inferring a struct type in a {{pa.array}} call, if a sequence of dicts (or dict of sequences?) is given. Of course, this could mean that the wrong field order may be inferred, though on Python 3.6+ dicts retain ordering until the first deletion. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
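Inference could take the union of keys across all rows, keeping first-seen order and remembering a value type per field; a hypothetical sketch of that pass:

```python
def infer_struct_fields(rows):
    """Sketch: infer struct field names (first-seen order) and value types
    from a sequence of dicts."""
    fields = {}
    for row in rows:
        for key, value in row.items():
            # Keep the first non-conflicting type seen for each field.
            fields.setdefault(key, type(value))
    return fields
```

A real implementation would also reconcile conflicting value types and handle missing keys as nulls; this sketch only shows the ordering concern from the message above.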
[jira] [Created] (ARROW-2067) "pip install" doesn't work from source tree
Antoine Pitrou created ARROW-2067: - Summary: "pip install" doesn't work from source tree Key: ARROW-2067 URL: https://issues.apache.org/jira/browse/ARROW-2067 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.8.0 Reporter: Antoine Pitrou It seems that {{pip install .}} first copies the build dir into a temporary directory, and {{setuptools_scm}} then fails grabbing the git version from that location. AFAIR {{versioneer}} doesn't have that issue. {code:bash} $ pip install . Processing /home/antoine/arrow/python Complete output from command python setup.py egg_info: Traceback (most recent call last): File "", line 1, in File "/tmp/pip-v_mucrpj-build/setup.py", line 456, in url="https://arrow.apache.org/; File "/home/antoine/miniconda3/envs/pyarrow/lib/python3.6/site-packages/setuptools/__init__.py", line 129, in setup return distutils.core.setup(**attrs) File "/home/antoine/miniconda3/envs/pyarrow/lib/python3.6/distutils/core.py", line 108, in setup _setup_distribution = dist = klass(attrs) File "/home/antoine/miniconda3/envs/pyarrow/lib/python3.6/site-packages/setuptools/dist.py", line 333, in __init__ _Distribution.__init__(self, attrs) File "/home/antoine/miniconda3/envs/pyarrow/lib/python3.6/distutils/dist.py", line 281, in __init__ self.finalize_options() File "/home/antoine/miniconda3/envs/pyarrow/lib/python3.6/site-packages/setuptools/dist.py", line 476, in finalize_options ep.load()(self, ep.name, value) File "/tmp/pip-v_mucrpj-build/.eggs/setuptools_scm-1.15.7-py3.6.egg/setuptools_scm/integration.py", line 22, in version_keyword dist.metadata.version = get_version(**value) File "/tmp/pip-v_mucrpj-build/.eggs/setuptools_scm-1.15.7-py3.6.egg/setuptools_scm/__init__.py", line 119, in get_version parsed_version = _do_parse(root, parse) File "/tmp/pip-v_mucrpj-build/.eggs/setuptools_scm-1.15.7-py3.6.egg/setuptools_scm/__init__.py", line 97, in _do_parse "use git+https://github.com/user/proj.git#egg=proj; % root) LookupError: 
setuptools-scm was unable to detect version for '/tmp/pip-v_mucrpj-build'. Make sure you're either building from a fully intact git repository or PyPI tarballs. Most other sources (such as GitHub's tarballs, a git checkout without the .git folder) don't contain the necessary metadata and will not work. For example, if you're using pip, instead of https://github.com/user/proj/archive/master.zip use git+https://github.com/user/proj.git#egg=proj Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-v_mucrpj-build/ {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2072) [Python] decimal128.byte_width crashes
Antoine Pitrou created ARROW-2072: - Summary: [Python] decimal128.byte_width crashes Key: ARROW-2072 URL: https://issues.apache.org/jira/browse/ARROW-2072 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.8.0 Reporter: Antoine Pitrou Assignee: Antoine Pitrou {code:bash} $ python -c "import pyarrow as pa; ty = pa.decimal128(20, 7); print(ty.byte_width)" Segmentation fault (core dumped) {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2052) Unify OwnedRef and ScopedRef
Antoine Pitrou created ARROW-2052: - Summary: Unify OwnedRef and ScopedRef Key: ARROW-2052 URL: https://issues.apache.org/jira/browse/ARROW-2052 Project: Apache Arrow Issue Type: Task Components: Python Affects Versions: 0.8.0 Reporter: Antoine Pitrou Currently {{OwnedRef}} and {{ScopedRef}} have similar semantics with small differences. Furthermore, the naming distinction isn't obvious. I propose to unify them as a single {{OwnedRef}} class with the following characteristics: - doesn't take the GIL automatically - has a {{release()}} method that decrefs the pointer (and sets the internal copy to NULL) before returning it - has a {{detach()}} method that returns the pointer (and sets the internal copy to NULL) without decrefing it For the rare situations where an {{OwnedRef}} may be destroyed with the GIL released, an {{OwnedRefNoGIL}} derived class would also be provided (the naming scheme follows Cython here). Opinions / comments? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2092) [Python] Enhance benchmark suite
Antoine Pitrou created ARROW-2092: - Summary: [Python] Enhance benchmark suite Key: ARROW-2092 URL: https://issues.apache.org/jira/browse/ARROW-2092 Project: Apache Arrow Issue Type: Improvement Components: Python Affects Versions: 0.8.0 Reporter: Antoine Pitrou Assignee: Antoine Pitrou We need to test more operations in the ASV-based benchmarks suite. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2111) [C++] Linting could be faster
Antoine Pitrou created ARROW-2111: - Summary: [C++] Linting could be faster Key: ARROW-2111 URL: https://issues.apache.org/jira/browse/ARROW-2111 Project: Apache Arrow Issue Type: Improvement Components: C++ Affects Versions: 0.8.0 Reporter: Antoine Pitrou Currently {{make lint}} style-checks C++ files sequentially (by calling {{cpplint}}). We could instead style-check those files in parallel. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
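A minimal sketch of the parallel approach (hedged: the real {{make lint}} wiring and cpplint flags differ, and {{lint_files}} plus the stand-in checker command are hypothetical names for illustration):

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

def lint_files(paths, checker_cmd, max_workers=8):
    """Run checker_cmd (an argv prefix, e.g. ["cpplint"]) once per path,
    in parallel, and return the paths whose check exited non-zero."""
    def check(path):
        proc = subprocess.run(checker_cmd + [path],
                              stdout=subprocess.DEVNULL,
                              stderr=subprocess.DEVNULL)
        return path, proc.returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(check, paths))
    return [path for path, code in results if code != 0]

# Demo with a stand-in checker (a tiny Python one-liner) instead of cpplint:
demo_cmd = [sys.executable, "-c",
            "import sys; sys.exit(0 if sys.argv[1].endswith('.cc') else 1)"]
print(lint_files(["a.cc", "b.h"], demo_cmd))  # ['b.h']
```

Threads suffice here because the work happens in child processes; the GIL is not a bottleneck.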
[jira] [Created] (ARROW-2134) [CI] Make Travis commit inspection more robust
Antoine Pitrou created ARROW-2134: - Summary: [CI] Make Travis commit inspection more robust Key: ARROW-2134 URL: https://issues.apache.org/jira/browse/ARROW-2134 Project: Apache Arrow Issue Type: Task Components: Continuous Integration Reporter: Antoine Pitrou Assignee: Antoine Pitrou See [https://github.com/apache/arrow/pull/1586#issuecomment-364857558] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2138) [C++] Have FatalLog abort instead of exiting
Antoine Pitrou created ARROW-2138: - Summary: [C++] Have FatalLog abort instead of exiting Key: ARROW-2138 URL: https://issues.apache.org/jira/browse/ARROW-2138 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Antoine Pitrou Not sure this is desirable, since {{util/logging.h}} was taken from glog, but the various debug checks currently call {{std::exit(1)}} on failure. This is a clean exit (though with an error code) and therefore doesn't trigger the usual debugging tools such as gdb or Python's faulthandler. By replacing it with something like {{std::abort()}}, the exit would be recognized as a process crash. Thoughts? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
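The exit-vs-crash distinction above can be illustrated with a Python analogue (a sketch, not the proposed C++ change itself): a clean {{sys.exit(1)}} yields an ordinary status code, while {{os.abort()}} raises SIGABRT, which the OS and debugging tools report as a crash.

```python
# sys.exit() terminates with an ordinary status code, while os.abort()
# raises SIGABRT -- the OS (and debuggers) treat the latter as a crash.
import signal
import subprocess
import sys

clean = subprocess.run([sys.executable, "-c", "import sys; sys.exit(1)"])
crashed = subprocess.run([sys.executable, "-c", "import os; os.abort()"])

print("clean exit status:", clean.returncode)    # 1
print("abort exit status:", crashed.returncode)  # negative SIGABRT on POSIX
```

This mirrors why a debugger attached to the process only catches the second case.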
[jira] [Created] (ARROW-2142) [Python] Conversion from Numpy struct array unimplemented
Antoine Pitrou created ARROW-2142: - Summary: [Python] Conversion from Numpy struct array unimplemented Key: ARROW-2142 URL: https://issues.apache.org/jira/browse/ARROW-2142 Project: Apache Arrow Issue Type: Improvement Components: Python Affects Versions: 0.8.0 Reporter: Antoine Pitrou {code:python} >>> arr = np.array([(1.5,)], dtype=np.dtype([('x', np.float32)])) >>> arr array([(1.5,)], dtype=[('x', '<f4')]) >>> arr[0] (1.5,) >>> arr['x'] array([1.5], dtype=float32) >>> arr['x'][0] 1.5 >>> pa.array(arr, type=pa.struct([pa.field('x', pa.float32())])) Traceback (most recent call last): File "<stdin>", line 1, in <module> pa.array(arr, type=pa.struct([pa.field('x', pa.float32())])) File "array.pxi", line 177, in pyarrow.lib.array File "error.pxi", line 77, in pyarrow.lib.check_status File "error.pxi", line 85, in pyarrow.lib.check_status ArrowNotImplementedError: /home/antoine/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:1585 code: converter.Convert() NumPyConverter doesn't implement conversion. {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2140) [Python] Conversion from Numpy float16 array unimplemented
Antoine Pitrou created ARROW-2140: - Summary: [Python] Conversion from Numpy float16 array unimplemented Key: ARROW-2140 URL: https://issues.apache.org/jira/browse/ARROW-2140 Project: Apache Arrow Issue Type: Improvement Components: Python Affects Versions: 0.8.0 Reporter: Antoine Pitrou {code:python} >>> arr = np.array([1.5], dtype=np.float16) >>> pa.array(arr) Traceback (most recent call last): File "<stdin>", line 1, in <module> pa.array(arr) File "array.pxi", line 177, in pyarrow.lib.array File "array.pxi", line 84, in pyarrow.lib._ndarray_to_array File "public-api.pxi", line 158, in pyarrow.lib.pyarrow_wrap_array KeyError: 10 {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2141) [Python] Conversion from Numpy object array to varsize binary unimplemented
Antoine Pitrou created ARROW-2141: - Summary: [Python] Conversion from Numpy object array to varsize binary unimplemented Key: ARROW-2141 URL: https://issues.apache.org/jira/browse/ARROW-2141 Project: Apache Arrow Issue Type: Improvement Components: Python Affects Versions: 0.8.0 Reporter: Antoine Pitrou {code:python} >>> arr = np.array([b'xx'], dtype=np.object) >>> pa.array(arr, type=pa.binary(2)) [ b'xx' ] >>> pa.array(arr, type=pa.binary()) Traceback (most recent call last): File "", line 1, in pa.array(arr, type=pa.binary()) File "array.pxi", line 177, in pyarrow.lib.array File "error.pxi", line 77, in pyarrow.lib.check_status File "error.pxi", line 85, in pyarrow.lib.check_status ArrowNotImplementedError: /home/antoine/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:1585 code: converter.Convert() /home/antoine/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:1098 code: compute::Cast(, *arr, type_, options, ) /home/antoine/arrow/cpp/src/arrow/compute/kernels/cast.cc:1022 code: Cast(ctx, Datum(array.data()), out_type, options, _out) /home/antoine/arrow/cpp/src/arrow/compute/kernels/cast.cc:1009 code: GetCastFunction(*value.type(), out_type, options, ) No cast implemented from binary to binary {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2147) [Python] Type inference doesn't work on lists of Numpy arrays
Antoine Pitrou created ARROW-2147: - Summary: [Python] Type inference doesn't work on lists of Numpy arrays Key: ARROW-2147 URL: https://issues.apache.org/jira/browse/ARROW-2147 Project: Apache Arrow Issue Type: Improvement Components: Python Affects Versions: 0.8.0 Reporter: Antoine Pitrou {code:python} >>> arr = np.int16([2, 3, 4]) >>> pa.array(arr) [ 2, 3, 4 ] >>> pa.array([arr]) Traceback (most recent call last): File "", line 1, in pa.array([arr]) File "array.pxi", line 181, in pyarrow.lib.array File "array.pxi", line 26, in pyarrow.lib._sequence_to_array File "error.pxi", line 77, in pyarrow.lib.check_status ArrowInvalid: /home/antoine/arrow/cpp/src/arrow/python/builtin_convert.cc:964 code: InferArrowType(seq, _type) /home/antoine/arrow/cpp/src/arrow/python/builtin_convert.cc:321 code: seq_visitor.Visit(obj) /home/antoine/arrow/cpp/src/arrow/python/builtin_convert.cc:195 code: VisitElem(ref, level) Error inferring Arrow data type for collection of Python objects. Got Python object of type ndarray but can only handle these types: bool, float, integer, date, datetime, bytes, unicode {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2150) [Python] array equality defaults to identity
Antoine Pitrou created ARROW-2150: - Summary: [Python] array equality defaults to identity Key: ARROW-2150 URL: https://issues.apache.org/jira/browse/ARROW-2150 Project: Apache Arrow Issue Type: Improvement Components: Python Affects Versions: 0.8.0 Reporter: Antoine Pitrou I'm not sure this is deliberate, but it doesn't look very desirable to me: {code} >>> pa.array([1,2,3], type=pa.int32()) == pa.array([1,2,3], type=pa.int32()) False {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2148) [Python] to_pandas() on struct array returns object array
Antoine Pitrou created ARROW-2148: - Summary: [Python] to_pandas() on struct array returns object array Key: ARROW-2148 URL: https://issues.apache.org/jira/browse/ARROW-2148 Project: Apache Arrow Issue Type: Bug Reporter: Antoine Pitrou This should probably return a Numpy struct array instead: {code:python} >>> arr = pa.array([{'a': 1, 'b': 2.5}, {'a': 2, 'b': 3.5}], ... type=pa.struct([pa.field('a', pa.int32()), pa.field('b', pa.float64())])) >>> arr.type StructType(struct<a: int32, b: double>) >>> arr.to_pandas() array([{'a': 1, 'b': 2.5}, {'a': 2, 'b': 3.5}], dtype=object) {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2149) [Python] reorganize test_convert_pandas.py
Antoine Pitrou created ARROW-2149: - Summary: [Python] reorganize test_convert_pandas.py Key: ARROW-2149 URL: https://issues.apache.org/jira/browse/ARROW-2149 Project: Apache Arrow Issue Type: Task Components: Python Affects Versions: 0.8.0 Reporter: Antoine Pitrou Assignee: Antoine Pitrou {{test_convert_pandas.py}} is getting painful to navigate. We should reorganize the tests into various classes / categories. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2151) [Python] Error when converting from list of uint64 arrays
Antoine Pitrou created ARROW-2151: - Summary: [Python] Error when converting from list of uint64 arrays Key: ARROW-2151 URL: https://issues.apache.org/jira/browse/ARROW-2151 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.8.0 Reporter: Antoine Pitrou {code:python} >>> pa.array(np.uint64([0,1,2]), type=pa.uint64()) [ 0, 1, 2 ] >>> pa.array([np.uint64([0,1,2])], type=pa.list_(pa.uint64())) Traceback (most recent call last): File "", line 1, in pa.array([np.uint64([0,1,2])], type=pa.list_(pa.uint64())) File "array.pxi", line 181, in pyarrow.lib.array File "array.pxi", line 36, in pyarrow.lib._sequence_to_array File "error.pxi", line 98, in pyarrow.lib.check_status ArrowException: Unknown error: /home/antoine/arrow/cpp/src/arrow/python/builtin_convert.cc:979 code: AppendPySequence(seq, size, real_type, builder.get()) /home/antoine/arrow/cpp/src/arrow/python/builtin_convert.cc:402 code: static_cast(this)->AppendSingle(ref.obj()) /home/antoine/arrow/cpp/src/arrow/python/builtin_convert.cc:402 code: static_cast (this)->AppendSingle(ref.obj()) /home/antoine/arrow/cpp/src/arrow/python/builtin_convert.cc:542 code: CheckPyError() an integer is required {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2155) [Python] pa.frombuffer(bytearray) returns immutable Buffer
Antoine Pitrou created ARROW-2155: - Summary: [Python] pa.frombuffer(bytearray) returns immutable Buffer Key: ARROW-2155 URL: https://issues.apache.org/jira/browse/ARROW-2155 Project: Apache Arrow Issue Type: Improvement Components: Python Affects Versions: 0.8.0 Reporter: Antoine Pitrou I'd expect it to return a mutable buffer: {code:python} >>> pa.frombuffer(bytearray(10)).is_mutable False {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
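The expected behaviour mirrors the stdlib buffer protocol, where {{memoryview}} already distinguishes writable sources (bytearray) from read-only ones (bytes) — a minimal sketch:

```python
# The buffer protocol reports mutability of the underlying object:
# a bytearray exposes a writable buffer, bytes an immutable one.
writable = memoryview(bytearray(10))
readonly = memoryview(b"0123456789")

print(writable.readonly)  # False
print(readonly.readonly)  # True

# Writing through the writable view mutates the underlying bytearray.
writable[0] = 42
print(writable[0])  # 42
```

A Buffer built over a bytearray could reasonably propagate the same writability flag.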
[jira] [Created] (ARROW-2154) [Python] __eq__ unimplemented on Buffer
Antoine Pitrou created ARROW-2154: - Summary: [Python] __eq__ unimplemented on Buffer Key: ARROW-2154 URL: https://issues.apache.org/jira/browse/ARROW-2154 Project: Apache Arrow Issue Type: Improvement Components: Python Affects Versions: 0.8.0 Reporter: Antoine Pitrou Having to call {{equals()}} is un-Pythonic: {code:python} >>> pa.frombuffer(b'foo') == pa.frombuffer(b'foo') False >>> pa.frombuffer(b'foo').equals(pa.frombuffer(b'foo')) True {code} Same for many other pyarrow types, incidentally. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
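For context, this is default Python data-model behaviour: without {{__eq__}}, comparison falls back to identity. A minimal sketch of the delegation pattern being asked for ({{Plain}} and {{Comparable}} are hypothetical stand-ins for a Buffer-like wrapper):

```python
class Plain:
    # No __eq__ defined: == falls back to object identity.
    def __init__(self, data):
        self.data = data

class Comparable(Plain):
    def __eq__(self, other):
        if not isinstance(other, Comparable):
            return NotImplemented
        return self.data == other.data   # stand-in for self.equals(other)

    # Defining __eq__ sets __hash__ to None; restore identity hashing
    # explicitly so instances stay usable in sets and dicts.
    __hash__ = object.__hash__

print(Plain(b"foo") == Plain(b"foo"))            # False: identity only
print(Comparable(b"foo") == Comparable(b"foo"))  # True: contents compared
```

Returning {{NotImplemented}} for foreign types keeps comparisons against unrelated objects well-behaved.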
[jira] [Created] (ARROW-2156) [CI] Isolate Sphinx dependencies
Antoine Pitrou created ARROW-2156: - Summary: [CI] Isolate Sphinx dependencies Key: ARROW-2156 URL: https://issues.apache.org/jira/browse/ARROW-2156 Project: Apache Arrow Issue Type: Task Components: Continuous Integration Affects Versions: 0.8.0 Reporter: Antoine Pitrou Assignee: Antoine Pitrou In the Travis Python test script, we always install the documentation dependencies. We should only install them when building the docs, since they are non-trivial and may take time to fetch. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2108) [Python] Update instructions for ASV
Antoine Pitrou created ARROW-2108: - Summary: [Python] Update instructions for ASV Key: ARROW-2108 URL: https://issues.apache.org/jira/browse/ARROW-2108 Project: Apache Arrow Issue Type: Task Components: Python Affects Versions: 0.8.0 Reporter: Antoine Pitrou Assignee: Antoine Pitrou Now that PR [https://github.com/airspeed-velocity/asv/pull/611] has been merged, we don't need to advertise our fork anymore. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2193) [Plasma] plasma_store forks endlessly
Antoine Pitrou created ARROW-2193: - Summary: [Plasma] plasma_store forks endlessly Key: ARROW-2193 URL: https://issues.apache.org/jira/browse/ARROW-2193 Project: Apache Arrow Issue Type: Bug Components: Plasma (C++) Reporter: Antoine Pitrou I'm not sure why, but when I run the pyarrow test suite (for example {{py.test pyarrow/tests/test_plasma.py}}), plasma_store forks endlessly: {code:bash} $ ps fuwww USER PID %CPU %MEMVSZ RSS TTY STAT START TIME COMMAND [...] antoine 27869 12.0 0.4 863208 68976 pts/7S13:41 0:01 /home/antoine/miniconda3/envs/pyarrow/bin/python /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 -m 1 antoine 27885 13.0 0.4 863076 68560 pts/7S13:41 0:01 \_ /home/antoine/miniconda3/envs/pyarrow/bin/python /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 -m 1 antoine 27901 12.1 0.4 863076 68320 pts/7S13:41 0:01 \_ /home/antoine/miniconda3/envs/pyarrow/bin/python /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 -m 1 antoine 27920 13.6 0.4 863208 68868 pts/7S13:41 0:01 \_ /home/antoine/miniconda3/envs/pyarrow/bin/python /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 -m 1 [etc.] {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2172) [Python] Incorrect conversion from Numpy array when stride % itemsize != 0
Antoine Pitrou created ARROW-2172: - Summary: [Python] Incorrect conversion from Numpy array when stride % itemsize != 0 Key: ARROW-2172 URL: https://issues.apache.org/jira/browse/ARROW-2172 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.8.0 Reporter: Antoine Pitrou Assignee: Antoine Pitrou In the example below, the input array has a stride that's not a multiple of the itemsize: {code:python} >>> data = np.array([(42, True), (43, False)], ...:dtype=[('x', np.int32), ('y', np.bool_)]) ...: ...: >>> data['x'] array([42, 43], dtype=int32) >>> pa.array(data['x'], type=pa.int32()) [ 42, 11009 ] {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
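The problematic layout can be reproduced with NumPy alone, and copying the field into a contiguous array is a plausible caller-side workaround until the conversion handles such strides (a sketch; {{np.ascontiguousarray}} is not part of the reported fix):

```python
import numpy as np

# A packed record dtype of 5 bytes (int32 + bool): the 'x' field view
# then has stride 5, which is not a multiple of its 4-byte itemsize.
data = np.array([(42, True), (43, False)],
                dtype=[('x', np.int32), ('y', np.bool_)])
x = data['x']
print(x.strides)       # (5,)

# Workaround: copy the field to a contiguous buffer before conversion.
fixed = np.ascontiguousarray(x)
print(fixed.strides)   # (4,)
print(fixed.tolist())  # [42, 43]
```

The bogus value 11009 in the report is consistent with reading 4 bytes at offset 5 instead of stepping by the record size.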
[jira] [Created] (ARROW-2173) [Python] NumPyBuffer destructor should hold the GIL
Antoine Pitrou created ARROW-2173: - Summary: [Python] NumPyBuffer destructor should hold the GIL Key: ARROW-2173 URL: https://issues.apache.org/jira/browse/ARROW-2173 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.8.0 Reporter: Antoine Pitrou Assignee: Antoine Pitrou Failure to hold the GIL can lead to crashes, depending on presence of several threads or whatever the object allocator needs to do. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2171) [Python] OwnedRef is fragile
Antoine Pitrou created ARROW-2171: - Summary: [Python] OwnedRef is fragile Key: ARROW-2171 URL: https://issues.apache.org/jira/browse/ARROW-2171 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.8.0 Reporter: Antoine Pitrou Assignee: Antoine Pitrou Some uses of OwnedRef can implicitly invoke its (default) copy constructor, which will lead to extraneous decrefs. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2197) Document "undefined symbol" issue and workaround
Antoine Pitrou created ARROW-2197: - Summary: Document "undefined symbol" issue and workaround Key: ARROW-2197 URL: https://issues.apache.org/jira/browse/ARROW-2197 Project: Apache Arrow Issue Type: Task Components: Documentation Affects Versions: 0.8.0 Reporter: Antoine Pitrou Assignee: Antoine Pitrou See [https://github.com/apache/arrow/issues/1612] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2218) [Python] PythonFile should infer mode when not given
Antoine Pitrou created ARROW-2218: - Summary: [Python] PythonFile should infer mode when not given Key: ARROW-2218 URL: https://issues.apache.org/jira/browse/ARROW-2218 Project: Apache Arrow Issue Type: Improvement Components: Python Affects Versions: 0.8.0 Reporter: Antoine Pitrou The following is clearly not optimal: {code:python} >>> f = open('README.md', 'r') >>> pa.PythonFile(f).mode 'wb' {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
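A sketch of the inference (hedged: {{infer_mode}} is a hypothetical helper, not pyarrow API; it probes the io capability methods and falls back to the textual {{mode}} attribute):

```python
import io

def infer_mode(f):
    """Map a Python file object's capabilities to a binary mode string."""
    mode = getattr(f, "mode", "")
    readable = f.readable() if hasattr(f, "readable") else ("r" in mode or "+" in mode)
    writable = f.writable() if hasattr(f, "writable") else bool(set("wax+") & set(mode))
    if readable and writable:
        return "rb+"
    if writable:
        return "wb"
    return "rb"

print(infer_mode(io.BufferedReader(io.BytesIO(b"x"))))  # rb
print(infer_mode(io.BufferedWriter(io.BytesIO())))      # wb
print(infer_mode(io.BytesIO()))                         # rb+
```

Probing `readable()`/`writable()` first also covers objects (sockets, pipes) that have no `mode` attribute at all.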
[jira] [Created] (ARROW-2950) [C++] Clean up util/bit-util.h
Antoine Pitrou created ARROW-2950: - Summary: [C++] Clean up util/bit-util.h Key: ARROW-2950 URL: https://issues.apache.org/jira/browse/ARROW-2950 Project: Apache Arrow Issue Type: Task Reporter: Antoine Pitrou -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3029) [Python] pkg_resources is slow
Antoine Pitrou created ARROW-3029: - Summary: [Python] pkg_resources is slow Key: ARROW-3029 URL: https://issues.apache.org/jira/browse/ARROW-3029 Project: Apache Arrow Issue Type: Improvement Components: Python Affects Versions: 0.10.0 Reporter: Antoine Pitrou Assignee: Antoine Pitrou Importing and calling {{pkg_resources}} at pyarrow import time to get the version number is slow (around 200 ms here, out of 640 ms total for importing pyarrow). Instead we could generate a version file, which seems possible using {{setuptools_scm}}'s {{write_to}} parameter: https://github.com/pypa/setuptools_scm/#configuration-parameters -- This message was sent by Atlassian JIRA (v7.6.3#76005)
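A hedged sketch of the suggested approach, per the linked {{setuptools_scm}} configuration parameters (the target file name {{pyarrow/_generated_version.py}} is a hypothetical choice for illustration, not what the project adopted):

```python
# setup.py fragment: ask setuptools_scm to write the resolved version
# into a module at build time, so importing the package never needs
# pkg_resources at runtime.
from setuptools import setup

setup(
    use_scm_version={"write_to": "pyarrow/_generated_version.py"},
    setup_requires=["setuptools_scm"],
)
```

At import time, pyarrow could then read {{__version__}} from the generated module instead of querying {{pkg_resources}}.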
[jira] [Created] (ARROW-3059) [C++] Streamline namespace arrow::test
Antoine Pitrou created ARROW-3059: - Summary: [C++] Streamline namespace arrow::test Key: ARROW-3059 URL: https://issues.apache.org/jira/browse/ARROW-3059 Project: Apache Arrow Issue Type: Task Components: C++ Affects Versions: 0.10.0 Reporter: Antoine Pitrou Currently we have some test helpers that live in the {{arrow::test}} namespace, and some in {{arrow}} (or topical subnamespaces such as {{arrow::io}}). I see no reason for the discrepancy. I propose the simple solution of removing the {{arrow::test}} namespace altogether. If that is not desirable, then we should make sure we put all helpers in that namespace. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3060) [C++] Factor out parsing routines
Antoine Pitrou created ARROW-3060: - Summary: [C++] Factor out parsing routines Key: ARROW-3060 URL: https://issues.apache.org/jira/browse/ARROW-3060 Project: Apache Arrow Issue Type: Task Components: C++ Affects Versions: 0.10.0 Reporter: Antoine Pitrou We have implementations of casting strings to numbers in the {{compute}} directory. Those can be more broadly useful (for example when parsing CSV files). We should therefore centralize them in their own C++ module. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2992) [Python] Parquet benchmark failure
Antoine Pitrou created ARROW-2992: - Summary: [Python] Parquet benchmark failure Key: ARROW-2992 URL: https://issues.apache.org/jira/browse/ARROW-2992 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Antoine Pitrou Assignee: Antoine Pitrou This is a regression on git master: {code:python} Traceback (most recent call last): File "/home/antoine/asv/asv/benchmark.py", line 867, in commands[mode](args) File "/home/antoine/asv/asv/benchmark.py", line 844, in main_run result = benchmark.do_run() File "/home/antoine/asv/asv/benchmark.py", line 398, in do_run return self.run(*self._current_params) File "/home/antoine/asv/asv/benchmark.py", line 473, in run samples, number = self.benchmark_timing(timer, repeat, warmup_time, number=number) File "/home/antoine/asv/asv/benchmark.py", line 520, in benchmark_timing timing = timer.timeit(number) File "/home/antoine/miniconda3/envs/pyarrow/lib/python3.6/timeit.py", line 178, in timeit timing = self.inner(it, self.timer) File "", line 6, in inner File "/home/antoine/asv/asv/benchmark.py", line 464, in func = lambda: self.func(*param) File "/home/antoine/arrow/python/benchmarks/parquet.py", line 54, in time_manifest_creation pq.ParquetManifest(self.tmpdir, thread_pool=thread_pool) TypeError: __init__() got an unexpected keyword argument 'thread_pool' {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2991) [CI] Cut down number of AppVeyor jobs
Antoine Pitrou created ARROW-2991: - Summary: [CI] Cut down number of AppVeyor jobs Key: ARROW-2991 URL: https://issues.apache.org/jira/browse/ARROW-2991 Project: Apache Arrow Issue Type: Task Components: Continuous Integration Affects Versions: 0.10.0 Reporter: Antoine Pitrou AppVeyor builds all jobs serially, so it's important not to have too many of them, to avoid builds taking too much time and queuing up. I suggest removing the following jobs: - the Release build with Ninja and VS2015; we already have both a Release build with Ninja and VS2017, and a Debug build with Ninja and VS2015 - the two NMake builds: we already exercise the Ninja (cross-platform, fastest) and Visual Studio (standard under Windows) build chains [~Max Risuhin] you added some of those jobs, do you have any concerns? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3049) [C++/Python] ORC reader fails on empty file
Antoine Pitrou created ARROW-3049: - Summary: [C++/Python] ORC reader fails on empty file Key: ARROW-3049 URL: https://issues.apache.org/jira/browse/ARROW-3049 Project: Apache Arrow Issue Type: Bug Components: C++ Affects Versions: 0.10.0 Reporter: Antoine Pitrou {code} Traceback (most recent call last): File "/home/antoine/arrow/python/pyarrow/tests/test_orc.py", line 83, in test_orcfile_empty check_example('TestOrcFile.emptyFile') File "/home/antoine/arrow/python/pyarrow/tests/test_orc.py", line 79, in check_example os.path.join(orc_data_dir, '%s.jsn.gz' % name)) File "/home/antoine/arrow/python/pyarrow/tests/test_orc.py", line 62, in check_example_files table = orc_file.read() File "/home/antoine/arrow/python/pyarrow/orc.py", line 149, in read return self.reader.read(include_indices=include_indices) File "pyarrow/_orc.pyx", line 106, in pyarrow._orc.ORCReader.read check_status(deref(self.reader).Read(_table)) File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status raise ArrowInvalid(message) pyarrow.lib.ArrowInvalid: Must pass at least one record batch {code} [~jim.crist] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3047) [C++] cmake downloads and builds ORC even though it's installed
Antoine Pitrou created ARROW-3047: - Summary: [C++] cmake downloads and builds ORC even though it's installed Key: ARROW-3047 URL: https://issues.apache.org/jira/browse/ARROW-3047 Project: Apache Arrow Issue Type: Bug Components: C++ Affects Versions: 0.10.0 Reporter: Antoine Pitrou I have installed orc 1.5.1 from conda-forge, but our cmake build chain still tries to build protobuf and ORC from source (and fails). {code:bash} $ ls $CONDA_PREFIX/include/orc/ ColumnPrinter.hh Common.hh Exceptions.hh Int128.hh MemoryPool.hh orc-config.hh OrcFile.hh Reader.hh Statistics.hh Type.hh Vector.hh Writer.hh $ ls -l $CONDA_PREFIX/lib/liborc* -rw-rw-r-- 2 antoine antoine 1952298 juin 20 17:32 /home/antoine/miniconda3/envs/pyarrow/lib/liborc.a {code} [~jim.crist] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3095) [Python] test_plasma.py fails
Antoine Pitrou created ARROW-3095: - Summary: [Python] test_plasma.py fails Key: ARROW-3095 URL: https://issues.apache.org/jira/browse/ARROW-3095 Project: Apache Arrow Issue Type: Bug Components: Plasma (C++), Python Affects Versions: 0.10.0 Reporter: Antoine Pitrou All tests in {{test_plasma.py}} fail here. It seems that plasma_store fails launching or something: {code} $ python -m pytest -x -r s --tb=native pyarrow/tests/test_plasma.py test session starts platform linux -- Python 3.7.0, pytest-3.7.2, py-1.5.4, pluggy-0.7.1 rootdir: /home/antoine/arrow/python, inifile: setup.cfg plugins: timeout-1.3.1, faulthandler-1.5.0 collected 24 items pyarrow/tests/test_plasma.py E === ERRORS === ERROR at setup of TestPlasmaClient.test_connection_failure_raises_exception _ Traceback (most recent call last): File "/home/antoine/arrow/python/pyarrow/tests/test_plasma.py", line 119, in setup_method self.plasma_client = plasma.connect(plasma_store_name, "", 64) File "pyarrow/_plasma.pyx", line 691, in pyarrow._plasma.connect check_status(result.client.get() File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status raise ArrowIOError(message) pyarrow.lib.ArrowIOError: ../src/plasma/client.cc:921 code: ConnectIpcSocketRetry(store_socket_name, num_retries, -1, _conn_) Could not connect to socket /tmp/test_plasma-ikgi25pf/plasma.sock --- Captured stderr setup Connection to IPC socket failed for pathname /tmp/test_plasma-ikgi25pf/plasma.sock, retrying 50 more times [the same message repeats with the retry count decreasing; log truncated] {code}
[jira] [Created] (ARROW-3093) [C++] Linking errors with ORC enabled
Antoine Pitrou created ARROW-3093: - Summary: [C++] Linking errors with ORC enabled Key: ARROW-3093 URL: https://issues.apache.org/jira/browse/ARROW-3093 Project: Apache Arrow Issue Type: Bug Affects Versions: 0.10.0 Reporter: Antoine Pitrou In an attempt to work around ARROW-3091 and ARROW-3092, I've recreated my conda environment, and now I get linking errors if ORC support is enabled: {code} debug/libarrow.so.11.0.0: error: undefined reference to 'google::protobuf::MessageLite::ParseFromString(std::string const&)' debug/libarrow.so.11.0.0: error: undefined reference to 'google::protobuf::MessageLite::SerializeToString(std::string*) const' debug/libarrow.so.11.0.0: error: undefined reference to 'google::protobuf::internal::fixed_address_empty_string' [etc.] {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3092) [C++] Segfault in json-integration-test
Antoine Pitrou created ARROW-3092: - Summary: [C++] Segfault in json-integration-test Key: ARROW-3092 URL: https://issues.apache.org/jira/browse/ARROW-3092 Project: Apache Arrow Issue Type: Bug Components: C++ Affects Versions: 0.10.0 Reporter: Antoine Pitrou I've upgraded to Ubuntu 18.04.1 and now I get segfaults in json-integration-test: {code} (gdb) run Starting program: /home/antoine/arrow/cpp/build-test/debug/json-integration-test [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". [==] Running 2 tests from 1 test case. [--] Global test environment set-up. [--] 2 tests from TestJSONIntegration [ RUN ] TestJSONIntegration.ConvertAndValidate Program received signal SIGSEGV, Segmentation fault. std::string::_Rep::_M_is_leaked (this=this@entry=0xffe8) at /home/msarahan/miniconda2/conda-bld/compilers_linux-64_1507259624353/work/.build/x86_64-conda_cos6-linux-gnu/build/build-cc-gcc-final/x86_64-conda_cos6-linux-gnu/libstdc++-v3/include/bits/basic_string.h:3075 3075 /home/msarahan/miniconda2/conda-bld/compilers_linux-64_1507259624353/work/.build/x86_64-conda_cos6-linux-gnu/build/build-cc-gcc-final/x86_64-conda_cos6-linux-gnu/libstdc++-v3/include/bits/basic_string.h: No such file or directory. (gdb) bt #0 std::string::_Rep::_M_is_leaked (this=this@entry=0xffe8) at /home/msarahan/miniconda2/conda-bld/compilers_linux-64_1507259624353/work/.build/x86_64-conda_cos6-linux-gnu/build/build-cc-gcc-final/x86_64-conda_cos6-linux-gnu/libstdc++-v3/include/bits/basic_string.h:3075 #1 0x77311856 in std::string::_Rep::_M_grab (this=0xffe8, __alloc1=..., __alloc2=...) at /home/msarahan/miniconda2/conda-bld/compilers_linux-64_1507259624353/work/.build/x86_64-conda_cos6-linux-gnu/build/build-cc-gcc-final/x86_64-conda_cos6-linux-gnu/libstdc++-v3/include/bits/basic_string.h:3126 #2 0x7731189d in std::basic_string, std::allocator >::basic_string (this=0x7fffcf68, __str=...) 
at /home/msarahan/miniconda2/conda-bld/compilers_linux-64_1507259624353/work/.build/x86_64-conda_cos6-linux-gnu/build/build-cc-gcc-final/x86_64-conda_cos6-linux-gnu/libstdc++-v3/include/bits/basic_string.tcc:613 #3 0x005f63fd in boost::filesystem::path::path (this=0x7fffcf68, p=...) at /home/antoine/miniconda3/envs/pyarrow/include/boost/filesystem/path.hpp:137 #4 0x005f628a in boost::filesystem::operator/ (lhs=..., rhs=...) at /home/antoine/miniconda3/envs/pyarrow/include/boost/filesystem/path.hpp:792 #5 0x005f1d37 in arrow::ipc::temp_path () at ../src/arrow/ipc/json-integration-test.cc:233 #6 0x005f3038 in arrow::ipc::TestJSONIntegration::mkstemp (this=) at ../src/arrow/ipc/json-integration-test.cc:241 Backtrace stopped: previous frame inner to this frame (corrupt stack?) {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3091) [C++] Segfault in io-hdfs-test
Antoine Pitrou created ARROW-3091: - Summary: [C++] Segfault in io-hdfs-test Key: ARROW-3091 URL: https://issues.apache.org/jira/browse/ARROW-3091 Project: Apache Arrow Issue Type: Bug Components: C++ Affects Versions: 0.10.0 Reporter: Antoine Pitrou I've upgraded to Ubuntu 18.04.1 and now I get segfaults in io-hdfs-test: {code} (gdb) run Starting program: /home/antoine/arrow/cpp/build-test/debug/io-hdfs-test [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". Running main() from gtest_main.cc [==] Running 24 tests from 2 test cases. [--] Global test environment set-up. [--] 12 tests from TestHadoopFileSystem/0, where TypeParam = arrow::io::JNIDriver [ RUN ] TestHadoopFileSystem/0.ConnectsAgain Program received signal SIGSEGV, Segmentation fault. 0x775a15ae in boost::filesystem::path::m_append_separator_if_needed() () from /home/antoine/miniconda3/envs/pyarrow/lib/libboost_filesystem.so.1.67.0 (gdb) bt #0 0x775a15ae in boost::filesystem::path::m_append_separator_if_needed() () from /home/antoine/miniconda3/envs/pyarrow/lib/libboost_filesystem.so.1.67.0 #1 0x775a2917 in boost::filesystem::path::operator/=(boost::filesystem::path const&) () from /home/antoine/miniconda3/envs/pyarrow/lib/libboost_filesystem.so.1.67.0 #2 0x00549647 in boost::filesystem::operator/ (lhs=..., rhs=...) 
at /home/antoine/miniconda3/envs/pyarrow/include/boost/filesystem/path.hpp:792 #3 0x00547b2d in arrow::io::TestHadoopFileSystem::SetUp (this=0x7143c0) at ../src/arrow/io/io-hdfs-test.cc:98 #4 0x0065e98e in testing::internal::HandleSehExceptionsInMethodIfSupported (object=0x7143c0, method= testing::Test::SetUp(), location=0x66a48a "SetUp()") at /home/antoine/arrow/cpp/build-test/googletest_ep-prefix/src/googletest_ep/googletest/src/gtest.cc:2402 #5 0x0064a7e5 in testing::internal::HandleExceptionsInMethodIfSupported (object=0x7143c0, method= testing::Test::SetUp(), location=0x66a48a "SetUp()") at /home/antoine/arrow/cpp/build-test/googletest_ep-prefix/src/googletest_ep/googletest/src/gtest.cc:2438 #6 0x00632a14 in testing::Test::Run (this=0x7143c0) at /home/antoine/arrow/cpp/build-test/googletest_ep-prefix/src/googletest_ep/googletest/src/gtest.cc:2470 #7 0x006336fd in testing::TestInfo::Run (this=0x710420) at /home/antoine/arrow/cpp/build-test/googletest_ep-prefix/src/googletest_ep/googletest/src/gtest.cc:2656 #8 0x00633dbc in testing::TestCase::Run (this=0x7108f0) at /home/antoine/arrow/cpp/build-test/googletest_ep-prefix/src/googletest_ep/googletest/src/gtest.cc:2774 #9 0x0063b331 in testing::internal::UnitTestImpl::RunAllTests (this=0x710590) at /home/antoine/arrow/cpp/build-test/googletest_ep-prefix/src/googletest_ep/googletest/src/gtest.cc:4649 #10 0x0066208e in testing::internal::HandleSehExceptionsInMethodIfSupported (object=0x710590, method=(bool (testing::internal::UnitTestImpl::*)(testing::internal::UnitTestImpl * const)) 0x63b050 , location=0x66ac25 "auxiliary test code (environments or event listeners)") at /home/antoine/arrow/cpp/build-test/googletest_ep-prefix/src/googletest_ep/googletest/src/gtest.cc:2402 #11 0x0064c945 in testing::internal::HandleExceptionsInMethodIfSupported (object=0x710590, method=(bool (testing::internal::UnitTestImpl::*)(testing::internal::UnitTestImpl * const)) 0x63b050 , location=0x66ac25 "auxiliary test code (environments or 
event listeners)") at /home/antoine/arrow/cpp/build-test/googletest_ep-prefix/src/googletest_ep/googletest/src/gtest.cc:2438 #12 0x0063b003 in testing::UnitTest::Run (this=0x6fcd48 ) at /home/antoine/arrow/cpp/build-test/googletest_ep-prefix/src/googletest_ep/googletest/src/gtest.cc:4257 #13 0x00666481 in RUN_ALL_TESTS () at /home/antoine/arrow/cpp/build-test/googletest_ep-prefix/src/googletest_ep/googletest/include/gtest/gtest.h:2233 #14 0x0066644c in main (argc=1, argv=0x7fffd878) at /home/antoine/arrow/cpp/build-test/googletest_ep-prefix/src/googletest_ep/googletest/src/gtest_main.cc:37 {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3110) [C++] Compilation warnings with gcc 7.3.0
Antoine Pitrou created ARROW-3110: - Summary: [C++] Compilation warnings with gcc 7.3.0 Key: ARROW-3110 URL: https://issues.apache.org/jira/browse/ARROW-3110 Project: Apache Arrow Issue Type: Task Components: C++ Affects Versions: 0.10.0 Reporter: Antoine Pitrou Assignee: Antoine Pitrou This is happening when building in release mode: {code} ../src/arrow/python/python_to_arrow.cc: In function 'arrow::Status arrow::py::detail::BuilderAppend(arrow::BinaryBuilder*, PyObject*, bool*)': ../src/arrow/python/python_to_arrow.cc:388:56: warning: 'length' may be used uninitialized in this function [-Wmaybe-uninitialized] if (ARROW_PREDICT_FALSE(builder->value_data_length() + length > kBinaryMemoryLimit)) { ^ ../src/arrow/python/python_to_arrow.cc:385:11: note: 'length' was declared here int32_t length; ^~ In file included from ../src/arrow/python/serialize.cc:32:0: ../src/arrow/builder.h: In member function 'arrow::Status arrow::py::SequenceBuilder::Update(int64_t, int8_t*)': ../src/arrow/builder.h:413:5: warning: 'offset32' may be used uninitialized in this function [-Wmaybe-uninitialized] raw_data_[length_++] = val; ^ ../src/arrow/python/serialize.cc:90:13: note: 'offset32' was declared here int32_t offset32; ^~~~ {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3099) [C++] Add benchmark for number parsing
Antoine Pitrou created ARROW-3099: - Summary: [C++] Add benchmark for number parsing Key: ARROW-3099 URL: https://issues.apache.org/jira/browse/ARROW-3099 Project: Apache Arrow Issue Type: Wish Components: C++ Affects Versions: 0.10.0 Reporter: Antoine Pitrou Assignee: Antoine Pitrou Number parsing will become important once we have a CSV reader (or possibly other text-based formats). We should add benchmarks for the internal conversion routines. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3100) [CI] C/glib build broken on OS X
Antoine Pitrou created ARROW-3100: - Summary: [CI] C/glib build broken on OS X Key: ARROW-3100 URL: https://issues.apache.org/jira/browse/ARROW-3100 Project: Apache Arrow Issue Type: Bug Components: Continuous Integration, GLib Reporter: Antoine Pitrou The Travis-CI build fails to find luarocks: https://travis-ci.org/apache/arrow/jobs/418753219#L2657 {code} +sudo env PKG_CONFIG_PATH=:/usr/local/opt/libffi/lib/pkgconfig luarocks install lgi env: luarocks: No such file or directory The command "$TRAVIS_BUILD_DIR/ci/travis_before_script_c_glib.sh" failed and exited with 127 during . {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3125) [Python] Update ASV instructions
Antoine Pitrou created ARROW-3125: - Summary: [Python] Update ASV instructions Key: ARROW-3125 URL: https://issues.apache.org/jira/browse/ARROW-3125 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.10.0 Reporter: Antoine Pitrou Assignee: Antoine Pitrou The ability to define custom install / build / uninstall commands was added in mainline ASV in https://github.com/airspeed-velocity/asv/pull/699 We don't need to use our own fork / PR anymore. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3140) [Plasma] Plasma fails building with GPU enabled
Antoine Pitrou created ARROW-3140: - Summary: [Plasma] Plasma fails building with GPU enabled Key: ARROW-3140 URL: https://issues.apache.org/jira/browse/ARROW-3140 Project: Apache Arrow Issue Type: Bug Components: GPU, Plasma (C++) Reporter: Antoine Pitrou {code} In file included from ../src/plasma/client.h:30:0, from ../src/plasma/client.cc:20: ../src/plasma/common.h:120:19: error: ‘CudaIpcMemHandle’ was not declared in this scope std::shared_ptr<CudaIpcMemHandle> ipc_handle; ^~~~ {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2913) [Python] Exported buffers don't expose type information
Antoine Pitrou created ARROW-2913: - Summary: [Python] Exported buffers don't expose type information Key: ARROW-2913 URL: https://issues.apache.org/jira/browse/ARROW-2913 Project: Apache Arrow Issue Type: Improvement Components: C++, Python Affects Versions: 0.10.0 Reporter: Antoine Pitrou Using the {{buffers()}} method on array gives you a list of buffers backing the array, but those buffers lose typing information: {code:python} >>> a = pa.array(range(10)) >>> a.type DataType(int64) >>> buffers = a.buffers() >>> [(memoryview(buf).format, memoryview(buf).shape) for buf in buffers] [('b', (2,)), ('b', (80,))] {code} Conversely, Numpy exposes type information in the Python buffer protocol: {code:python} >>> a = pa.array(range(10)) >>> memoryview(a.to_numpy()).format 'l' >>> memoryview(a.to_numpy()).shape (10,) {code} Exposing type information on buffers could be important for third-party systems, such as Dask/distributed, for type-based data compression when serializing. Since our C++ buffers are not typed, it's not obvious how to solve this. Should we return tensors instead? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
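One workaround available today for a consumer that knows the logical type out-of-band is to re-view the untyped bytes through {{memoryview.cast}}. This is a toy illustration using plain {{bytes}} as a stand-in for a {{pyarrow.Buffer}} (pyarrow itself is not imported here):

```python
import struct

# Stand-in for an untyped pyarrow.Buffer: 10 little-endian int64
# values packed into raw bytes.
raw = struct.pack("<10q", *range(10))

untyped = memoryview(raw)     # what buffers() effectively exposes
assert untyped.format == "B"  # no logical type information

# A consumer that knows the type (int64 -> format code "q")
# can re-view the same bytes with shape and type attached:
typed = untyped.cast("q")
assert typed.format == "q"
assert typed.shape == (10,)
assert typed.tolist() == list(range(10))
```

This only recovers fixed-width primitive types; validity bitmaps and offset buffers would still need out-of-band knowledge of the Arrow layout.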
[jira] [Created] (ARROW-2867) [Python] Incorrect example for Cython usage
Antoine Pitrou created ARROW-2867: - Summary: [Python] Incorrect example for Cython usage Key: ARROW-2867 URL: https://issues.apache.org/jira/browse/ARROW-2867 Project: Apache Arrow Issue Type: Bug Components: Documentation, Python Affects Versions: 0.9.0 Reporter: Antoine Pitrou Assignee: Antoine Pitrou When blindly pasting the Cython distutils example, one might get the following error: {code} Traceback (most recent call last): File "setup.py", line 20, in ext_modules=ext_modules, File "/home/antoine/miniconda3/envs/pyarrow/lib/python3.6/distutils/core.py", line 148, in setup dist.run_commands() File "/home/antoine/miniconda3/envs/pyarrow/lib/python3.6/distutils/dist.py", line 955, in run_commands self.run_command(cmd) File "/home/antoine/miniconda3/envs/pyarrow/lib/python3.6/distutils/dist.py", line 974, in run_command cmd_obj.run() File "/home/antoine/miniconda3/envs/pyarrow/lib/python3.6/distutils/command/build_ext.py", line 339, in run self.build_extensions() File "/home/antoine/miniconda3/envs/pyarrow/lib/python3.6/distutils/command/build_ext.py", line 448, in build_extensions self._build_extensions_serial() File "/home/antoine/miniconda3/envs/pyarrow/lib/python3.6/distutils/command/build_ext.py", line 473, in _build_extensions_serial self.build_extension(ext) File "/home/antoine/miniconda3/envs/pyarrow/lib/python3.6/distutils/command/build_ext.py", line 558, in build_extension target_lang=language) File "/home/antoine/miniconda3/envs/pyarrow/lib/python3.6/distutils/ccompiler.py", line 717, in link_shared_object extra_preargs, extra_postargs, build_temp, target_lang) File "/home/antoine/miniconda3/envs/pyarrow/lib/python3.6/distutils/unixccompiler.py", line 159, in link libraries) File "/home/antoine/miniconda3/envs/pyarrow/lib/python3.6/distutils/ccompiler.py", line 1089, in gen_lib_options lib_opts.append(compiler.library_dir_option(dir)) File "/home/antoine/miniconda3/envs/pyarrow/lib/python3.6/distutils/unixccompiler.py", line 207, in library_dir_option 
return "-L" + dir TypeError: must be str, not list {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3142) [C++] Fetch all libs from toolchain environment
Antoine Pitrou created ARROW-3142: - Summary: [C++] Fetch all libs from toolchain environment Key: ARROW-3142 URL: https://issues.apache.org/jira/browse/ARROW-3142 Project: Apache Arrow Issue Type: Improvement Components: C++ Affects Versions: 0.10.0 Reporter: Antoine Pitrou When setting ARROW_BUILD_TOOLCHAIN, gtest and orc are currently not taken from the toolchain environment. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3167) [CI] Limit clcache cache size
Antoine Pitrou created ARROW-3167: - Summary: [CI] Limit clcache cache size Key: ARROW-3167 URL: https://issues.apache.org/jira/browse/ARROW-3167 Project: Apache Arrow Issue Type: Improvement Components: C++, Continuous Integration Reporter: Antoine Pitrou Assignee: Antoine Pitrou The clcache cache on AppVeyor has a default max size of 1 GB and can reach close to this size (see e.g. https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/build/1.0.7722/job/5gp85w0m5xei0nme#L251). We should limit its size to something more reasonable to lower cache transfer / compression times. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2825) [C++] Need AllocateBuffer / AllocateResizableBuffer variant with default memory pool
Antoine Pitrou created ARROW-2825: - Summary: [C++] Need AllocateBuffer / AllocateResizableBuffer variant with default memory pool Key: ARROW-2825 URL: https://issues.apache.org/jira/browse/ARROW-2825 Project: Apache Arrow Issue Type: Wish Components: C++ Affects Versions: 0.9.0 Reporter: Antoine Pitrou It's not very practical that you have to pass the default memory pool explicitly to {{AllocateBuffer}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2826) [C++] Clarification needed between ArrayBuilder::Init(), Resize() and Reserve()
Antoine Pitrou created ARROW-2826: - Summary: [C++] Clarification needed between ArrayBuilder::Init(), Resize() and Reserve() Key: ARROW-2826 URL: https://issues.apache.org/jira/browse/ARROW-2826 Project: Apache Arrow Issue Type: Wish Components: C++ Affects Versions: 0.9.0 Reporter: Antoine Pitrou It's still not clear to me why we have three builder methods that seem to do essentially the same thing. This should be clarified somewhere in the docstrings. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2838) [Python] Speed up null testing with Pandas semantics
Antoine Pitrou created ARROW-2838: - Summary: [Python] Speed up null testing with Pandas semantics Key: ARROW-2838 URL: https://issues.apache.org/jira/browse/ARROW-2838 Project: Apache Arrow Issue Type: Improvement Components: C++, Python Affects Versions: 0.9.0 Reporter: Antoine Pitrou Assignee: Antoine Pitrou The {{PandasObjectIsNull}} helper function can be a significant contributor when converting a Pandas dataframe to Arrow format (e.g. when writing a dataframe to feather format). We can try to speed up the type checks in that function. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
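The check that {{PandasObjectIsNull}} performs can be sketched in pure Python. This is an approximation for illustration only: the real helper is C++ and also covers cases not modeled here, such as {{pandas.NaT}} and NumPy scalar types:

```python
import math

def object_is_null(obj):
    # Pandas semantics (approximately): None and float NaN count
    # as null; everything else is a regular value.
    if obj is None:
        return True
    if isinstance(obj, float):
        return math.isnan(obj)
    return False

assert object_is_null(None)
assert object_is_null(float("nan"))
assert not object_is_null(0.0)
assert not object_is_null("x")
```

The speed-up opportunity is in ordering these checks so the cheapest and most common ones run first.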
[jira] [Created] (ARROW-2277) [Python] Tensor.from_numpy doesn't support struct arrays
Antoine Pitrou created ARROW-2277: - Summary: [Python] Tensor.from_numpy doesn't support struct arrays Key: ARROW-2277 URL: https://issues.apache.org/jira/browse/ARROW-2277 Project: Apache Arrow Issue Type: Improvement Components: Python Affects Versions: 0.8.0 Reporter: Antoine Pitrou {code:python} >>> dt = np.dtype([('x', np.int8), ('y', np.float32)]) >>> dt.itemsize 5 >>> arr = np.arange(5*10, dtype=np.int8).view(dt) >>> pa.Tensor.from_numpy(arr) Traceback (most recent call last): File "", line 1, in pa.Tensor.from_numpy(arr) File "array.pxi", line 523, in pyarrow.lib.Tensor.from_numpy File "error.pxi", line 85, in pyarrow.lib.check_status ArrowNotImplementedError: /home/antoine/arrow/cpp/src/arrow/python/numpy_convert.cc:250 code: GetTensorType(reinterpret_cast(PyArray_DESCR(ndarray)), ) Unsupported numpy type 20 {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2278) [Python] deserializing Numpy struct arrays raises
Antoine Pitrou created ARROW-2278: - Summary: [Python] deserializing Numpy struct arrays raises Key: ARROW-2278 URL: https://issues.apache.org/jira/browse/ARROW-2278 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.8.0 Reporter: Antoine Pitrou {code:python} >>> import numpy as np >>> dt = np.dtype([('x', np.int8), ('y', np.float32)]) >>> arr = np.arange(5*10, dtype=np.int8).view(dt) >>> pa.deserialize(pa.serialize(arr).to_buffer()) Traceback (most recent call last): File "", line 1, in pa.deserialize(pa.serialize(arr).to_buffer()) File "serialization.pxi", line 441, in pyarrow.lib.deserialize File "serialization.pxi", line 404, in pyarrow.lib.deserialize_from File "serialization.pxi", line 257, in pyarrow.lib.SerializedPyObject.deserialize File "serialization.pxi", line 174, in pyarrow.lib.SerializationContext._deserialize_callback File "/home/antoine/arrow/python/pyarrow/serialization.py", line 44, in _deserialize_numpy_array_list return np.array(data[0], dtype=np.dtype(data[1])) TypeError: a bytes-like object is required, not 'int' {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2288) [Python] slicing logic defective
Antoine Pitrou created ARROW-2288: - Summary: [Python] slicing logic defective Key: ARROW-2288 URL: https://issues.apache.org/jira/browse/ARROW-2288 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.8.0 Reporter: Antoine Pitrou Assignee: Antoine Pitrou The slicing logic tends to go too far when normalizing large negative bounds, which leads to results not in line with Python's slicing semantics: {code} >>> arr = pa.array([1,2,3,4]) >>> arr[-99:100] [ 2, 3, 4 ] >>> arr.to_pylist()[-99:100] [1, 2, 3, 4] >>> >>> >>> arr[-6:-5] [ 3 ] >>> arr.to_pylist()[-6:-5] [] {code} Also note this crash: {code} >>> arr[10:13] /home/antoine/arrow/cpp/src/arrow/array.cc:105 Check failed: (offset) <= (data.length) Aborted (core dumped) {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
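The reference semantics the {{to_pylist()}} results follow are exactly what {{slice.indices()}} computes: a negative bound has the length added once, then both bounds are clamped into {{[0, length]}}, never wrapped a second time. A small sketch against a length-4 array (the hypothetical {{normalized_bounds}} helper is only for illustration):

```python
def normalized_bounds(start, stop, length):
    # slice.indices() applies Python's normalization: add length to
    # negative bounds once, then clamp into [0, length].
    lo, hi, _step = slice(start, stop).indices(length)
    return lo, hi

# The cases from the report, against a length-4 array:
assert normalized_bounds(-99, 100, 4) == (0, 4)  # full array, not [1:]
assert normalized_bounds(-6, -5, 4) == (0, 0)    # empty, not a 1-element slice
assert normalized_bounds(10, 13, 4) == (4, 4)    # empty, no crash
```

Matching these rules would fix both the wrong results and the out-of-bounds check failure.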
[jira] [Created] (ARROW-2237) [Python] Huge tables test failure
Antoine Pitrou created ARROW-2237: - Summary: [Python] Huge tables test failure Key: ARROW-2237 URL: https://issues.apache.org/jira/browse/ARROW-2237 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Antoine Pitrou This is a new failure here (Ubuntu 16.04, x86-64): {code} _ test_use_huge_pages _ Traceback (most recent call last): File "/home/antoine/arrow/python/pyarrow/tests/test_plasma.py", line 779, in test_use_huge_pages create_object(plasma_client, 1) File "/home/antoine/arrow/python/pyarrow/tests/test_plasma.py", line 80, in create_object seal=seal) File "/home/antoine/arrow/python/pyarrow/tests/test_plasma.py", line 69, in create_object_with_id memory_buffer = client.create(object_id, data_size, metadata) File "plasma.pyx", line 302, in pyarrow.plasma.PlasmaClient.create File "error.pxi", line 79, in pyarrow.lib.check_status pyarrow.lib.ArrowIOError: /home/antoine/arrow/cpp/src/plasma/client.cc:192 code: PlasmaReceive(store_conn_, MessageType_PlasmaCreateReply, ) /home/antoine/arrow/cpp/src/plasma/protocol.cc:46 code: ReadMessage(sock, , buffer) Encountered unexpected EOF Captured stderr call - Allowing the Plasma store to use up to 0.1GB of memory. Starting object store with directory /mnt/hugepages and huge page support enabled create_buffer failed to open file /mnt/hugepages/plasmapSNc0X {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2238) [C++] Detect clcache in cmake configuration
Antoine Pitrou created ARROW-2238: - Summary: [C++] Detect clcache in cmake configuration Key: ARROW-2238 URL: https://issues.apache.org/jira/browse/ARROW-2238 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Antoine Pitrou Assignee: Antoine Pitrou By default Windows builds should use clcache if installed. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2239) [C++] Update build docs for Windows
Antoine Pitrou created ARROW-2239: - Summary: [C++] Update build docs for Windows Key: ARROW-2239 URL: https://issues.apache.org/jira/browse/ARROW-2239 Project: Apache Arrow Issue Type: Task Components: C++, Documentation Reporter: Antoine Pitrou Fix For: 0.9.0 We should update the C++ build docs for Windows to recommend use of Ninja and clcache for faster builds. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2311) [Python] Struct array slicing defective
Antoine Pitrou created ARROW-2311: - Summary: [Python] Struct array slicing defective Key: ARROW-2311 URL: https://issues.apache.org/jira/browse/ARROW-2311 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.8.0 Reporter: Antoine Pitrou Assignee: Antoine Pitrou {code:python} >>> arr = pa.array([(1, 2.0), (3, 4.0), (5, 6.0)], ... type=pa.struct([pa.field('x', pa.int16()), pa.field('y', pa.float32())])) >>> arr [ {'x': 1, 'y': 2.0}, {'x': 3, 'y': 4.0}, {'x': 5, 'y': 6.0} ] >>> arr[1:] [ {'x': 1, 'y': 2.0}, {'x': 3, 'y': 4.0} ] {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2315) [C++/Python] Add method to flatten a struct array
Antoine Pitrou created ARROW-2315: - Summary: [C++/Python] Add method to flatten a struct array Key: ARROW-2315 URL: https://issues.apache.org/jira/browse/ARROW-2315 Project: Apache Arrow Issue Type: Improvement Components: C++, Python Affects Versions: 0.9.0 Reporter: Antoine Pitrou Assignee: Antoine Pitrou See ARROW-1886. We want to be able to take a StructArray and flatten it into independent field arrays. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2270) [Python] ForeignBuffer doesn't tie Python object lifetime to C++ buffer lifetime
Antoine Pitrou created ARROW-2270: - Summary: [Python] ForeignBuffer doesn't tie Python object lifetime to C++ buffer lifetime Key: ARROW-2270 URL: https://issues.apache.org/jira/browse/ARROW-2270 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Antoine Pitrou Assignee: Antoine Pitrou {{ForeignBuffer}} keeps the reference to the Python base object in the Python wrapper class, not in the C++ buffer instance, meaning if the C++ buffer gets passed around but the Python wrapper gets destroyed, the reference to the original Python base object will be released. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2271) [Python] test_plasma could make errors more diagnosable
Antoine Pitrou created ARROW-2271: - Summary: [Python] test_plasma could make errors more diagnosable Key: ARROW-2271 URL: https://issues.apache.org/jira/browse/ARROW-2271 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Antoine Pitrou Currently, when {{plasma_store}} fails for one reason or another, you get poorly readable errors from {{test_plasma.py}}. Displaying the child process' stderr would help. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2287) [Python] chunked array not iterable, not indexable
Antoine Pitrou created ARROW-2287: - Summary: [Python] chunked array not iterable, not indexable Key: ARROW-2287 URL: https://issues.apache.org/jira/browse/ARROW-2287 Project: Apache Arrow Issue Type: Improvement Components: Python Affects Versions: 0.8.0 Reporter: Antoine Pitrou It would be useful to access individual elements of a chunked array either through iteration or indexing. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
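The lookup such a feature needs is straightforward: walk the chunks, subtracting each chunk's length until the logical index lands inside one. A toy model, with plain lists standing in for Arrow arrays (this is an illustration, not the pyarrow implementation):

```python
from itertools import chain

class ChunkedSequence:
    """Toy model of a chunked array: logical indexing and
    iteration across a list of chunks."""

    def __init__(self, chunks):
        self.chunks = chunks

    def __len__(self):
        return sum(len(c) for c in self.chunks)

    def __getitem__(self, i):
        if i < 0:
            i += len(self)
        for chunk in self.chunks:
            if 0 <= i < len(chunk):
                return chunk[i]
            i -= len(chunk)
        raise IndexError("index out of bounds")

    def __iter__(self):
        # Iteration simply chains the chunks in order.
        return chain.from_iterable(self.chunks)

seq = ChunkedSequence([[1, 2], [3], [4, 5, 6]])
assert list(seq) == [1, 2, 3, 4, 5, 6]
assert seq[0] == 1 and seq[2] == 3 and seq[-1] == 6
```

For repeated random access, precomputing cumulative chunk offsets and binary-searching them would avoid the linear scan.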
[jira] [Created] (ARROW-2284) [Python] test_plasma error on plasma_store error
Antoine Pitrou created ARROW-2284: - Summary: [Python] test_plasma error on plasma_store error Key: ARROW-2284 URL: https://issues.apache.org/jira/browse/ARROW-2284 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Antoine Pitrou Assignee: Antoine Pitrou This appears caused by my latest changes: {code:python} Traceback (most recent call last): File "/home/antoine/arrow/python/pyarrow/tests/test_plasma.py", line 192, in setup_method plasma_store_name, self.p = self.plasma_store_ctx.__enter__() File "/home/antoine/miniconda3/envs/pyarrow/lib/python3.6/contextlib.py", line 81, in __enter__ return next(self.gen) File "/home/antoine/arrow/python/pyarrow/tests/test_plasma.py", line 168, in start_plasma_store err = proc.stderr.read().decode() AttributeError: 'NoneType' object has no attribute 'read' {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2286) [Python] Allow subscripting pyarrow.lib.StructValue
Antoine Pitrou created ARROW-2286: - Summary: [Python] Allow subscripting pyarrow.lib.StructValue Key: ARROW-2286 URL: https://issues.apache.org/jira/browse/ARROW-2286 Project: Apache Arrow Issue Type: Wish Components: Python Affects Versions: 0.8.0 Reporter: Antoine Pitrou {code:python} >>> obj {'x': 42, 'y': True} >>> type(obj) pyarrow.lib.StructValue >>> obj['x'] Traceback (most recent call last): File "", line 1, in obj['x'] TypeError: 'pyarrow.lib.StructValue' object is not subscriptable {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2285) [Python] Can't convert Numpy string arrays
Antoine Pitrou created ARROW-2285: - Summary: [Python] Can't convert Numpy string arrays Key: ARROW-2285 URL: https://issues.apache.org/jira/browse/ARROW-2285 Project: Apache Arrow Issue Type: Improvement Components: Python Affects Versions: 0.8.0 Reporter: Antoine Pitrou {code:python} >>> arr = np.array([b'foo', b'bar'], dtype='S3') >>> pa.array(arr, type=pa.binary(3)) Traceback (most recent call last): File "", line 1, in pa.array(arr, type=pa.binary(3)) File "array.pxi", line 177, in pyarrow.lib.array File "array.pxi", line 77, in pyarrow.lib._ndarray_to_array File "error.pxi", line 85, in pyarrow.lib.check_status ArrowNotImplementedError: /home/antoine/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:1661 code: converter.Convert() NumPyConverter doesn't implement conversion. {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2276) [Python] Tensor could implement the buffer protocol
Antoine Pitrou created ARROW-2276: - Summary: [Python] Tensor could implement the buffer protocol Key: ARROW-2276 URL: https://issues.apache.org/jira/browse/ARROW-2276 Project: Apache Arrow Issue Type: Improvement Components: Python Affects Versions: 0.8.0 Reporter: Antoine Pitrou Tensors have an underlying buffer, a data type, shape and strides. It seems like they could implement the Python buffer protocol. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2275) [C++] Buffer::mutable_data_ member uninitialized
Antoine Pitrou created ARROW-2275: - Summary: [C++] Buffer::mutable_data_ member uninitialized Key: ARROW-2275 URL: https://issues.apache.org/jira/browse/ARROW-2275 Project: Apache Arrow Issue Type: Bug Components: C++ Affects Versions: 0.8.0 Reporter: Antoine Pitrou For immutable buffers (i.e. most of them), the {{mutable_data_}} member is uninitialized. If the user calls {{mutable_data()}} by mistake on such a buffer, they will get a bogus pointer back. This is exacerbated by the Tensor API whose const and non-const {{raw_data()}} methods return different things... (also an idea: add a DCHECK for mutability before returning from {{mutable_data()}}?) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2309) [C++] Use std::make_unsigned
Antoine Pitrou created ARROW-2309: - Summary: [C++] Use std::make_unsigned Key: ARROW-2309 URL: https://issues.apache.org/jira/browse/ARROW-2309 Project: Apache Arrow Issue Type: Task Components: C++ Affects Versions: 0.8.0 Reporter: Antoine Pitrou Assignee: Antoine Pitrou {{arrow/util/bit-util.h}} has a reimplementation of {{boost::make_unsigned}}, but we could simply use {{std::make_unsigned}}, which is C++11. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2426) [CI] glib build failure
Antoine Pitrou created ARROW-2426: - Summary: [CI] glib build failure Key: ARROW-2426 URL: https://issues.apache.org/jira/browse/ARROW-2426 Project: Apache Arrow Issue Type: Bug Components: Continuous Integration Reporter: Antoine Pitrou The glib build on Travis-CI fails: [https://travis-ci.org/apache/arrow/jobs/364123364#L6840] {code} ==> Installing gobject-introspection ==> Downloading https://homebrew.bintray.com/bottles/gobject-introspection-1.56.0_1.sierra.bottle.tar.gz ==> Pouring gobject-introspection-1.56.0_1.sierra.bottle.tar.gz /usr/local/Cellar/gobject-introspection/1.56.0_1: 173 files, 9.8MB Installing gobject-introspection has failed! {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2442) [C++] Disambiguate Builder::Append overloads
Antoine Pitrou created ARROW-2442: - Summary: [C++] Disambiguate Builder::Append overloads Key: ARROW-2442 URL: https://issues.apache.org/jira/browse/ARROW-2442 Project: Apache Arrow Issue Type: Improvement Components: C++ Affects Versions: 0.9.0 Reporter: Antoine Pitrou See discussion in [https://github.com/apache/arrow/pull/1852#discussion_r179919627] There are various {{Append()}} overloads in Builder and subclasses, some of which append one value, some of which append multiple values at once. The API might be clearer and less error-prone if multiple-append variants were named differently, for example {{AppendValues()}}. Especially with the pointer-taking variants, it's probably easy to call the wrong overload by mistake. The existing methods would have to go through a deprecation cycle. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2400) [C++] Status destructor is expensive
Antoine Pitrou created ARROW-2400: - Summary: [C++] Status destructor is expensive Key: ARROW-2400 URL: https://issues.apache.org/jira/browse/ARROW-2400 Project: Apache Arrow Issue Type: Improvement Affects Versions: 0.9.0 Reporter: Antoine Pitrou Let's take the following micro-benchmark (in Python): {code:bash} $ python -m timeit -s "import pyarrow as pa; data = [b'xx' for i in range(1)]" "pa.array(data, type=pa.binary())" 1000 loops, best of 3: 784 usec per loop {code} If I replace the Status destructor with a no-op: {code:c++} ~Status() { } {code} then the benchmark result becomes: {code:bash} $ python -m timeit -s "import pyarrow as pa; data = [b'xx' for i in range(1)]" "pa.array(data, type=pa.binary())" 1000 loops, best of 3: 561 usec per loop {code} This is almost a 30% win. I get similar results on the conversion benchmarks in the benchmark suite. I'm unsure about the explanation. In the common case, {{delete _state}} should be extremely fast, since the state is NULL. Yet, it seems it adds significant overhead. Perhaps because of exception handling? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2389) [C++] Add StatusCode::OverflowError
Antoine Pitrou created ARROW-2389: - Summary: [C++] Add StatusCode::OverflowError Key: ARROW-2389 URL: https://issues.apache.org/jira/browse/ARROW-2389 Project: Apache Arrow Issue Type: Wish Components: C++ Affects Versions: 0.9.0 Reporter: Antoine Pitrou It may be useful to have a {{StatusCode::OverflowError}} return code, to signal that something overflowed allowed limits (e.g. the 2GB limit for string or binary values). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2390) [C++/Python] CheckPyError() could inspect exception type
Antoine Pitrou created ARROW-2390: - Summary: [C++/Python] CheckPyError() could inspect exception type Key: ARROW-2390 URL: https://issues.apache.org/jira/browse/ARROW-2390 Project: Apache Arrow Issue Type: Wish Components: C++, Python Affects Versions: 0.9.0 Reporter: Antoine Pitrou Currently, {{CheckPyError}} always chooses an "unknown error" status. But it could inspect the Python exception and choose, e.g. "type error" for a {{TypeError}} exception, etc. See also ARROW-2389. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
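The dispatch being suggested can be sketched in Python. The status category names below are hypothetical stand-ins for illustration, not the actual {{StatusCode}} enumerators:

```python
# Hypothetical mapping from a Python exception type to an
# Arrow-style status category, as the issue suggests
# CheckPyError could perform.
_STATUS_BY_EXC = {
    TypeError: "TypeError",
    KeyError: "KeyError",
    ValueError: "Invalid",
    MemoryError: "OutOfMemory",
}

def status_for_exception(exc):
    # isinstance() is used so subclasses of a mapped
    # exception type pick up the same status category.
    for exc_type, status in _STATUS_BY_EXC.items():
        if isinstance(exc, exc_type):
            return status
    return "UnknownError"

assert status_for_exception(TypeError("bad")) == "TypeError"
assert status_for_exception(RuntimeError("x")) == "UnknownError"
```

In the C++ helper this would amount to a few {{PyErr_GivenExceptionMatches}} checks against the standard exception types before falling back to the generic status.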
[jira] [Created] (ARROW-2402) [C++] FixedSizeBinaryBuilder::Append lacks "const char*" overload
Antoine Pitrou created ARROW-2402: - Summary: [C++] FixedSizeBinaryBuilder::Append lacks "const char*" overload Key: ARROW-2402 URL: https://issues.apache.org/jira/browse/ARROW-2402 Project: Apache Arrow Issue Type: Improvement Components: C++ Affects Versions: 0.9.0 Reporter: Antoine Pitrou This implies that calling {{FixedSizeBinaryBuilder::Append}} with a "const char*" argument currently instantiates a temporary {{std::string}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005)