[jira] [Updated] (ARROW-5448) [CI] MinGW build failures on AppVeyor
[ https://issues.apache.org/jira/browse/ARROW-5448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-5448:
----------------------------------
    Labels: pull-request-available  (was: )

> [CI] MinGW build failures on AppVeyor
> -------------------------------------
>
> Key: ARROW-5448
> URL: https://issues.apache.org/jira/browse/ARROW-5448
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Continuous Integration
> Reporter: Antoine Pitrou
> Assignee: Kouhei Sutou
> Priority: Blocker
> Labels: pull-request-available
>
> Apparently the NumPy package is broken. See
> https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/24922425/job/9yoq08uepk5p6dwb
> {code}
> -- Found PythonLibs: C:/msys64/mingw32/lib/libpython3.7m.dll.a
> CMake Error at cmake_modules/FindNumPy.cmake:62 (message):
>   NumPy import failure:
>   Traceback (most recent call last):
>     File "C:/msys64/mingw32/lib/python3.7/site-packages\numpy\core\__init__.py", line 40, in <module>
>       from . import multiarray
>     File "C:/msys64/mingw32/lib/python3.7/site-packages\numpy\core\multiarray.py", line 12, in <module>
>       from . import overrides
>     File "C:/msys64/mingw32/lib/python3.7/site-packages\numpy\core\overrides.py", line 6, in <module>
>       from numpy.core._multiarray_umath import (
>   ImportError: DLL load failed: The specified module could not be found.
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (ARROW-1957) [Python] Write nanosecond timestamps using new NANO LogicalType Parquet unit
[ https://issues.apache.org/jira/browse/ARROW-1957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852680#comment-16852680 ]

TP Boudreau commented on ARROW-1957:
------------------------------------

Yes, thanks for assigning it.

> [Python] Write nanosecond timestamps using new NANO LogicalType Parquet unit
> ----------------------------------------------------------------------------
>
> Key: ARROW-1957
> URL: https://issues.apache.org/jira/browse/ARROW-1957
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Affects Versions: 0.8.0
> Environment: Python 3.6.4. Mac OSX and CentOS Linux release 7.3.1611. Pandas 0.21.1.
> Reporter: Jordan Samuels
> Assignee: TP Boudreau
> Priority: Minor
> Labels: parquet
> Fix For: 0.14.0
>
> The following code
> {code}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
>
> n = 3
> df = pd.DataFrame({'x': range(n)},
>                   index=pd.DatetimeIndex(start='2017-01-01', freq='1n', periods=n))
> pq.write_table(pa.Table.from_pandas(df), '/tmp/t.parquet')
> {code}
> results in:
> {{ArrowInvalid: Casting from timestamp[ns] to timestamp[us] would lose data: 14832288001}}
> The desired effect is that we can save nanosecond resolution without losing
> precision (e.g. conversion to ms). Note that if {{freq='1u'}} is used, the
> code runs properly.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Assigned] (ARROW-1837) [Java] Unable to read unsigned integers outside signed range for bit width in integration tests
[ https://issues.apache.org/jira/browse/ARROW-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Micah Kornfield reassigned ARROW-1837:
--------------------------------------
    Assignee: Micah Kornfield

> [Java] Unable to read unsigned integers outside signed range for bit width in integration tests
> -----------------------------------------------------------------------------------------------
>
> Key: ARROW-1837
> URL: https://issues.apache.org/jira/browse/ARROW-1837
> Project: Apache Arrow
> Issue Type: Bug
> Components: Java
> Reporter: Wes McKinney
> Assignee: Micah Kornfield
> Priority: Blocker
> Labels: columnar-format-1.0
> Fix For: 0.14.0
> Attachments: generated_primitive.json
>
> I believe this was introduced recently (perhaps in the refactors), but there
> was a problem where the integration tests weren't being properly run that hid
> the error from us.
> See https://github.com/apache/arrow/pull/1294#issuecomment-345553066

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Resolved] (ARROW-5429) [Java] Provide alternative buffer allocation policy
[ https://issues.apache.org/jira/browse/ARROW-5429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Micah Kornfield resolved ARROW-5429.
------------------------------------
    Resolution: Fixed
    Fix Version/s: 0.14.0

Issue resolved by pull request 4400
[https://github.com/apache/arrow/pull/4400]

> [Java] Provide alternative buffer allocation policy
> ---------------------------------------------------
>
> Key: ARROW-5429
> URL: https://issues.apache.org/jira/browse/ARROW-5429
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Java
> Reporter: Liya Fan
> Assignee: Liya Fan
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.14.0
>
> Time Spent: 4.5h
> Remaining Estimate: 0h
>
> The current buffer allocation policy works like this:
> * If the requested buffer size is greater than or equal to the chunk size,
>   the buffer size is used as is.
> * If the requested size is below the chunk size, the buffer size is rounded
>   up to the next power of 2.
> This policy can waste memory in some cases. For example, if we request a
> buffer of size 10 MB, Arrow rounds the buffer size up to 16 MB. If we only
> need 10 MB, this wastes (16 - 10) / 10 = 60% of the memory actually needed.
> So in this proposal, we provide another policy: the rounded buffer size must
> be a multiple of some memory unit, like 32 KB. This policy has two benefits:
> # The wasted memory cannot exceed one memory unit (32 KB), which is much
>   smaller than with the power-of-two policy.
> # This is the memory allocation policy adopted by some computation engines
>   (e.g. Apache Flink).

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
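The trade-off described in the proposal is easy to see numerically. A small illustrative sketch in plain Python (not the actual Arrow Java allocator API) comparing the two rounding policies for the 10 MB request from the example:

```python
def round_pow2(size: int) -> int:
    """Round up to the next power of two (power-of-two policy)."""
    n = 1
    while n < size:
        n <<= 1
    return n

UNIT = 32 * 1024  # 32 KB memory unit from the proposal

def round_to_unit(size: int, unit: int = UNIT) -> int:
    """Round up to the next multiple of the memory unit (proposed policy)."""
    return -(-size // unit) * unit  # ceiling division

request = 10 * 1024 * 1024           # 10 MB request
pow2_size = round_pow2(request)      # power-of-two policy -> 16 MB
unit_size = round_to_unit(request)   # 10 MB is already a multiple of 32 KB

waste_pow2 = (pow2_size - request) / request  # 60% overhead, as in the issue
waste_unit = (unit_size - request) / request  # 0% here; at most one unit in general
```

Under the unit policy the overhead is bounded by `UNIT - 1` bytes regardless of request size, whereas power-of-two rounding can waste almost half the allocation.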
[jira] [Resolved] (ARROW-5420) [Java] Implement or remove getCurrentSizeInBytes in VariableWidthVector
[ https://issues.apache.org/jira/browse/ARROW-5420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Micah Kornfield resolved ARROW-5420.
------------------------------------
    Resolution: Fixed
    Fix Version/s: 0.14.0

Issue resolved by pull request 4390
[https://github.com/apache/arrow/pull/4390]

> [Java] Implement or remove getCurrentSizeInBytes in VariableWidthVector
> -----------------------------------------------------------------------
>
> Key: ARROW-5420
> URL: https://issues.apache.org/jira/browse/ARROW-5420
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Java
> Reporter: Ji Liu
> Assignee: Ji Liu
> Priority: Minor
> Labels: pull-request-available
> Fix For: 0.14.0
>
> Time Spent: 4.5h
> Remaining Estimate: 0h
>
> VariableWidthVector#getCurrentSizeInBytes doesn't seem to have been
> implemented. We should implement it or just remove it.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (ARROW-5458) Apache Arrow parallel CRC32c computation optimization
[ https://issues.apache.org/jira/browse/ARROW-5458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852610#comment-16852610 ]

Yuqi Gu commented on ARROW-5458:
--------------------------------

PR: https://github.com/apache/arrow/pull/4427

> Apache Arrow parallel CRC32c computation optimization
> -----------------------------------------------------
>
> Key: ARROW-5458
> URL: https://issues.apache.org/jira/browse/ARROW-5458
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Yuqi Gu
> Priority: Minor
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> ARMv8 defines the VMULL/PMULL crypto instructions. This patch optimizes the
> CRC32c calculation with those instructions, when available, rather than the
> original linear CRC instructions.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Updated] (ARROW-5458) Apache Arrow parallel CRC32c computation optimization
[ https://issues.apache.org/jira/browse/ARROW-5458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-5458:
----------------------------------
    Labels: pull-request-available  (was: )

> Apache Arrow parallel CRC32c computation optimization
> -----------------------------------------------------
>
> Key: ARROW-5458
> URL: https://issues.apache.org/jira/browse/ARROW-5458
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Yuqi Gu
> Priority: Minor
> Labels: pull-request-available
>
> ARMv8 defines the VMULL/PMULL crypto instructions. This patch optimizes the
> CRC32c calculation with those instructions, when available, rather than the
> original linear CRC instructions.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Created] (ARROW-5458) Apache Arrow parallel CRC32c computation optimization
Yuqi Gu created ARROW-5458:
---------------------------

Summary: Apache Arrow parallel CRC32c computation optimization
Key: ARROW-5458
URL: https://issues.apache.org/jira/browse/ARROW-5458
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Yuqi Gu


ARMv8 defines the VMULL/PMULL crypto instructions. This patch optimizes the
CRC32c calculation with those instructions, when available, rather than the
original linear CRC instructions.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Created] (ARROW-5457) [GLib][Plasma] Environment variable name for test is wrong
Kouhei Sutou created ARROW-5457:
--------------------------------

Summary: [GLib][Plasma] Environment variable name for test is wrong
Key: ARROW-5457
URL: https://issues.apache.org/jira/browse/ARROW-5457
Project: Apache Arrow
Issue Type: Bug
Components: GLib
Affects Versions: 0.13.0
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Updated] (ARROW-5457) [GLib][Plasma] Environment variable name for test is wrong
[ https://issues.apache.org/jira/browse/ARROW-5457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-5457:
----------------------------------
    Labels: pull-request-available  (was: )

> [GLib][Plasma] Environment variable name for test is wrong
> ----------------------------------------------------------
>
> Key: ARROW-5457
> URL: https://issues.apache.org/jira/browse/ARROW-5457
> Project: Apache Arrow
> Issue Type: Bug
> Components: GLib
> Affects Versions: 0.13.0
> Reporter: Kouhei Sutou
> Assignee: Kouhei Sutou
> Priority: Minor
> Labels: pull-request-available

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Updated] (ARROW-5456) [GLib][Plasma] Installed plasma-glib may be used on building document
[ https://issues.apache.org/jira/browse/ARROW-5456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-5456:
----------------------------------
    Labels: pull-request-available  (was: )

> [GLib][Plasma] Installed plasma-glib may be used on building document
> ---------------------------------------------------------------------
>
> Key: ARROW-5456
> URL: https://issues.apache.org/jira/browse/ARROW-5456
> Project: Apache Arrow
> Issue Type: Bug
> Components: GLib
> Affects Versions: 0.13.0
> Reporter: Kouhei Sutou
> Assignee: Kouhei Sutou
> Priority: Minor
> Labels: pull-request-available

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Created] (ARROW-5456) [GLib][Plasma] Installed plasma-glib may be used on building document
Kouhei Sutou created ARROW-5456:
--------------------------------

Summary: [GLib][Plasma] Installed plasma-glib may be used on building document
Key: ARROW-5456
URL: https://issues.apache.org/jira/browse/ARROW-5456
Project: Apache Arrow
Issue Type: Bug
Components: GLib
Affects Versions: 0.13.0
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Updated] (ARROW-5073) [C++] Build toolchain support for libcurl
[ https://issues.apache.org/jira/browse/ARROW-5073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-5073:
--------------------------------
    Fix Version/s: (was: 0.14.0)
                   0.15.0

> [C++] Build toolchain support for libcurl
> -----------------------------------------
>
> Key: ARROW-5073
> URL: https://issues.apache.org/jira/browse/ARROW-5073
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++
> Reporter: Wes McKinney
> Priority: Major
> Labels: filesystem
> Fix For: 0.15.0
>
> libcurl can be used in a number of different situations (e.g. TensorFlow uses
> it for GCS interactions:
> https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/platform/cloud/gcs_file_system.cc),
> so this will likely be required once we begin to tackle that problem.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Updated] (ARROW-5344) [C++] Use ArrayDataVisitor in implementation of dictionary unpacking in compute/kernels/cast.cc
[ https://issues.apache.org/jira/browse/ARROW-5344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-5344:
--------------------------------
    Fix Version/s: (was: 0.14.0)

> [C++] Use ArrayDataVisitor in implementation of dictionary unpacking in compute/kernels/cast.cc
> -----------------------------------------------------------------------------------------------
>
> Key: ARROW-5344
> URL: https://issues.apache.org/jira/browse/ARROW-5344
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Wes McKinney
> Priority: Major
>
> Follow-up to code review from ARROW-3144.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Assigned] (ARROW-5334) [C++] Add "Type" to names of arrow::Integer, arrow::FloatingPoint classes for consistency
[ https://issues.apache.org/jira/browse/ARROW-5334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney reassigned ARROW-5334:
-----------------------------------
    Assignee: Wes McKinney

> [C++] Add "Type" to names of arrow::Integer, arrow::FloatingPoint classes for consistency
> -----------------------------------------------------------------------------------------
>
> Key: ARROW-5334
> URL: https://issues.apache.org/jira/browse/ARROW-5334
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Wes McKinney
> Assignee: Wes McKinney
> Priority: Major
> Fix For: 0.14.0
>
> These intermediate classes used for template metaprogramming (in particular,
> with {{std::is_base_of}}) have names that are inconsistent with the rest of
> the data types. For clarity, I think we should add "Type" to these class
> names and others like them.
> Please do this after ARROW-3144.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Updated] (ARROW-2446) [C++] SliceBuffer on CudaBuffer should return CudaBuffer
[ https://issues.apache.org/jira/browse/ARROW-2446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-2446:
--------------------------------
    Fix Version/s: (was: 0.14.0)

> [C++] SliceBuffer on CudaBuffer should return CudaBuffer
> --------------------------------------------------------
>
> Key: ARROW-2446
> URL: https://issues.apache.org/jira/browse/ARROW-2446
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++, GPU
> Affects Versions: 0.9.0
> Reporter: Antoine Pitrou
> Priority: Major
>
> Currently {{SliceBuffer}} on a {{CudaBuffer}} returns a plain {{Buffer}}
> instance, which is dangerous for unsuspecting consumers.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Updated] (ARROW-4845) [C++] Compiler warnings on Windows MingW64
[ https://issues.apache.org/jira/browse/ARROW-4845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-4845:
--------------------------------
    Summary: [C++] Compiler warnings on Windows MingW64  (was: Compiler warnings on Windows)

> [C++] Compiler warnings on Windows MingW64
> ------------------------------------------
>
> Key: ARROW-4845
> URL: https://issues.apache.org/jira/browse/ARROW-4845
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 0.12.1
> Reporter: Jeroen
> Priority: Major
> Fix For: 0.14.0
>
> I am seeing the warnings below when compiling the R bindings on Windows. Most
> of these seem easy to fix (comparing int with size_t or int32 with int64).
> {code}
> array.cpp: In function 'Rcpp::LogicalVector Array__Mask(const std::shared_ptr&)':
> array.cpp:102:24: warning: comparison of integer expressions of different signedness: 'size_t' {aka 'long long unsigned int'} and 'int64_t' {aka 'long long int'} [-Wsign-compare]
>    for (size_t i = 0; i < array->length(); i++, bitmap_reader.Next()) {
>                       ~~^
> /mingw64/bin/g++ -std=gnu++11 -I"C:/PROGRA~1/R/R-testing/include" -DNDEBUG -DARROW_STATIC -I"C:/R/library/Rcpp/include" -O2 -Wall -mtune=generic -c array__to_vector.cpp -o array__to_vector.o
> array__to_vector.cpp: In member function 'virtual arrow::Status arrow::r::Converter_Boolean::Ingest_some_nulls(SEXP, const std::shared_ptr&, R_xlen_t, R_xlen_t) const':
> array__to_vector.cpp:254:28: warning: comparison of integer expressions of different signedness: 'size_t' {aka 'long long unsigned int'} and 'R_xlen_t' {aka 'long long int'} [-Wsign-compare]
>    for (size_t i = 0; i < n; i++, data_reader.Next(), null_reader.Next(), ++p_data) {
>                       ~~^~~
> array__to_vector.cpp:258:28: warning: comparison of integer expressions of different signedness: 'size_t' {aka 'long long unsigned int'} and 'R_xlen_t' {aka 'long long int'} [-Wsign-compare]
>    for (size_t i = 0; i < n; i++, data_reader.Next(), ++p_data) {
>                       ~~^~~
> array__to_vector.cpp: In member function 'virtual arrow::Status arrow::r::Converter_Decimal::Ingest_some_nulls(SEXP, const std::shared_ptr&, R_xlen_t, R_xlen_t) const':
> array__to_vector.cpp:473:28: warning: comparison of integer expressions of different signedness: 'size_t' {aka 'long long unsigned int'} and 'R_xlen_t' {aka 'long long int'} [-Wsign-compare]
>    for (size_t i = 0; i < n; i++, bitmap_reader.Next(), ++p_data) {
>                       ~~^~~
> array__to_vector.cpp:478:28: warning: comparison of integer expressions of different signedness: 'size_t' {aka 'long long unsigned int'} and 'R_xlen_t' {aka 'long long int'} [-Wsign-compare]
>    for (size_t i = 0; i < n; i++, ++p_data) {
>                       ~~^~~
> array__to_vector.cpp: In member function 'virtual arrow::Status arrow::r::Converter_Int64::Ingest_some_nulls(SEXP, const std::shared_ptr&, R_xlen_t, R_xlen_t) const':
> array__to_vector.cpp:515:28: warning: comparison of integer expressions of different signedness: 'size_t' {aka 'long long unsigned int'} and 'R_xlen_t' {aka 'long long int'} [-Wsign-compare]
>    for (size_t i = 0; i < n; i++, bitmap_reader.Next(), ++p_data) {
>                       ~~^~~
> array__to_vector.cpp: In instantiation of 'arrow::Status arrow::r::SomeNull_Ingest(SEXP, R_xlen_t, R_xlen_t, const array_value_type*, const std::shared_ptr&, Lambda) [with int RTYPE = 14; array_value_type = long long int; Lambda = arrow::r::Converter_Date64::Ingest_some_nulls(SEXP, const std::shared_ptr&, R_xlen_t, R_xlen_t) const::; SEXP = SEXPREC*; R_xlen_t = long long int]':
> array__to_vector.cpp:366:77: required from here
> array__to_vector.cpp:116:26: warning: comparison of integer expressions of different signedness: 'size_t' {aka 'long long unsigned int'} and 'R_xlen_t' {aka 'long long int'} [-Wsign-compare]
>    for (size_t i = 0; i < n; i++, bitmap_reader.Next(), ++p_data, ++p_values) {
>                       ~~^~~
> array__to_vector.cpp: In instantiation of 'arrow::Status arrow::r::SomeNull_Ingest(SEXP, R_xlen_t, R_xlen_t, const array_value_type*, const std::shared_ptr&, Lambda) [with int RTYPE = 13; array_value_type = unsigned char; Lambda = arrow::r::Converter_Dictionary::Ingest_some_nulls_Impl(SEXP, const std::shared_ptr&, R_xlen_t, R_xlen_t) const [with Type = arrow::UInt8Type; SEXP = SEXPREC*; R_xlen_t = long long int]::; SEXP = SEXPREC*; R_xlen_t = long long int]':
> array__to_vector.cpp:341:47: required from 'arrow::Status arrow::r::Converter_Dictionary::Ingest_some_nulls_Impl(SEXP, const std::shared_ptr&, R_xlen_t, R_xlen_t) const [with Type = arrow::UInt8Type;
[jira] [Updated] (ARROW-4838) [C++] Implement safe Make constructor
[ https://issues.apache.org/jira/browse/ARROW-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-4838:
--------------------------------
    Fix Version/s: (was: 0.14.0)

> [C++] Implement safe Make constructor
> -------------------------------------
>
> Key: ARROW-4838
> URL: https://issues.apache.org/jira/browse/ARROW-4838
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Francois Saint-Jacques
> Priority: Major
>
> The following classes need validating constructors:
> * ArrayData
> * ChunkedArray
> * RecordBatch
> * Column
> * Table

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Assigned] (ARROW-5448) [CI] MinGW build failures on AppVeyor
[ https://issues.apache.org/jira/browse/ARROW-5448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kouhei Sutou reassigned ARROW-5448:
-----------------------------------
    Assignee: Kouhei Sutou

> [CI] MinGW build failures on AppVeyor
> -------------------------------------
>
> Key: ARROW-5448
> URL: https://issues.apache.org/jira/browse/ARROW-5448
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Continuous Integration
> Reporter: Antoine Pitrou
> Assignee: Kouhei Sutou
> Priority: Blocker
>
> Apparently the NumPy package is broken. See
> https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/24922425/job/9yoq08uepk5p6dwb
> {code}
> -- Found PythonLibs: C:/msys64/mingw32/lib/libpython3.7m.dll.a
> CMake Error at cmake_modules/FindNumPy.cmake:62 (message):
>   NumPy import failure:
>   Traceback (most recent call last):
>     File "C:/msys64/mingw32/lib/python3.7/site-packages\numpy\core\__init__.py", line 40, in <module>
>       from . import multiarray
>     File "C:/msys64/mingw32/lib/python3.7/site-packages\numpy\core\multiarray.py", line 12, in <module>
>       from . import overrides
>     File "C:/msys64/mingw32/lib/python3.7/site-packages\numpy\core\overrides.py", line 6, in <module>
>       from numpy.core._multiarray_umath import (
>   ImportError: DLL load failed: The specified module could not be found.
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Updated] (ARROW-4752) [Rust] Add explicit SIMD vectorization for the divide kernel
[ https://issues.apache.org/jira/browse/ARROW-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-4752:
--------------------------------
    Fix Version/s: (was: 0.14.0)

> [Rust] Add explicit SIMD vectorization for the divide kernel
> ------------------------------------------------------------
>
> Key: ARROW-4752
> URL: https://issues.apache.org/jira/browse/ARROW-4752
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Rust
> Reporter: Paddy Horan
> Assignee: Paddy Horan
> Priority: Major

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Updated] (ARROW-4701) [C++] Add JSON chunker benchmarks
[ https://issues.apache.org/jira/browse/ARROW-4701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-4701:
--------------------------------
    Fix Version/s: (was: 0.14.0)

> [C++] Add JSON chunker benchmarks
> ---------------------------------
>
> Key: ARROW-4701
> URL: https://issues.apache.org/jira/browse/ARROW-4701
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++
> Reporter: Benjamin Kietzman
> Assignee: Benjamin Kietzman
> Priority: Minor
>
> The JSON chunker is not currently benchmarked or tested, but it is a
> necessary component of a multithreaded reader.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Updated] (ARROW-4757) [C++] Nested chunked array support
[ https://issues.apache.org/jira/browse/ARROW-4757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-4757:
--------------------------------
    Fix Version/s: (was: 0.14.0)

> [C++] Nested chunked array support
> ----------------------------------
>
> Key: ARROW-4757
> URL: https://issues.apache.org/jira/browse/ARROW-4757
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Philipp Moritz
> Priority: Major
>
> Dear all,
> I'm currently trying to lift the 2GB limit on the Python serialization. For
> this, I implemented a chunked union builder to split the array into smaller
> arrays.
> However, some of the children of the union array can be ListArrays, which can
> themselves contain UnionArrays, which can contain ListArrays, etc. I'm at a
> bit of a loss how to handle this. In principle I'd like to chunk the children
> too. However, currently UnionArrays can only have children of type Array, and
> there is no way to treat a chunked array (which is a vector of Arrays) as an
> Array to store it as a child of a UnionArray. Any ideas how to best support
> this use case?
> -- Philipp.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Updated] (ARROW-4709) [C++] Optimize for ordered JSON fields
[ https://issues.apache.org/jira/browse/ARROW-4709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-4709:
--------------------------------
    Fix Version/s: (was: 0.14.0)

> [C++] Optimize for ordered JSON fields
> --------------------------------------
>
> Key: ARROW-4709
> URL: https://issues.apache.org/jira/browse/ARROW-4709
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++
> Reporter: Benjamin Kietzman
> Assignee: Benjamin Kietzman
> Priority: Minor
>
> Fields appear consistently ordered in most JSON data in the wild, but the
> JSON parser currently looks fields up in a hash table. The ordering can
> probably be exploited to yield better performance when looking up field
> indices.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
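The optimization described above can be sketched outside the C++ parser: remember where the previous lookup landed and probe the next slot first, falling back to the hash table only when the data is out of order. All names below are hypothetical illustrations, not the Arrow implementation:

```python
class FieldIndex:
    """Field-name -> index lookup that exploits consistently ordered fields."""

    def __init__(self, names):
        self.names = list(names)
        self.by_name = {n: i for i, n in enumerate(self.names)}  # slow path
        self.next_guess = 0  # where we expect the next lookup to land

    def lookup(self, name):
        i = self.next_guess
        if i < len(self.names) and self.names[i] == name:
            self.next_guess = i + 1  # fast path: ordered data, no hashing
            return i
        i = self.by_name[name]       # fallback: hash-table lookup
        self.next_guess = i + 1
        return i

idx = FieldIndex(['a', 'b', 'c'])
assert [idx.lookup(k) for k in ('a', 'b', 'c')] == [0, 1, 2]  # all fast-path
assert idx.lookup('b') == 1  # out-of-order fields still resolve via fallback
```

For JSON objects whose keys always arrive in the same order, every lookup after the first row hits the string-compare fast path and never touches the hash table.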
[jira] [Updated] (ARROW-4695) [JS] Tests timing out on Travis
[ https://issues.apache.org/jira/browse/ARROW-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-4695:
--------------------------------
    Fix Version/s: (was: 0.14.0)

> [JS] Tests timing out on Travis
> -------------------------------
>
> Key: ARROW-4695
> URL: https://issues.apache.org/jira/browse/ARROW-4695
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Continuous Integration, JavaScript
> Affects Versions: JS-0.4.0
> Reporter: Brian Hulette
> Priority: Major
> Labels: ci-failure, travis-ci
>
> Example build: https://travis-ci.org/apache/arrow/jobs/498967250
> JS tests sometimes fail with the following message:
> {noformat}
> > apache-arrow@ test /home/travis/build/apache/arrow/js
> > NODE_NO_WARNINGS=1 gulp test
> [22:14:01] Using gulpfile ~/build/apache/arrow/js/gulpfile.js
> [22:14:01] Starting 'test'...
> [22:14:01] Starting 'test:ts'...
> [22:14:49] Finished 'test:ts' after 47 s
> [22:14:49] Starting 'test:src'...
> [22:15:27] Finished 'test:src' after 38 s
> [22:15:27] Starting 'test:apache-arrow'...
> No output has been received in the last 10m0s, this potentially indicates a
> stalled build or something wrong with the build itself.
> Check the details on how to adjust your build configuration on:
> https://docs.travis-ci.com/user/common-build-problems/#Build-times-out-because-no-output-was-received
> The build has been terminated
> {noformat}
> I thought maybe we were just running up against some time limit, but that
> particular build was terminated at 22:25:27, exactly ten minutes after the
> last output at 22:15:27. So it does seem like the build is somehow stalling.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Updated] (ARROW-4668) [C++] Support GCP BigQuery Storage API
[ https://issues.apache.org/jira/browse/ARROW-4668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-4668:
--------------------------------
    Fix Version/s: (was: 0.14.0)
                   0.15.0

> [C++] Support GCP BigQuery Storage API
> --------------------------------------
>
> Key: ARROW-4668
> URL: https://issues.apache.org/jira/browse/ARROW-4668
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++
> Reporter: Micah Kornfield
> Assignee: Micah Kornfield
> Priority: Major
> Labels: filesystem
> Fix For: 0.15.0
>
> Docs: https://cloud.google.com/bigquery/docs/reference/storage/
> We need to investigate the best way to do this; maybe just see if we can
> build our client on GCP (once a protobuf definition is published to
> https://github.com/googleapis/googleapis/tree/master/google)?
> This will serve as a parent issue, and sub-issues will be added for subtasks
> if necessary.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Updated] (ARROW-4677) [Python] serialization does not consider ndarray endianness
[ https://issues.apache.org/jira/browse/ARROW-4677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-4677:
--------------------------------
    Fix Version/s: (was: 0.14.0)

> [Python] serialization does not consider ndarray endianness
> -----------------------------------------------------------
>
> Key: ARROW-4677
> URL: https://issues.apache.org/jira/browse/ARROW-4677
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.12.1
> Environment: pyarrow 0.12.1, numpy 1.16.1, Python 3.7.0, Intel Core i7-7820HQ (macOS 10.13.6)
> Reporter: Gabe Joseph
> Priority: Minor
>
> {{pa.serialize}} does not appear to properly encode the endianness of
> multi-byte data:
> {code}
> # roundtrip.py
> import numpy as np
> import pyarrow as pa
>
> arr = np.array([1], dtype=np.dtype('>i2'))
> buf = pa.serialize(arr).to_buffer()
> result = pa.deserialize(buf)
> print(f"Original: {arr.dtype.str}, deserialized: {result.dtype.str}")
> np.testing.assert_array_equal(arr, result)
> {code}
> {code}
> $ pipenv run python roundtrip.py
> Original: >i2, deserialized: <i2
> Traceback (most recent call last):
>   File "roundtrip.py", line 10, in <module>
>     np.testing.assert_array_equal(arr, result)
>   File "/Users/gabejoseph/.local/share/virtualenvs/arrow-roundtrip-1xVSuBtp/lib/python3.7/site-packages/numpy/testing/_private/utils.py", line 896, in assert_array_equal
>     verbose=verbose, header='Arrays are not equal')
>   File "/Users/gabejoseph/.local/share/virtualenvs/arrow-roundtrip-1xVSuBtp/lib/python3.7/site-packages/numpy/testing/_private/utils.py", line 819, in assert_array_compare
>     raise AssertionError(msg)
> AssertionError:
> Arrays are not equal
> Mismatch: 100%
> Max absolute difference: 255
> Max relative difference: 0.99609375
>  x: array([1], dtype=int16)
>  y: array([256], dtype=int16)
> {code}
> The data of the deserialized array is identical (big-endian), but the dtype
> Arrow assigns to it doesn't reflect its endianness (presumably it uses the
> system endianness, which is little).

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
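The corrupted value in the report follows directly from dropping the byte order: the two bytes of big-endian 1 (`0x00 0x01`), reinterpreted as little-endian, read back as 256. A numpy-only sketch of the failure mode, plus the byte-swapping workaround (convert to native order before serializing):

```python
import numpy as np

big = np.array([1], dtype='>i2')  # big-endian int16, bytes 0x00 0x01

# Reinterpret the same two bytes with the opposite byte order -- this is
# effectively what happens when the dtype's endianness is discarded:
misread = big.view('<i2')
assert misread[0] == 256  # the corrupted value from the report

# Workaround before serializing: swap to native byte order; values preserved.
native = big.astype(big.dtype.newbyteorder('='))
assert native[0] == 1
```

Applying `astype(dtype.newbyteorder('='))` before handing arrays to the serializer sidesteps the bug, at the cost of one copy for non-native-order input.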
[jira] [Updated] (ARROW-4633) [Python] ParquetFile.read(use_threads=False) creates ThreadPool anyway
[ https://issues.apache.org/jira/browse/ARROW-4633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-4633:
--------------------------------
    Fix Version/s: (was: 0.14.0)

> [Python] ParquetFile.read(use_threads=False) creates ThreadPool anyway
> ----------------------------------------------------------------------
>
> Key: ARROW-4633
> URL: https://issues.apache.org/jira/browse/ARROW-4633
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.11.1, 0.12.0
> Environment: Linux, Python 3.7.1, pyarrow.__version__ = 0.12.0
> Reporter: Taylor Johnson
> Priority: Minor
> Labels: newbie, parquet
>
> The following code seems to suggest that ParquetFile.read(use_threads=False)
> still creates a ThreadPool. The same is observed with
> ParquetFile.read_row_group(use_threads=False).
> This does not appear to be a problem in
> pyarrow.Table.to_pandas(use_threads=False).
> I've tried tracing the error. Starting in python/pyarrow/parquet.py, both
> ParquetReader.read_all() and ParquetReader.read_row_group() pass the
> use_threads input along to self.reader, which is a ParquetReader imported
> from _parquet.pyx.
> Following the calls into python/pyarrow/_parquet.pyx, we see that
> ParquetReader.read_all() and ParquetReader.read_row_group() contain the
> following code, which seems a bit suspicious:
> {code}
> if use_threads:
>     self.set_use_threads(use_threads)
> {code}
> Why not just always call self.set_use_threads(use_threads)?
> ParquetReader.set_use_threads simply calls
> self.reader.get().set_use_threads(use_threads). This self.reader is assigned
> as unique_ptr[FileReader]. I think this points to
> cpp/src/parquet/arrow/reader.cc, but I'm not sure about that. The
> FileReader::Impl::ReadRowGroup logic looks OK, as
> ::arrow::internal::GetCpuThreadPool() is only called if use_threads is true.
> The same is true for ReadTable.
> So when is the ThreadPool getting created?
> Example code:
> {code}
> import pandas as pd
> import psutil
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> use_threads = False
> p = psutil.Process()
> print('Starting with {} threads'.format(p.num_threads()))
> df = pd.DataFrame({'x': [0]})
> table = pa.Table.from_pandas(df)
> print('After table creation, {} threads'.format(p.num_threads()))
> df = table.to_pandas(use_threads=use_threads)
> print('table.to_pandas(use_threads={}), {} threads'.format(use_threads, p.num_threads()))
> writer = pq.ParquetWriter('tmp.parquet', table.schema)
> writer.write_table(table)
> writer.close()
> print('After writing parquet file, {} threads'.format(p.num_threads()))
> pf = pq.ParquetFile('tmp.parquet')
> print('After ParquetFile, {} threads'.format(p.num_threads()))
> df = pf.read(use_threads=use_threads).to_pandas()
> print('After pf.read(use_threads={}), {} threads'.format(use_threads, p.num_threads()))
> {code}
> Output:
> {noformat}
> $ python pyarrow_test.py
> Starting with 1 threads
> After table creation, 1 threads
> table.to_pandas(use_threads=False), 1 threads
> After writing parquet file, 1 threads
> After ParquetFile, 1 threads
> After pf.read(use_threads=False), 5 threads
> {noformat}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Updated] (ARROW-4649) [C++/CI/R] Add nightly job that builds `brew install apache-arrow --HEAD`
[ https://issues.apache.org/jira/browse/ARROW-4649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4649: Fix Version/s: (was: 0.14.0) 0.15.0 > [C++/CI/R] Add nightly job that builds `brew install apache-arrow --HEAD` > - > > Key: ARROW-4649 > URL: https://issues.apache.org/jira/browse/ARROW-4649 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Continuous Integration, R >Reporter: Uwe L. Korn >Priority: Major > Labels: nightly, travis-ci > Fix For: 0.15.0 > > > Now that we have an Arrow homebrew formula again and we may want to have it > as a simple setup for R Arrow users, we should add a nightly crossbow task > that checks whether this still builds fine. > To implement this, one should write a new travis.yml like > [https://github.com/apache/arrow/blob/master/dev/tasks/python-wheels/travis.osx.yml] > that calls {{brew install apache-arrow --HEAD}}. This task should then be > added to https://github.com/apache/arrow/blob/master/dev/tasks/tests.yml so > that it is executed as part of the nightly chain. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4649) [C++/CI/R] Add nightly job that builds `brew install apache-arrow --HEAD`
[ https://issues.apache.org/jira/browse/ARROW-4649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4649: Labels: nightly travis-ci (was: travis-ci) > [C++/CI/R] Add nightly job that builds `brew install apache-arrow --HEAD` > - > > Key: ARROW-4649 > URL: https://issues.apache.org/jira/browse/ARROW-4649 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Continuous Integration, R >Reporter: Uwe L. Korn >Priority: Major > Labels: nightly, travis-ci > Fix For: 0.14.0 > > > Now that we have an Arrow homebrew formula again and we may want to have it > as a simple setup for R Arrow users, we should add a nightly crossbow task > that checks whether this still builds fine. > To implement this, one should write a new travis.yml like > [https://github.com/apache/arrow/blob/master/dev/tasks/python-wheels/travis.osx.yml] > that calls {{brew install apache-arrow --HEAD}}. This task should then be > added to https://github.com/apache/arrow/blob/master/dev/tasks/tests.yml so > that it is executed as part of the nightly chain. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4661) [C++] Consolidate random string generators for use in benchmarks and unittests
[ https://issues.apache.org/jira/browse/ARROW-4661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4661: Fix Version/s: (was: 0.14.0) > [C++] Consolidate random string generators for use in benchmarks and unittests > -- > > Key: ARROW-4661 > URL: https://issues.apache.org/jira/browse/ARROW-4661 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Hatem Helal >Assignee: Hatem Helal >Priority: Minor > > This was discussed in here: > [https://github.com/apache/arrow/pull/3721] > For testing/benchmarking dictionary encoding its useful to control the number > of repeated values and it would also be good to optionally include null > values. The ability to provide a custom alphabet would be handy for > generating strings with unicode characters. > > Also note that a simple PRNG should be used as the group has observed > performance trouble with Mersenne Twister. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
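A consolidated generator along the lines discussed in ARROW-4661 (controllable repetition via a small value pool, optional nulls, a custom alphabet, and a simple LCG instead of Mersenne Twister) might look like the following sketch. All names and parameters here are hypothetical, not the eventual C++ API:

```python
class Lcg:
    """Minimal linear congruential generator (Numerical Recipes constants)."""
    def __init__(self, seed=42):
        self.state = seed

    def next(self):
        self.state = (1664525 * self.state + 1013904223) % (1 << 32)
        return self.state


def random_strings(n, alphabet, min_len, max_len, unique_count, null_prob, seed=42):
    """Generate n strings drawn from a pool of unique_count values.

    A small pool controls repetition (useful for dictionary-encoding
    benchmarks); null_prob injects None values; alphabet is caller-supplied.
    """
    rng = Lcg(seed)
    pool = []
    for _ in range(unique_count):
        length = min_len + rng.next() % (max_len - min_len + 1)
        pool.append("".join(alphabet[rng.next() % len(alphabet)]
                            for _ in range(length)))
    out = []
    for _ in range(n):
        if rng.next() % 1000 < int(null_prob * 1000):
            out.append(None)
        else:
            out.append(pool[rng.next() % unique_count])
    return out
```

Lowering `unique_count` raises the repetition rate, which is exactly the knob dictionary-encoding benchmarks need.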
[jira] [Updated] (ARROW-4648) [C++/Question] Naming/organizational inconsistencies in cpp codebase
[ https://issues.apache.org/jira/browse/ARROW-4648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4648: Fix Version/s: (was: 0.14.0) 0.15.0 > [C++/Question] Naming/organizational inconsistencies in cpp codebase > > > Key: ARROW-4648 > URL: https://issues.apache.org/jira/browse/ARROW-4648 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Krisztian Szucs >Priority: Major > Fix For: 0.15.0 > > > Even after my eyes are used to the codebase, I still find the namings and/or > code organization inconsistent. > h2. File Formats > So arrow already supports a couple of file formats, namely parquet, feather, > json, csv, orc, but their placement in the codebase is quite odd: > - parquet: src/parquet > - feather: src/arrow/ipc/feather > - orc: src/arrow/adapters/orc > - csv: src/arrow/csv > - json: src/arrow/json > I might misunderstand the purpose of these sources, but I'd expect them to be > organized under the same roof. > h2. Inter-Process-Communication vs. Flight > I'd expect flight's functionality from the ipc names. > Flight's placement is a bit odd too, because it has its own codename, it > should be placed under cpp/src - like parquet, plasma, or gandiva. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4631) [C++] Implement serial version of sort computational kernel
[ https://issues.apache.org/jira/browse/ARROW-4631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4631: Fix Version/s: (was: 0.14.0) 0.15.0 > [C++] Implement serial version of sort computational kernel > --- > > Key: ARROW-4631 > URL: https://issues.apache.org/jira/browse/ARROW-4631 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Affects Versions: 0.13.0 >Reporter: Areg Melik-Adamyan >Assignee: Areg Melik-Adamyan >Priority: Major > Labels: analytics > Fix For: 0.15.0 > > > Implement serial version of sort computational kernel. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4591) [Rust] Add explicit SIMD vectorization for aggregation ops in "array_ops"
[ https://issues.apache.org/jira/browse/ARROW-4591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4591: Fix Version/s: (was: 0.14.0) > [Rust] Add explicit SIMD vectorization for aggregation ops in "array_ops" > - > > Key: ARROW-4591 > URL: https://issues.apache.org/jira/browse/ARROW-4591 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Paddy Horan >Assignee: Paddy Horan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4575) [Python] Add Python Flight implementation to integration testing
[ https://issues.apache.org/jira/browse/ARROW-4575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4575: Fix Version/s: (was: 0.14.0) > [Python] Add Python Flight implementation to integration testing > > > Key: ARROW-4575 > URL: https://issues.apache.org/jira/browse/ARROW-4575 > Project: Apache Arrow > Issue Type: Improvement > Components: FlightRPC, Integration, Python >Reporter: David Li >Assignee: David Li >Priority: Major > Labels: flight > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-4567) [C++] Convert Scalar values to Array values with length 1
[ https://issues.apache.org/jira/browse/ARROW-4567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852594#comment-16852594 ] Wes McKinney commented on ARROW-4567: - cc [~fsaintjacques] > [C++] Convert Scalar values to Array values with length 1 > - > > Key: ARROW-4567 > URL: https://issues.apache.org/jira/browse/ARROW-4567 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > Fix For: 0.14.0 > > > A common approach to performing operations on both scalar and array values is > to treat a Scalar as an array of length 1. For example, we cannot currently > use our Cast kernels to cast a Scalar. It would be senseless to create > separate kernel implementations specialized for a single value, and much > easier to promote a scalar to an Array, execute the kernel, then unbox the > result back into a Scalar -- This message was sent by Atlassian JIRA (v7.6.3#76005)
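The promote/execute/unbox pattern described in ARROW-4567 can be sketched with a toy kernel; plain Python lists stand in for Arrow arrays, and `cast_kernel`/`cast_scalar` are hypothetical names, not Arrow APIs:

```python
def cast_kernel(array, to_type):
    """A stand-in array kernel: casts every element of a list."""
    return [to_type(x) for x in array]


def cast_scalar(value, to_type):
    promoted = [value]                        # treat the scalar as a length-1 array
    result = cast_kernel(promoted, to_type)   # reuse the array kernel unchanged
    return result[0]                          # unbox the single result back to a scalar


assert cast_kernel(["1", "2"], int) == [1, 2]
assert cast_scalar("42", int) == 42
```

The point is that the kernel body is written once for arrays and needs no scalar specialization.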
[jira] [Updated] (ARROW-4515) [C++, lint] Use clang-format more efficiently in `check-format` target
[ https://issues.apache.org/jira/browse/ARROW-4515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4515: Fix Version/s: (was: 0.14.0) > [C++, lint] Use clang-format more efficiently in `check-format` target > -- > > Key: ARROW-4515 > URL: https://issues.apache.org/jira/browse/ARROW-4515 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Benjamin Kietzman >Assignee: Benjamin Kietzman >Priority: Minor > > `clang-format` supports command line option `-output-replacements-xml` which > (in the case of no required changes) outputs: > ``` > <?xml version='1.0'?> > <replacements xml:space='preserve' incomplete_format='false'> > </replacements> > ``` > Using this option during `check-format` instead of using python to compute a > diff between formatted and on-disk should speed up that target significantly -- This message was sent by Atlassian JIRA (v7.6.3#76005)
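The proposed check amounts to running `clang-format -output-replacements-xml` and looking for `<replacement>` elements, instead of diffing formatted output against the files on disk. A hypothetical parsing helper might look like this:

```python
import xml.etree.ElementTree as ET


def needs_formatting(xml_text):
    """Return True if clang-format's replacements XML reports any edits."""
    root = ET.fromstring(xml_text)
    return len(root.findall("replacement")) > 0


# clang-format emits an empty <replacements> element for an already-formatted file
clean = ("<?xml version='1.0'?>"
         "<replacements xml:space='preserve' incomplete_format='false'>"
         "</replacements>")
assert needs_formatting(clean) is False
```

Since no reformatted copy of the file has to be produced or compared, the `check-format` target only pays for parsing a tiny XML document per source file.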
[jira] [Updated] (ARROW-4534) [Rust] Build JSON reader for reading record batches from line-delimited JSON files
[ https://issues.apache.org/jira/browse/ARROW-4534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4534: Fix Version/s: (was: 0.14.0) > [Rust] Build JSON reader for reading record batches from line-delimited JSON > files > -- > > Key: ARROW-4534 > URL: https://issues.apache.org/jira/browse/ARROW-4534 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust >Affects Versions: 0.12.0 >Reporter: Neville Dipale >Priority: Major > > Similar to ARROW-694, this is an umbrella issue for supporting reading JSON > line-delimited files in Arrow. > I have a reference implementation at > https://github.com/nevi-me/rust-dataframe/blob/io/json/src/io/json.rs where > I'm building a Rust-based dataframe library using Arrow. > I'd like us to have feature parity with CPP at some point. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4470) [Python] Pyarrow using considerably more memory when reading partitioned Parquet file
[ https://issues.apache.org/jira/browse/ARROW-4470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4470: Fix Version/s: (was: 0.14.0) 0.15.0 > [Python] Pyarrow using considerable more memory when reading partitioned > Parquet file > - > > Key: ARROW-4470 > URL: https://issues.apache.org/jira/browse/ARROW-4470 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.12.0 >Reporter: Ivan SPM >Priority: Major > Labels: datasets, parquet > Fix For: 0.15.0 > > > Hi, > I have a partitioned Parquet table in Impala in HDFS, using Hive metastore, > with the following structure: > {{/data/myparquettable/year=2016}}{{/data/myparquettable/year=2016/myfile_1.prt}} > {{/data/myparquettable/year=2016/myfile_2.prt}} > {{/data/myparquettable/year=2016/myfile_3.prt}} > {{/data/myparquettable/year=2017}} > {{/data/myparquettable/year=2017/myfile_1.prt}} > {{/data/myparquettable/year=2017/myfile_2.prt}} > {{/data/myparquettable/year=2017/myfile_3.prt}} > and so on. I need to work with one partition, so I copied one partition to a > local filesystem: > {{hdfs fs -get /data/myparquettable/year=2017 /local/}} > so now I have some data on the local disk: > {{/local/year=2017/myfile_1.prt }}{{/local/year=2017/myfile_2.prt }} > etc.I tried to read it using Pyarrow: > {{import pyarrow.parquet as pq}}{{pq.read_parquet('/local/year=2017')}} > and it starts reading. The problem is that the local Parquet files are around > 15GB total, and I blew up my machine memory a couple of times because when > reading these files, Pyarrow is using more than 60GB of RAM, and I'm not sure > how much it will take because it never finishes. Is this expected? Is there a > workaround? > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-4479) [Plasma] Add S3 as external store for Plasma
[ https://issues.apache.org/jira/browse/ARROW-4479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852589#comment-16852589 ] Wes McKinney commented on ARROW-4479: - What is the status of this project? > [Plasma] Add S3 as external store for Plasma > > > Key: ARROW-4479 > URL: https://issues.apache.org/jira/browse/ARROW-4479 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ - Plasma >Affects Versions: 0.12.0 >Reporter: Anurag Khandelwal >Assignee: Anurag Khandelwal >Priority: Minor > Labels: pull-request-available > Fix For: 0.14.0 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > Adding S3 as an external store will allow objects to be evicted to S3 when > Plasma runs out of memory capacity. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4482) [Website] Add blog archive page
[ https://issues.apache.org/jira/browse/ARROW-4482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4482: Fix Version/s: (was: 0.14.0) 0.15.0 > [Website] Add blog archive page > --- > > Key: ARROW-4482 > URL: https://issues.apache.org/jira/browse/ARROW-4482 > Project: Apache Arrow > Issue Type: Improvement > Components: Website >Reporter: Wes McKinney >Priority: Major > Fix For: 0.15.0 > > > There's no easy way to get a bulleted list of all blog posts on the Arrow > website. See example archive on my personal blog > http://wesmckinney.com/archives.html > It would be great to have such a generated archive on our website -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4465) [Rust] [DataFusion] Add support for ORDER BY
[ https://issues.apache.org/jira/browse/ARROW-4465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4465: Fix Version/s: (was: 0.14.0) > [Rust] [DataFusion] Add support for ORDER BY > > > Key: ARROW-4465 > URL: https://issues.apache.org/jira/browse/ARROW-4465 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust, Rust - DataFusion >Reporter: Andy Grove >Priority: Major > > As a user, I would like to be able to specify an ORDER BY clause on my query. > Work involved: > * Add OrderBy to LogicalPlan enum > * Write query planner code to translate SQL AST to OrderBy (SQL parser that > we use already supports parsing ORDER BY) > * Implement SortRelation > My high level thoughts on implementing the SortRelation: > * Create Arrow array of uint32 same size as batch and populate such that > each element contains its own index i.e. array will be 0, 1, 2, 3 > * Find a Rust crate for sorting that allows us to provide our own comparison > lambda > * Implement the comparison logic (probably can reuse existing execution code > - see filter.rs for how it implements comparison expressions) > * Use index array to store the result of the sort i.e. no need to rewrite > the whole batch, just the index > * Rewrite the batch after the sort has completed > It would also be good to see how Gandiva has implemented this > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
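The index-array approach outlined in the ticket can be sketched in a few lines; Python stands in for the eventual Rust implementation and the names are hypothetical. A companion array of row indices is sorted by the column values, leaving the batch itself untouched until the final rewrite:

```python
def sort_indices(values):
    """Return row indices that would sort `values`, without moving the data."""
    idx = list(range(len(values)))        # 0, 1, 2, ... one entry per row
    idx.sort(key=lambda i: values[i])     # comparison logic goes here
    return idx


col = [3.0, 1.0, 2.0]
order = sort_indices(col)
assert order == [1, 2, 0]
assert [col[i] for i in order] == [1.0, 2.0, 3.0]  # batch rewritten once, at the end
```

The comparison lambda is where reusable expression-evaluation code (as in filter.rs) would plug in.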
[jira] [Updated] (ARROW-4473) [Website] Add instructions to do a test-deploy of Arrow website and fix bugs
[ https://issues.apache.org/jira/browse/ARROW-4473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4473: Fix Version/s: (was: 0.14.0) 0.15.0 > [Website] Add instructions to do a test-deploy of Arrow website and fix bugs > > > Key: ARROW-4473 > URL: https://issues.apache.org/jira/browse/ARROW-4473 > Project: Apache Arrow > Issue Type: Improvement > Components: Website >Reporter: Wes McKinney >Priority: Major > Fix For: 0.15.0 > > > This will help with testing and proofing the website. > I have noticed that there are bugs in the website when the baseurl is not a > foo.bar.baz, e.g. if you deploy at root foo.bar.baz/test-site many images and > links are broken -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4470) [Python] Pyarrow using considerably more memory when reading partitioned Parquet file
[ https://issues.apache.org/jira/browse/ARROW-4470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4470: Labels: datasets parquet (was: parquet) > [Python] Pyarrow using considerable more memory when reading partitioned > Parquet file > - > > Key: ARROW-4470 > URL: https://issues.apache.org/jira/browse/ARROW-4470 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.12.0 >Reporter: Ivan SPM >Priority: Major > Labels: datasets, parquet > Fix For: 0.14.0 > > > Hi, > I have a partitioned Parquet table in Impala in HDFS, using Hive metastore, > with the following structure: > {{/data/myparquettable/year=2016}}{{/data/myparquettable/year=2016/myfile_1.prt}} > {{/data/myparquettable/year=2016/myfile_2.prt}} > {{/data/myparquettable/year=2016/myfile_3.prt}} > {{/data/myparquettable/year=2017}} > {{/data/myparquettable/year=2017/myfile_1.prt}} > {{/data/myparquettable/year=2017/myfile_2.prt}} > {{/data/myparquettable/year=2017/myfile_3.prt}} > and so on. I need to work with one partition, so I copied one partition to a > local filesystem: > {{hdfs fs -get /data/myparquettable/year=2017 /local/}} > so now I have some data on the local disk: > {{/local/year=2017/myfile_1.prt }}{{/local/year=2017/myfile_2.prt }} > etc.I tried to read it using Pyarrow: > {{import pyarrow.parquet as pq}}{{pq.read_parquet('/local/year=2017')}} > and it starts reading. The problem is that the local Parquet files are around > 15GB total, and I blew up my machine memory a couple of times because when > reading these files, Pyarrow is using more than 60GB of RAM, and I'm not sure > how much it will take because it never finishes. Is this expected? Is there a > workaround? > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-4447) [C++] Investigate dynamic linking for libthrift
[ https://issues.apache.org/jira/browse/ARROW-4447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-4447. - Resolution: Fixed Assignee: Uwe L. Korn Thrift is now dynamically linked > [C++] Investigate dynamic linking for libthrift > -- > > Key: ARROW-4447 > URL: https://issues.apache.org/jira/browse/ARROW-4447 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Uwe L. Korn >Assignee: Uwe L. Korn >Priority: Major > Fix For: 0.14.0 > > > We're currently only linking statically against {{libthrift}} . Distributions > would often prefer a dynamic linkage to libraries where possible. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4453) [Python] Create Cython wrappers for SparseTensor
[ https://issues.apache.org/jira/browse/ARROW-4453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4453: Fix Version/s: (was: 0.14.0) > [Python] Create Cython wrappers for SparseTensor > > > Key: ARROW-4453 > URL: https://issues.apache.org/jira/browse/ARROW-4453 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Philipp Moritz >Assignee: Rok Mihevc >Priority: Minor > > We should have cython wrappers for [https://github.com/apache/arrow/pull/2546] > This is related to support for > https://issues.apache.org/jira/browse/ARROW-4223 and > https://issues.apache.org/jira/browse/ARROW-4224 > I imagine the code would be similar to > https://github.com/apache/arrow/blob/5a502d281545402240e818d5fd97a9aaf36363f2/python/pyarrow/array.pxi#L748 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-4439) [C++] Improve FindBrotli.cmake
[ https://issues.apache.org/jira/browse/ARROW-4439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852586#comment-16852586 ] Wes McKinney commented on ARROW-4439: - [~rip@gmail.com] is this OK in master now? > [C++] Improve FindBrotli.cmake > -- > > Key: ARROW-4439 > URL: https://issues.apache.org/jira/browse/ARROW-4439 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Renat Valiullin >Assignee: Renat Valiullin >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > Time Spent: 4h 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4759) [Rust] [DataFusion] It should be possible to share an execution context between threads
[ https://issues.apache.org/jira/browse/ARROW-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4759: Fix Version/s: (was: 0.14.0) > [Rust] [DataFusion] It should be possible to share an execution context > between threads > --- > > Key: ARROW-4759 > URL: https://issues.apache.org/jira/browse/ARROW-4759 > Project: Apache Arrow > Issue Type: Sub-task > Components: Rust, Rust - DataFusion >Affects Versions: 0.12.0 >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > > I am working on a PR for this now. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4429) Add git rebase tips to the 'Contributing' page in the developer docs
[ https://issues.apache.org/jira/browse/ARROW-4429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4429: Fix Version/s: (was: 0.14.0) > Add git rebase tips to the 'Contributing' page in the developer docs > > > Key: ARROW-4429 > URL: https://issues.apache.org/jira/browse/ARROW-4429 > Project: Apache Arrow > Issue Type: Task > Components: Documentation >Reporter: Tanya Schlusser >Priority: Major > > A recent discussion on the listserv (link below) asked about how contributors > should handle rebasing. It would be helpful if the tips made it into the > developer documentation somehow. I suggest in the ["Contributing to Apache > Arrow"|https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow] > page—currently a wiki, but hopefully eventually part of the Sphinx docs > ARROW-4427. > Here is the relevant thread: > [https://lists.apache.org/thread.html/c74d8027184550b8d9041e3f2414b517ffb76ccbc1d5aa4563d364b6@%3Cdev.arrow.apache.org%3E] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-5453) [C++] Just-released cmake-format 0.5.2 breaks the build
[ https://issues.apache.org/jira/browse/ARROW-5453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-5453. - Resolution: Fixed Issue resolved by pull request 4423 [https://github.com/apache/arrow/pull/4423] > [C++] Just-released cmake-format 0.5.2 breaks the build > --- > > Key: ARROW-5453 > URL: https://issues.apache.org/jira/browse/ARROW-5453 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Blocker > Labels: pull-request-available > Fix For: 0.14.0 > > Time Spent: 20m > Remaining Estimate: 0h > > It seems we should always pin the cmake-format version until the developers > stop changing the formatting algorithm -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5455) [Rust] Build broken by 2019-05-30 Rust nightly
Wes McKinney created ARROW-5455: --- Summary: [Rust] Build broken by 2019-05-30 Rust nightly Key: ARROW-5455 URL: https://issues.apache.org/jira/browse/ARROW-5455 Project: Apache Arrow Issue Type: Bug Components: Rust Reporter: Wes McKinney Fix For: 0.14.0 See the example failed build https://travis-ci.org/apache/arrow/jobs/539477452 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-5455) [Rust] Build broken by 2019-05-30 Rust nightly
[ https://issues.apache.org/jira/browse/ARROW-5455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-5455: Priority: Blocker (was: Major) > [Rust] Build broken by 2019-05-30 Rust nightly > -- > > Key: ARROW-5455 > URL: https://issues.apache.org/jira/browse/ARROW-5455 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Reporter: Wes McKinney >Priority: Blocker > Fix For: 0.14.0 > > > See the example failed build > https://travis-ci.org/apache/arrow/jobs/539477452 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-4419) [Flight] Deal with body buffers in FlightData
[ https://issues.apache.org/jira/browse/ARROW-4419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852580#comment-16852580 ] Wes McKinney commented on ARROW-4419: - [~lidavidm] where does this issue stand? > [Flight] Deal with body buffers in FlightData > - > > Key: ARROW-4419 > URL: https://issues.apache.org/jira/browse/ARROW-4419 > Project: Apache Arrow > Issue Type: Improvement > Components: FlightRPC >Reporter: David Li >Priority: Minor > Labels: flight > Fix For: 0.14.0 > > > The Java implementation will fail to decode a schema message if the message > also contains (empty) body buffers (see ArrowMessage.asSchema's precondition > checks). However, clients using default Protobuf serialization will likely > write an empty body buffer by default. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4398) [Python] Add benchmarks for Arrow<>Parquet BYTE_ARRAY serialization (read and write)
[ https://issues.apache.org/jira/browse/ARROW-4398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4398: Fix Version/s: (was: 0.14.0) 0.15.0 > [Python] Add benchmarks for Arrow<>Parquet BYTE_ARRAY serialization (read and > write) > > > Key: ARROW-4398 > URL: https://issues.apache.org/jira/browse/ARROW-4398 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Priority: Major > Labels: parquet > Fix For: 0.15.0 > > > This is follow-on work to PARQUET-1508, so we can monitor the performance of > this operation over time -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-4369) [Packaging] Release verification script should test linux packages via docker
[ https://issues.apache.org/jira/browse/ARROW-4369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852579#comment-16852579 ] Wes McKinney commented on ARROW-4369: - [~kszucs] any thoughts about this for 0.14? We can also postpone > [Packaging] Release verification script should test linux packages via docker > - > > Key: ARROW-4369 > URL: https://issues.apache.org/jira/browse/ARROW-4369 > Project: Apache Arrow > Issue Type: Improvement > Components: Packaging >Reporter: Krisztian Szucs >Priority: Major > Fix For: 0.14.0 > > > It shouldn't be too hard to create a verification script which checks the > linux packages. This could prevent issues like [ARROW-4368] / > [https://github.com/apache/arrow/issues/3476] > I suggest to separate the current verification script into one which verifies > the source release artifact and another which verifies the binaries: > * checksum and signatures as is right now > * install linux packages on multiple distros via docker > We could test wheels and conda packages as well, but in follow-up PRs. > > cc [~kou] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4409) [C++] Enable arrow::ipc internal JSON reader to read from a file path
[ https://issues.apache.org/jira/browse/ARROW-4409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4409: Fix Version/s: (was: 0.14.0) > [C++] Enable arrow::ipc internal JSON reader to read from a file path > - > > Key: ARROW-4409 > URL: https://issues.apache.org/jira/browse/ARROW-4409 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Minor > > This may make tests easier to write. Currently an input buffer is required, > so reading from a file requires some boilerplate -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-4343) [C++] Add as complete as possible Ubuntu Trusty / 14.04 build to docker-compose setup
[ https://issues.apache.org/jira/browse/ARROW-4343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852576#comment-16852576 ] Wes McKinney commented on ARROW-4343: - What does it mean now that Ubuntu Trusty is no longer an LTS release? > [C++] Add as complete as possible Ubuntu Trusty / 14.04 build to > docker-compose setup > - > > Key: ARROW-4343 > URL: https://issues.apache.org/jira/browse/ARROW-4343 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > Fix For: 0.14.0 > > > Until we formally stop supporting Trusty it would be useful to be able to > verify in Docker that builds work there. I still have an Ubuntu 14.04 machine > that I use (and I've been filing bugs that I find on it) but not sure for how > much longer -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-4350) [Python] nested numpy arrays
[ https://issues.apache.org/jira/browse/ARROW-4350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852577#comment-16852577 ] Wes McKinney commented on ARROW-4350: - [~jorisvandenbossche] could you take a look and maybe clarify the issue title etc.? > [Python] nested numpy arrays > > > Key: ARROW-4350 > URL: https://issues.apache.org/jira/browse/ARROW-4350 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.11.1, 0.12.0 >Reporter: yu peng >Priority: Major > Fix For: 0.14.0 > > > {code:java} > In [19]: df = pd.DataFrame({'a': [[[1], [2]], [[2], [3]]], 'b': [1, 2]}) > In [20]: df.iloc[0].to_dict() > Out[20]: {'a': [[1], [2]], 'b': 1} > In [21]: pa.Table.from_pandas(df).to_pandas().iloc[0].to_dict() > Out[21]: {'a': array([array([1]), array([2])], dtype=object), 'b': 1} > In [24]: np.array(df.iloc[0].to_dict()['a']).shape > Out[24]: (2, 1) > In [25]: pa.Table.from_pandas(df).to_pandas().iloc[0].to_dict()['a'].shape > Out[25]: (2,) > {code} > Adding extra array type is not functioning as expected. 
> > More importantly, this would fail > > {code:java} > In [108]: df = pd.DataFrame({'a': [[[1, 2],[2, 3]], [[1,2], [2, 3]]], 'b': > [[1, 2],[2, 3]]}) > In [109]: df > Out[109]: > a b > 0 [[1, 2], [2, 3]] [1, 2] > 1 [[1, 2], [2, 3]] [2, 3] > In [110]: pa.Table.from_pandas(pa.Table.from_pandas(df).to_pandas()) > --- > ArrowTypeError Traceback (most recent call last) > in () > > 1 pa.Table.from_pandas(pa.Table.from_pandas(df).to_pandas()) > /Users/pengyu/.pyenv/virtualenvs/starscream/2.7.11/lib/python2.7/site-packages/pyarrow/table.pxi > in pyarrow.lib.Table.from_pandas() > 1215 > 1216 """ > -> 1217 names, arrays, metadata = pdcompat.dataframe_to_arrays( > 1218 df, > 1219 schema=schema, > /Users/pengyu/.pyenv/virtualenvs/starscream/2.7.11/lib/python2.7/site-packages/pyarrow/pandas_compat.pyc > in dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe) > 379 arrays = [convert_column(c, t) > 380 for c, t in zip(columns_to_convert, > --> 381 convert_types)] > 382 else: > 383 from concurrent import futures > /Users/pengyu/.pyenv/virtualenvs/starscream/2.7.11/lib/python2.7/site-packages/pyarrow/pandas_compat.pyc > in convert_column(col, ty) > 374 e.args += ("Conversion failed for column {0!s} with type {1!s}" > 375 .format(col.name, col.dtype),) > --> 376 raise e > 377 > 378 if nthreads == 1: > ArrowTypeError: ('only size-1 arrays can be converted to Python scalars', > 'Conversion failed for column a with type object') > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-4324) [Python] Array dtype inference incorrect when created from list of mixed numpy scalars
[ https://issues.apache.org/jira/browse/ARROW-4324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852574#comment-16852574 ] Wes McKinney commented on ARROW-4324: - [~jorisvandenbossche] could you take a look? > [Python] Array dtype inference incorrect when created from list of mixed > numpy scalars > -- > > Key: ARROW-4324 > URL: https://issues.apache.org/jira/browse/ARROW-4324 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.11.1 >Reporter: Keith Kraus >Priority: Minor > Fix For: 0.14.0 > > > Minimal reproducer: > {code:python} > import pyarrow as pa > import numpy as np > test_list = [np.dtype('int32').type(10), np.dtype('float32').type(0.5)] > test_array = pa.array(test_list) > # Expected > # test_array > # > # [ > # 10, > # 0.5 > # ] > # Got > # test_array > # > # [ > # 10, > # 0 > # ] > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4333) [C++] Sketch out design for kernels and "query" execution in compute layer
[ https://issues.apache.org/jira/browse/ARROW-4333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4333: Fix Version/s: (was: 0.14.0) > [C++] Sketch out design for kernels and "query" execution in compute layer > -- > > Key: ARROW-4333 > URL: https://issues.apache.org/jira/browse/ARROW-4333 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Micah Kornfield >Priority: Major > Labels: analytics > > It would be good to formalize the design of kernels and the controlling query > execution layer (e.g. volcano batch model?) to understand the following: > Contracts for kernels: > * Thread safety of kernels? > * When should kernels allocate memory vs. expect preallocated memory? How to > communicate requirements for a kernel's memory allocation? > * How to communicate whether a kernel's execution is parallelizable > across a ChunkedArray? How to determine if the order of execution across a > ChunkedArray is important? > * How to communicate when it is safe to re-use the same buffers as input > and output to the same kernel? > What does the threading model look like for the higher level of control? > Where should synchronization happen? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
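The "volcano batch model" mentioned in the issue can be sketched in a few lines: operators pull batches from their children and apply a kernel per batch. All class and method names below are hypothetical; this is not Arrow's compute layer, just an illustration of the execution shape being discussed.

```python
# Minimal volcano-style batch executor. A "ChunkedArray" is modeled as a
# list of lists; the kernel here is elementwise, hence order-independent
# and trivially parallelizable across chunks.

class ScanNode:
    def __init__(self, chunks):
        self.chunks = chunks              # list of batches
    def batches(self):
        yield from self.chunks

class ProjectNode:
    def __init__(self, child, kernel):
        self.child = child
        self.kernel = kernel              # per-element kernel function
    def batches(self):
        # Pull one batch at a time from the child, apply the kernel.
        for batch in self.child.batches():
            yield [self.kernel(x) for x in batch]

plan = ProjectNode(ScanNode([[1, 2], [3]]), kernel=lambda x: x * 2)
result = [x for batch in plan.batches() for x in batch]   # [2, 4, 6]
```

The contract questions in the issue (thread safety, preallocation, buffer reuse) live inside `kernel` and `batches` in this sketch; a real design would make them explicit kernel properties.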
[jira] [Updated] (ARROW-4337) [C#] Array / RecordBatch Builder Fluent API
[ https://issues.apache.org/jira/browse/ARROW-4337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4337: Fix Version/s: (was: 0.14.0) > [C#] Array / RecordBatch Builder Fluent API > --- > > Key: ARROW-4337 > URL: https://issues.apache.org/jira/browse/ARROW-4337 > Project: Apache Arrow > Issue Type: Improvement > Components: C# >Reporter: Chris Hutchinson >Assignee: Chris Hutchinson >Priority: Major > Labels: c#, pull-request-available > Original Estimate: 12h > Time Spent: 5h 10m > Remaining Estimate: 6h 50m > > Implement a fluent API for building arrays and record batches from Arrow > buffers, flat arrays, spans, enumerables, etc. > A future implementation could extend this API with support for ADO.NET > DataTables. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
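The requested API is for C#; as a language-neutral illustration of the "fluent" shape (each call returns the builder so calls chain), here is a hypothetical Python sketch -- the actual C# method names may differ.

```python
class ArrayBuilder:
    """Hypothetical fluent builder: every mutator returns self."""
    def __init__(self):
        self._values = []

    def append(self, value):
        self._values.append(value)
        return self                      # enables chaining

    def append_range(self, values):
        self._values.extend(values)      # bulk append from any iterable
        return self

    def build(self):
        return list(self._values)        # stand-in for the finished array

arr = ArrayBuilder().append(1).append_range([2, 3]).build()   # [1, 2, 3]
```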
[jira] [Updated] (ARROW-4309) [Release] gen_apidocs docker-compose task is out of date
[ https://issues.apache.org/jira/browse/ARROW-4309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4309: Fix Version/s: (was: 0.14.0) > [Release] gen_apidocs docker-compose task is out of date > > > Key: ARROW-4309 > URL: https://issues.apache.org/jira/browse/ARROW-4309 > Project: Apache Arrow > Issue Type: Bug > Components: Developer Tools, Documentation >Reporter: Wes McKinney >Priority: Major > Labels: docker > > This needs to be updated to build with CUDA support (which in turn will > require the host machine to have nvidia-docker), among other things -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-4302) [C++] Add OpenSSL to C++ build toolchain
[ https://issues.apache.org/jira/browse/ARROW-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-4302. - Resolution: Fixed > [C++] Add OpenSSL to C++ build toolchain > > > Key: ARROW-4302 > URL: https://issues.apache.org/jira/browse/ARROW-4302 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Assignee: Deepak Majeti >Priority: Major > Labels: parquet, pull-request-available > Fix For: 0.14.0 > > Time Spent: 1.5h > Remaining Estimate: 0h > > This is needed for encryption support for Parquet, among other things. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-4301) [Java][Gandiva] Maven snapshot version update does not seem to update Gandiva submodule
[ https://issues.apache.org/jira/browse/ARROW-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852571#comment-16852571 ] Wes McKinney commented on ARROW-4301: - [~pravindra] any ideas about this? This will bite us again in 0.14 if it is not fixed > [Java][Gandiva] Maven snapshot version update does not seem to update Gandiva > submodule > --- > > Key: ARROW-4301 > URL: https://issues.apache.org/jira/browse/ARROW-4301 > Project: Apache Arrow > Issue Type: Bug > Components: C++ - Gandiva, Java >Reporter: Wes McKinney >Assignee: Praveen Kumar Desabandu >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > Time Spent: 1h > Remaining Estimate: 0h > > See > https://github.com/apache/arrow/commit/a486db8c1476be1165981c4fe22996639da8e550. > This is breaking the build so I'm going to patch manually -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4286) [C++/R] Namespace vendored Boost
[ https://issues.apache.org/jira/browse/ARROW-4286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4286: Fix Version/s: (was: 0.14.0) > [C++/R] Namespace vendored Boost > > > Key: ARROW-4286 > URL: https://issues.apache.org/jira/browse/ARROW-4286 > Project: Apache Arrow > Issue Type: New Feature > Components: C++, Packaging, R >Reporter: Uwe L. Korn >Priority: Major > > For R, we vendor Boost and thus also include the symbols privately in our > modules. While they are private, some things like virtual destructors can > still interfere with other packages that vendor Boost. We should also > namespace the vendored Boost as we do in the manylinux1 packaging: > https://github.com/apache/arrow/blob/0f8bd747468dd28c909ef823bed77d8082a5b373/python/manylinux1/scripts/build_boost.sh#L28 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4217) [Plasma] Remove custom object metadata
[ https://issues.apache.org/jira/browse/ARROW-4217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4217: Fix Version/s: (was: 0.14.0) > [Plasma] Remove custom object metadata > -- > > Key: ARROW-4217 > URL: https://issues.apache.org/jira/browse/ARROW-4217 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ - Plasma >Affects Versions: 0.11.1 >Reporter: Philipp Moritz >Assignee: Philipp Moritz >Priority: Minor > > Currently, Plasma supports custom metadata for objects. This doesn't seem to > be used at the moment, and removing it will simplify the interface and > implementation of plasma. Removing the custom metadata will also make > eviction to other blob stores easier (most other stores don't support custom > metadata). > My personal use case was to store arrow schemata in there, but they are now > stored as part of the object itself. > If nobody else is using this, I'd suggest removing it. If people really want > metadata, they could always store it as a separate object if desired. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-4220) [Python] Add buffered input and output stream ASV benchmarks with simulated high latency IO
[ https://issues.apache.org/jira/browse/ARROW-4220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852570#comment-16852570 ] Wes McKinney commented on ARROW-4220: - cc [~jorisvandenbossche] > [Python] Add buffered input and output stream ASV benchmarks with simulated > high latency IO > --- > > Key: ARROW-4220 > URL: https://issues.apache.org/jira/browse/ARROW-4220 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Priority: Major > Fix For: 0.14.0 > > > Follow up to ARROW-3126 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4283) [Python] Should RecordBatchStreamReader/Writer be AsyncIterable?
[ https://issues.apache.org/jira/browse/ARROW-4283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4283: Fix Version/s: (was: 0.14.0) > [Python] Should RecordBatchStreamReader/Writer be AsyncIterable? > > > Key: ARROW-4283 > URL: https://issues.apache.org/jira/browse/ARROW-4283 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Paul Taylor >Priority: Minor > > Filing this issue after a discussion today with [~xhochy] about how to > implement streaming pyarrow http services. I had attempted to use both Flask > and [aiohttp|https://aiohttp.readthedocs.io/en/stable/streams.html]'s > streaming interfaces because they seemed familiar, but no dice. I have no > idea how hard this would be to add -- supporting all the asynciterable > primitives in JS was non-trivial. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
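pyarrow does not expose an async-iterable reader today; as a sketch of what the requested interface could look like, here is a hypothetical wrapper that adapts a batch source to Python's async-iteration protocol (all names are invented for illustration).

```python
import asyncio

class AsyncBatchReader:
    """Hypothetical async iterator over record batches."""
    def __init__(self, batches):
        self._batches = list(batches)

    def __aiter__(self):
        return self

    async def __anext__(self):
        if not self._batches:
            raise StopAsyncIteration
        await asyncio.sleep(0)        # yield to the event loop, as real IO would
        return self._batches.pop(0)

async def consume(reader):
    # `async for` is exactly what aiohttp-style streaming handlers want.
    return [b async for b in reader]

out = asyncio.run(consume(AsyncBatchReader(["batch0", "batch1"])))
```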
[jira] [Updated] (ARROW-4259) [Plasma] CI failure in test_plasma_tf_op
[ https://issues.apache.org/jira/browse/ARROW-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4259: Fix Version/s: (was: 0.14.0) > [Plasma] CI failure in test_plasma_tf_op > > > Key: ARROW-4259 > URL: https://issues.apache.org/jira/browse/ARROW-4259 > Project: Apache Arrow > Issue Type: Bug > Components: C++ - Plasma, Continuous Integration, Python >Reporter: Wes McKinney >Priority: Major > Labels: ci-failure > > Recently-appeared failure on master: > https://travis-ci.org/apache/arrow/jobs/479378188#L7108 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4208) [CI/Python] Have automated tests for S3
[ https://issues.apache.org/jira/browse/ARROW-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4208: Labels: filesystem s3 (was: s3) > [CI/Python] Have automated tests for S3 > - > > Key: ARROW-4208 > URL: https://issues.apache.org/jira/browse/ARROW-4208 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration, Python >Reporter: Krisztian Szucs >Priority: Major > Labels: filesystem, s3 > Fix For: 0.14.0 > > > Currently we don't run S3 integration tests regularly. > Possible solutions: > - mock it within python/pytest > - simply run the s3 tests with an S3 credential provided > - create a hdfs-integration like docker-compose setup and run an S3 mock > server (e.g.: https://github.com/adobe/S3Mock, > https://github.com/jubos/fake-s3, https://github.com/gaul/s3proxy, > https://github.com/jserver/mock-s3) > For more see discussion https://github.com/apache/arrow/pull/3286 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4208) [CI/Python] Have automated tests for S3
[ https://issues.apache.org/jira/browse/ARROW-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4208: Fix Version/s: (was: 0.14.0) 0.15.0 > [CI/Python] Have automated tests for S3 > - > > Key: ARROW-4208 > URL: https://issues.apache.org/jira/browse/ARROW-4208 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration, Python >Reporter: Krisztian Szucs >Priority: Major > Labels: filesystem, s3 > Fix For: 0.15.0 > > > Currently we don't run S3 integration tests regularly. > Possible solutions: > - mock it within python/pytest > - simply run the s3 tests with an S3 credential provided > - create a hdfs-integration like docker-compose setup and run an S3 mock > server (e.g.: https://github.com/adobe/S3Mock, > https://github.com/jubos/fake-s3, https://github.com/gaul/s3proxy, > https://github.com/jserver/mock-s3) > For more see discussion https://github.com/apache/arrow/pull/3286 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4202) [Gandiva] use ArrayFromJson in tests
[ https://issues.apache.org/jira/browse/ARROW-4202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4202: Fix Version/s: (was: 0.14.0) > [Gandiva] use ArrayFromJson in tests > > > Key: ARROW-4202 > URL: https://issues.apache.org/jira/browse/ARROW-4202 > Project: Apache Arrow > Issue Type: Task > Components: C++ - Gandiva >Reporter: Pindikura Ravindra >Priority: Major > > Most of the gandiva tests use wrappers over ArrayFromVector. These will > become a lot more readable if we switch to ArrayFromJSON. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4146) [C++] Extend visitor functions to include ArrayBuilder and allow callable visitors
[ https://issues.apache.org/jira/browse/ARROW-4146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4146: Fix Version/s: (was: 0.14.0) > [C++] Extend visitor functions to include ArrayBuilder and allow callable > visitors > -- > > Key: ARROW-4146 > URL: https://issues.apache.org/jira/browse/ARROW-4146 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Benjamin Kietzman >Priority: Minor > > In addition to accepting objects with Visit methods for the visited type, > {{Visit(Array|Type)}} and {{Visit(Array|Type)Inline}} should accept objects > with overloaded call operators. > In addition for inline visitation if a visitor can only visit one of the > potential unboxings then this can be detected at compile time and the full > type_id switch can be avoided (if the unboxed object cannot be visited then > do nothing). For example: > {code} > VisitTypeInline(some_type, [](const StructType& s) { > // only execute this if some_type.id() == Type::STRUCT > }); > {code} > Finally, visit functions should be added for visiting ArrayBuilders -- This message was sent by Atlassian JIRA (v7.6.3#76005)
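The C++ proposal detects at compile time whether a callable visitor can handle a given unboxed type; a runtime Python analogue of the same "visit if handled, otherwise do nothing" behavior can be sketched as follows (the `handles` parameter stands in for the compile-time check of the call operator; all names are illustrative, not Arrow's API).

```python
# Two toy type classes standing in for arrow::StructType / arrow::Int32Type.
class StructType: pass
class Int32Type: pass

def visit_type_inline(ty, visitor, handles):
    """Invoke `visitor` only if it declares it can handle type(ty)."""
    if isinstance(ty, handles):
        visitor(ty)
    # else: the unboxed object cannot be visited -> silently do nothing,
    # mirroring the proposed behavior when the call operator doesn't match.

seen = []
visit_type_inline(StructType(), lambda t: seen.append("struct"), handles=StructType)
visit_type_inline(Int32Type(), lambda t: seen.append("struct"), handles=StructType)
# seen == ["struct"]: only the StructType visit fired
```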
[jira] [Updated] (ARROW-4201) [C++][Gandiva] integrate test utils with arrow
[ https://issues.apache.org/jira/browse/ARROW-4201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4201: Fix Version/s: (was: 0.14.0) > [C++][Gandiva] integrate test utils with arrow > -- > > Key: ARROW-4201 > URL: https://issues.apache.org/jira/browse/ARROW-4201 > Project: Apache Arrow > Issue Type: Task > Components: C++ - Gandiva >Reporter: Pindikura Ravindra >Priority: Major > > The following tasks to be addressed as part of this Jira : > # move (or consolidate) data generators in generate_data.h to arrow > # move convenience fns in gandiva/tests/test_util.h to arrow > # move (or consolidate) EXPECT_ARROW_* fns to arrow -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4095) [C++] Implement optimizations for dictionary unification where dictionaries are prefixes of the unified dictionary
[ https://issues.apache.org/jira/browse/ARROW-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4095: Fix Version/s: (was: 0.14.0) 0.15.0 > [C++] Implement optimizations for dictionary unification where dictionaries > are prefixes of the unified dictionary > -- > > Key: ARROW-4095 > URL: https://issues.apache.org/jira/browse/ARROW-4095 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > Fix For: 0.15.0 > > > In the event that the unified dictionary contains other dictionaries as > prefixes (e.g. as the result of delta dictionaries), we can avoid memory > allocation and index transposition. > See discussion at > https://github.com/apache/arrow/pull/3165#discussion_r243020982 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
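The optimization described above can be shown in a small sketch: when a chunk's dictionary is a prefix of the unified dictionary (the delta-dictionary case), its indices are already valid against the unified dictionary, so no allocation or index transposition is needed. This is a pure-Python illustration, not Arrow's C++ implementation.

```python
def is_prefix(dictionary, unified):
    return dictionary == unified[:len(dictionary)]

def transpose_indices(indices, dictionary, unified):
    """Remap dictionary indices into the unified dictionary's index space."""
    if is_prefix(dictionary, unified):
        return indices                      # fast path: reuse as-is
    # Slow path: build old-value -> new-index mapping and remap.
    mapping = {v: i for i, v in enumerate(unified)}
    return [mapping[dictionary[i]] for i in indices]

unified = ["a", "b", "c"]
assert transpose_indices([0, 1], ["a", "b"], unified) == [0, 1]   # prefix: reused
assert transpose_indices([0, 1], ["c", "a"], unified) == [2, 0]   # remapped
```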
[jira] [Updated] (ARROW-4133) [C++/Python] ORC adapter should fail gracefully if /etc/timezone is missing instead of aborting
[ https://issues.apache.org/jira/browse/ARROW-4133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4133: Fix Version/s: (was: 0.14.0) > [C++/Python] ORC adapter should fail gracefully if /etc/timezone is missing > instead of aborting > --- > > Key: ARROW-4133 > URL: https://issues.apache.org/jira/browse/ARROW-4133 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Reporter: Krisztian Szucs >Priority: Major > Labels: orc > > The following core was genereted by nightly build: > https://travis-ci.org/kszucs/crossbow/builds/473397855 > {code} > Core was generated by `/opt/conda/bin/python /opt/conda/bin/pytest -v > --pyargs pyarrow'. > Program terminated with signal SIGABRT, Aborted. > #0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51 > 51 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory. > [Current thread is 1 (Thread 0x7fea61f9e740 (LWP 179))] > (gdb) bt > #0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51 > #1 0x7fea608c8801 in __GI_abort () at abort.c:79 > #2 0x7fea4b3483df in __gnu_cxx::__verbose_terminate_handler () > at > /opt/conda/conda-bld/compilers_linux-64_1534514838838/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/libsupc++/vterminate.cc:95 > #3 0x7fea4b346b16 in __cxxabiv1::__terminate (handler=) > at > /opt/conda/conda-bld/compilers_linux-64_1534514838838/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:47 > #4 0x7fea4b346b4c in std::terminate () > at > /opt/conda/conda-bld/compilers_linux-64_1534514838838/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:57 > #5 0x7fea4b346d28 in __cxxabiv1::__cxa_throw (obj=0x2039220, > tinfo=0x7fea494803d0 , > dest=0x7fea49087e52 ) > at > /opt/conda/conda-bld/compilers_linux-64_1534514838838/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/libsupc++/eh_throw.cc:95 > #6 0x7fea49086824 in 
orc::getTimezoneByFilename (filename=...) > at /build/cpp/orc_ep-prefix/src/orc_ep/c++/src/Timezone.cc:704 > #7 0x7fea490868d2 in orc::getLocalTimezone () at > /build/cpp/orc_ep-prefix/src/orc_ep/c++/src/Timezone.cc:713 > > #8 0x7fea49063e59 in > orc::RowReaderImpl::RowReaderImpl (this=0x204fe30, _contents=..., opts=...) > at /build/cpp/orc_ep-prefix/src/orc_ep/c++/src/Reader.cc:185 > #9 0x7fea4906651e in orc::ReaderImpl::createRowReader (this=0x1fb41b0, > opts=...) > at /build/cpp/orc_ep-prefix/src/orc_ep/c++/src/Reader.cc:630 > #10 0x7fea48c2d904 in > arrow::adapters::orc::ORCFileReader::Impl::ReadSchema (this=0x1270600, > opts=..., > > out=0x7ffe0ccae7b0) at /arrow/cpp/src/arrow/adapters/orc/adapter.cc:264 > #11 0x7fea48c2e18d in arrow::adapters::orc::ORCFileReader::Impl::Read > (this=0x1270600, out=0x7ffe0ccaea00) > at /arrow/cpp/src/arrow/adapters/orc/adapter.cc:302 > #12 0x7fea48c2a8b9 in arrow::adapters::orc::ORCFileReader::Read > (this=0x1e14d10, out=0x7ffe0ccaea00) > at /arrow/cpp/src/arrow/adapters/orc/adapter.cc:697 > > > #13 0x7fea48218c9d in __pyx_pf_7pyarrow_4_orc_9ORCReader_12read > (__pyx_v_self=0x7fea43de8688, > __pyx_v_include_indices=0x7fea61d07b70 <_Py_NoneStruct>) at _orc.cpp:3865 > #14 0x7fea48218b31 in __pyx_pw_7pyarrow_4_orc_9ORCReader_13read > (__pyx_v_self=0x7fea43de8688, > __pyx_args=0x7fea61f5e048, __pyx_kwds=0x7fea444f78b8) at _orc.cpp:3813 > #15 0x7fea61910cbd in _PyCFunction_FastCallDict > (func_obj=func_obj@entry=0x7fea444b9558, > args=args@entry=0x7fea44a40fa8, nargs=nargs@entry=0, > kwargs=kwargs@entry=0x7fea444f78b8) > at Objects/methodobject.c:231 > #16 0x7fea61910f16 in _PyCFunction_FastCallKeywords > (func=func@entry=0x7fea444b9558, > stack=stack@entry=0x7fea44a40fa8, nargs=0, > kwnames=kwnames@entry=0x7fea47d81d30) at Objects/methodobject.c:294 > #17 0x7fea619aa0da in call_function > (pp_stack=pp_stack@entry=0x7ffe0ccaecf0, oparg=, > kwnames=kwnames@entry=0x7fea47d81d30) at Python/ceval.c:4837 > #18 0x7fea619abb46 in 
_PyEval_EvalFrameDefault (f=, > throwflag=) > at Python/ceval.c:3351 > #19 0x7fea619a9cde in _PyEval_EvalCodeWithName (_co=0x7fea47d9f6f0, >
[jira] [Updated] (ARROW-4090) [Python] Table.flatten() doesn't work recursively
[ https://issues.apache.org/jira/browse/ARROW-4090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4090: Fix Version/s: (was: 0.14.0) > [Python] Table.flatten() doesn't work recursively > - > > Key: ARROW-4090 > URL: https://issues.apache.org/jira/browse/ARROW-4090 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Reporter: Francisco Sanchez >Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > It seems that the pyarrow.Table.flatten() function is not working recursively > nor providing a parameter to do it. > {code} > test1c_data = {'level1-A': 'abc', >'level1-B': 112233, >'level1-C': {'x': 123.111, 'y': 123.222, 'z': 123.333} > } > test1c_type = pa.struct([('level1-A', pa.string()), > ('level1-B', pa.int32()), > ('level1-C', pa.struct([('x', pa.float64()), > ('y', pa.float64()), > ('z', pa.float64()) > ])) > ]) > test1c_array = pa.array([test1c_data]*5, type=test1c_type) > test1c_table = pa.Table.from_arrays([test1c_array], names=['msg']) > print('{}\n\n{}\n\n{}'.format(test1c_table.schema, > test1c_table.flatten().schema, > test1c_table.flatten().flatten().schema)) > {code} > output: > {quote}msg: struct<level1-A: string, level1-B: int32, level1-C: struct<x: double, y: double, z: double>> > child 0, level1-A: string > child 1, level1-B: int32 > child 2, level1-C: struct<x: double, y: double, z: double> > child 0, x: double > child 1, y: double > child 2, z: double > msg.level1-A: string > msg.level1-B: int32 > msg.level1-C: struct<x: double, y: double, z: double> > child 0, x: double > child 1, y: double > child 2, z: double > msg.level1-A: string > msg.level1-B: int32 > msg.level1-C.x: double > msg.level1-C.y: double > msg.level1-C.z: double > {quote} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
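The recursive flattening the reporter expects (what calling `flatten()` repeatedly achieves) can be sketched on a plain nested-dict schema; this is an illustration of the desired semantics, not pyarrow's implementation.

```python
def flatten_schema(schema, parent=""):
    """Recursively flatten a nested {name: type-or-dict} schema into
    dotted leaf names, so one call fully unnests every struct level."""
    out = {}
    for name, ty in schema.items():
        full = f"{parent}.{name}" if parent else name
        if isinstance(ty, dict):                 # struct field: recurse
            out.update(flatten_schema(ty, full))
        else:
            out[full] = ty
    return out

# The reporter's schema from the issue, as plain dicts:
schema = {"msg": {"level1-A": "string", "level1-B": "int32",
                  "level1-C": {"x": "double", "y": "double", "z": "double"}}}
flat = flatten_schema(schema)
# {"msg.level1-A": "string", "msg.level1-B": "int32",
#  "msg.level1-C.x": "double", "msg.level1-C.y": "double", "msg.level1-C.z": "double"}
```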
[jira] [Commented] (ARROW-4083) [C++] Allowing ChunkedArrays to contain a mix of DictionaryArray and dense Array (of the dictionary type)
[ https://issues.apache.org/jira/browse/ARROW-4083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852569#comment-16852569 ] Wes McKinney commented on ARROW-4083: - I think this could be addressed at the dataframe level, removing from any milestone for now > [C++] Allowing ChunkedArrays to contain a mix of DictionaryArray and dense > Array (of the dictionary type) > - > > Key: ARROW-4083 > URL: https://issues.apache.org/jira/browse/ARROW-4083 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: dataframe > > In some applications we may receive a stream of some dictionary encoded data > followed by some non-dictionary encoded data. For example this happens in > Parquet files when the dictionary reaches a certain configurable size > threshold. > We should think about how we can model this in our in-memory data structures, > and how it can flow through to relevant computational components (i.e. > certain data flow observers -- like an Aggregation -- might need to be able > to process either a dense or dictionary encoded version of a particular array > in the same stream) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4083) [C++] Allowing ChunkedArrays to contain a mix of DictionaryArray and dense Array (of the dictionary type)
[ https://issues.apache.org/jira/browse/ARROW-4083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4083: Labels: dataframe (was: ) > [C++] Allowing ChunkedArrays to contain a mix of DictionaryArray and dense > Array (of the dictionary type) > - > > Key: ARROW-4083 > URL: https://issues.apache.org/jira/browse/ARROW-4083 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: dataframe > Fix For: 0.14.0 > > > In some applications we may receive a stream of some dictionary encoded data > followed by some non-dictionary encoded data. For example this happens in > Parquet files when the dictionary reaches a certain configurable size > threshold. > We should think about how we can model this in our in-memory data structures, > and how it can flow through to relevant computational components (i.e. > certain data flow observers -- like an Aggregation -- might need to be able > to process either a dense or dictionary encoded version of a particular array > in the same stream) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4083) [C++] Allowing ChunkedArrays to contain a mix of DictionaryArray and dense Array (of the dictionary type)
[ https://issues.apache.org/jira/browse/ARROW-4083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4083: Fix Version/s: (was: 0.14.0) > [C++] Allowing ChunkedArrays to contain a mix of DictionaryArray and dense > Array (of the dictionary type) > - > > Key: ARROW-4083 > URL: https://issues.apache.org/jira/browse/ARROW-4083 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: dataframe > > In some applications we may receive a stream of some dictionary encoded data > followed by some non-dictionary encoded data. For example this happens in > Parquet files when the dictionary reaches a certain configurable size > threshold. > We should think about how we can model this in our in-memory data structures, > and how it can flow through to relevant computational components (i.e. > certain data flow observers -- like an Aggregation -- might need to be able > to process either a dense or dictionary encoded version of a particular array > in the same stream) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
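The data-flow-observer scenario in the issue (an aggregation that must accept either dense or dictionary-encoded chunks in one stream) can be sketched in pure Python; the encodings and names below are illustrative simplifications, not Arrow data structures.

```python
# A dictionary-encoded chunk is modeled as (indices, dictionary);
# a dense chunk is just a list of values.

def decode(chunk):
    """Materialize a chunk to dense values regardless of its encoding."""
    if isinstance(chunk, tuple):
        indices, dictionary = chunk
        return [dictionary[i] for i in indices]
    return chunk

def aggregate_sum(chunks):
    # The observer handles both encodings in the same stream.
    return sum(v for chunk in chunks for v in decode(chunk))

chunks = [([0, 1, 0], [10, 20]),   # dictionary-encoded: 10, 20, 10
          [5, 5]]                  # dense, e.g. after the dictionary grew too large
total = aggregate_sum(chunks)      # 50
```

A real implementation would avoid materializing the dictionary chunks (aggregating index counts against the dictionary instead), but the interface question is the same.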
[jira] [Updated] (ARROW-4076) [Python] schema validation and filters
[ https://issues.apache.org/jira/browse/ARROW-4076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4076: Labels: datasets easyfix parquet pull-request-available (was: easyfix parquet pull-request-available) > [Python] schema validation and filters > -- > > Key: ARROW-4076 > URL: https://issues.apache.org/jira/browse/ARROW-4076 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: George Sakkis >Priority: Minor > Labels: datasets, easyfix, parquet, pull-request-available > Fix For: 0.14.0 > > Time Spent: 40m > Remaining Estimate: 0h > > Currently [schema > validation|https://github.com/apache/arrow/blob/758bd557584107cb336cbc3422744dacd93978af/python/pyarrow/parquet.py#L900] > of {{ParquetDataset}} takes place before filtering. This may raise a > {{ValueError}} if the schema is different in some dataset pieces, even if > these pieces would be subsequently filtered out. I think validation should > happen after filtering to prevent such spurious errors: > {noformat} > --- a/pyarrow/parquet.py > +++ b/pyarrow/parquet.py > @@ -878,13 +878,13 @@ > if split_row_groups: > raise NotImplementedError("split_row_groups not yet implemented") > > -if validate_schema: > -self.validate_schemas() > - > if filters is not None: > filters = _check_filters(filters) > self._filter(filters) > > +if validate_schema: > +self.validate_schemas() > + > def validate_schemas(self): > open_file = self._get_open_file_func() > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4057) [Python] Revamp handling of file URIs in pyarrow.parquet
[ https://issues.apache.org/jira/browse/ARROW-4057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4057: Fix Version/s: (was: 0.14.0) > [Python] Revamp handling of file URIs in pyarrow.parquet > > > Key: ARROW-4057 > URL: https://issues.apache.org/jira/browse/ARROW-4057 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Priority: Major > Labels: parquet > > The way this is being handled currently is pretty brittle. If the HDFS > cluster being used to run the unit tests does not support writes from > {{$USER}} then the tests fail (e.g. the only permissioned user in the > docker-compose cluster is "root", so the unit tests cannot be run) > I'm inserting various hacks to get the tests passing for now, but they are > temporary. There is code relating to path and URI handling spread throughout > the parquet module; it would be much better to centralize and clean this up -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4067) [C++] RFC: standardize ArrayBuilder subclasses
[ https://issues.apache.org/jira/browse/ARROW-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4067: Fix Version/s: (was: 0.14.0) > [C++] RFC: standardize ArrayBuilder subclasses > -- > > Key: ARROW-4067 > URL: https://issues.apache.org/jira/browse/ARROW-4067 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Benjamin Kietzman >Priority: Minor > Labels: usability > > Each builder supports different and frequently differently named methods for > appending. It should be possible to establish a more consistent convention, > which would alleviate dev confusion and simplify generics. > For example, let all Builders be required to define at minimum: > * {{Reserve(int64_t)}} > * a nested type named {{Scalar}}, which is the canonical scalar appended to > this builder. Append other types may be supported for convenience. > * {{UnsafeAppend(Scalar)}} > * {{UnsafeAppendNull()}} > The other methods described below can be overridden if an optimization is > available or left defaulted (a CRTP helper can contain the default > implementations, for example {{Append(Scalar)}} would simply be a call to > Reserve then UnsafeAppend. > In addition to their unsafe equivalents, {{Append(Scalar)}} and > {{AppendNull()}} should be available for appending without manual capacity > maintenance. > It is not necessary for the rest of this RFC, but it would simplify builders > further if scalar append methods always had a single argument. For example, > this would mean abolishing {{BinaryBuilder::Append(const uint8_t*, int32_t)}} > in favor of {{BinaryBuilder::Append(basic_string_view)}}. There's no > runtime overhead involved in this change, and developers who have a pointer > and a length instead of a view can just construct one without boilerplate > using brace initialization: {code}b->Append({pointer, length});{code} > Unsafe and safe methods should be provided for appending multiple values as > well. 
The default implementation will be a trivial loop but if optimizations > are available then this could be overridden (for example instead of copying > bits one by one into a BooleanBuilder, bytes could be memcpy'd). Append > methods for multiple values should accept two arguments, the first of which > contains values and the second of which defines validity. The canonical > multiple append method has signature {{Status(array_view values, > const uint8_t* valid_bytes)}}, but other overloads and helpers could be > provided as well: > {code} > b->Append({{1, 3, 4}}, all_valid); // append values with no nulls > b->Append({{1, 3, 4}}, bool_vector); // use the elements of a vector > for validity > b->Append({{1, 3, 4}}, bits(ptr)); // interpret ptr as a buffer of valid > bits, rather than valid bytes > {code} > Builders of nested types currently require developers to write boilerplate > wrangling the child builders. This could be alleviated by letting nested > builders' append methods return a helper as an output argument: > {code} > ListBuilder::List lst; > RETURN_NOT_OK(list_builder.Append()); // ListBuilder::Scalar == > ListBuilder::ListBase* > RETURN_NOT_OK(lst->Append(3)); > RETURN_NOT_OK(lst->Append(4)); > StructBuilder::Struct strct; > RETURN_NOT_OK(struct_builder.Append()); > RETURN_NOT_OK(strct.Set(0, "uuid")); > RETURN_NOT_OK(strct.Set(2, 47)); > RETURN_NOT_OK(strct->Finish()); // appends null to unspecified fields > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
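The core convention proposed above -- builders implement only `Reserve` and `UnsafeAppend`, and a helper base supplies the safe variants -- can be sketched in Python (the C++ RFC uses CRTP for the same effect; names below are illustrative).

```python
class BuilderBase:
    """Default safe appends in terms of the two required primitives."""
    def append(self, value):
        self.reserve(1)
        self.unsafe_append(value)

    def append_null(self):
        self.reserve(1)
        self.unsafe_append_null()

class Int64Builder(BuilderBase):
    def __init__(self):
        self.values, self.valid, self.capacity = [], [], 0

    def reserve(self, n):
        # Python lists grow automatically; just track requested capacity.
        self.capacity = max(self.capacity, len(self.values) + n)

    def unsafe_append(self, value):
        self.values.append(value)
        self.valid.append(True)

    def unsafe_append_null(self):
        self.values.append(None)
        self.valid.append(False)

b = Int64Builder()
b.append(7)
b.append_null()
# b.values == [7, None], b.valid == [True, False]
```

A builder with a faster bulk path (e.g. memcpy-ing bytes into a BooleanBuilder) would simply override the defaulted method, as the RFC describes.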
[jira] [Updated] (ARROW-4022) [C++] RFC: promote Datum variant out of compute namespace
[ https://issues.apache.org/jira/browse/ARROW-4022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4022: Fix Version/s: (was: 0.14.0) > [C++] RFC: promote Datum variant out of compute namespace > - > > Key: ARROW-4022 > URL: https://issues.apache.org/jira/browse/ARROW-4022 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > > In working on ARROW-3762, I've found it's useful to be able to have functions > return either {{Array}} or {{ChunkedArray}}. We might consider promoting the > {{arrow::compute::Datum}} variant out of {{arrow/compute/kernel.h}} so it can > be used in other places where it's helpful -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4001) [Python] Create Parquet Schema in python
[ https://issues.apache.org/jira/browse/ARROW-4001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4001: Fix Version/s: (was: 0.14.0) > [Python] Create Parquet Schema in python > > > Key: ARROW-4001 > URL: https://issues.apache.org/jira/browse/ARROW-4001 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Affects Versions: 0.9.0 >Reporter: David Stauffer >Priority: Major > Labels: parquet > > Enable the creation of a Parquet schema in python. For functions like > pyarrow.parquet.ParquetDataset, a schema must be a Parquet schema. See: > https://stackoverflow.com/questions/53725691/pyarrow-lib-schema-vs-pyarrow-parquet-schema -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4046) [Python/CI] Run nightly large memory tests
[ https://issues.apache.org/jira/browse/ARROW-4046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4046: Fix Version/s: (was: 0.14.0) > [Python/CI] Run nightly large memory tests > -- > > Key: ARROW-4046 > URL: https://issues.apache.org/jira/browse/ARROW-4046 > Project: Apache Arrow > Issue Type: Task > Components: Continuous Integration, Python >Reporter: Krisztian Szucs >Priority: Major > Labels: nightly > > See comment https://github.com/apache/arrow/pull/3171#issuecomment-447156646 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4046) [Python/CI] Run nightly large memory tests
[ https://issues.apache.org/jira/browse/ARROW-4046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4046: Labels: nightly (was: ) > [Python/CI] Run nightly large memory tests > -- > > Key: ARROW-4046 > URL: https://issues.apache.org/jira/browse/ARROW-4046 > Project: Apache Arrow > Issue Type: Task > Components: Continuous Integration, Python >Reporter: Krisztian Szucs >Priority: Major > Labels: nightly > Fix For: 0.14.0 > > > See comment https://github.com/apache/arrow/pull/3171#issuecomment-447156646 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Closed] (ARROW-5445) [Website] Remove language that encourages pinning a version
[ https://issues.apache.org/jira/browse/ARROW-5445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou closed ARROW-5445. --- Resolution: Won't Fix https://github.com/apache/arrow/pull/4411#discussion_r288957237 {quote} Version pinning is commonplace in the Python world -- I don't think API stability has much to do with it (we will still have some API changes or deprecations after 1.0 I would guess) {quote} > [Website] Remove language that encourages pinning a version > --- > > Key: ARROW-5445 > URL: https://issues.apache.org/jira/browse/ARROW-5445 > Project: Apache Arrow > Issue Type: Improvement > Components: Website >Reporter: Neal Richardson >Priority: Minor > Fix For: 1.0.0 > > > See [https://github.com/apache/arrow/pull/4411#discussion_r288804415]. > Whenever we decide to stop threatening to break APIs (1.0 release or > otherwise), purge any recommendations like this. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3896) [MATLAB] Decouple MATLAB-Arrow conversion logic from Feather file specific logic
[ https://issues.apache.org/jira/browse/ARROW-3896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3896: Fix Version/s: (was: 0.14.0) > [MATLAB] Decouple MATLAB-Arrow conversion logic from Feather file specific > logic > > > Key: ARROW-3896 > URL: https://issues.apache.org/jira/browse/ARROW-3896 > Project: Apache Arrow > Issue Type: Improvement > Components: MATLAB >Reporter: Kevin Gurney >Assignee: Kevin Gurney >Priority: Major > Original Estimate: 72h > Remaining Estimate: 72h > > Currently, the logic for converting between a MATLAB mxArray and various > Arrow data structures (arrow::Table, arrow::Array, etc.) is tightly coupled > and fairly tangled up with the logic specific to handling Feather files. It > would be helpful to factor out these conversions into a more generic > "mlarrow" conversion layer component so that it can be reused in the future > for use cases other than Feather support. Furthermore, this would be helpful > to enforce a cleaner separation of concerns. > It would be nice to start off with this refactoring work up front before > adding support for more datatypes to the MATLAB featherread/featherwrite > functions, so that we can start off with a clean base upon which to expand > moving forward. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3919) [Python] Support 64 bit indices for pyarrow.serialize and pyarrow.deserialize
[ https://issues.apache.org/jira/browse/ARROW-3919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3919: Fix Version/s: (was: 0.14.0) > [Python] Support 64 bit indices for pyarrow.serialize and pyarrow.deserialize > - > > Key: ARROW-3919 > URL: https://issues.apache.org/jira/browse/ARROW-3919 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Philipp Moritz >Assignee: Philipp Moritz >Priority: Major > Labels: pull-request-available > Time Spent: 1h 40m > Remaining Estimate: 0h > > see https://github.com/modin-project/modin/issues/266 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3901) [Python] Make Schema hashable
[ https://issues.apache.org/jira/browse/ARROW-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3901: Fix Version/s: (was: 0.14.0) > [Python] Make Schema hashable > - > > Key: ARROW-3901 > URL: https://issues.apache.org/jira/browse/ARROW-3901 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > > Currently pa.Schema is not hashable, however all of its components are > hashable -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3873) [C++] Build shared libraries consistently with -fvisibility=hidden
[ https://issues.apache.org/jira/browse/ARROW-3873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3873: Fix Version/s: (was: 0.14.0) 0.15.0 > [C++] Build shared libraries consistently with -fvisibility=hidden > -- > > Key: ARROW-3873 > URL: https://issues.apache.org/jira/browse/ARROW-3873 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > See https://github.com/apache/arrow/pull/2437 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3873) [C++] Build shared libraries consistently with -fvisibility=hidden
[ https://issues.apache.org/jira/browse/ARROW-3873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852552#comment-16852552 ] Wes McKinney commented on ARROW-3873: - I might take another crack at this to see if it is doable, but after 0.14 > [C++] Build shared libraries consistently with -fvisibility=hidden > -- > > Key: ARROW-3873 > URL: https://issues.apache.org/jira/browse/ARROW-3873 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > See https://github.com/apache/arrow/pull/2437 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3801) [Python] Pandas-Arrow roundtrip makes pd categorical index not writeable
[ https://issues.apache.org/jira/browse/ARROW-3801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852549#comment-16852549 ] Wes McKinney commented on ARROW-3801: - cc [~jorisvandenbossche] > [Python] Pandas-Arrow roundtrip makes pd categorical index not writeable > > > Key: ARROW-3801 > URL: https://issues.apache.org/jira/browse/ARROW-3801 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.10.0 >Reporter: Thomas Buhrmann >Priority: Major > Fix For: 0.14.0 > > > Serializing and deserializing a pandas series with categorical dtype will > make the categorical index non-writeable, which in turn trips up pandas when > e.g. reordering the categories, raising "ValueError: buffer source array is > read-only" : > {code} > import pandas as pd > import pyarrow as pa > df = pd.Series([1,2,3], dtype='category', name="c1").to_frame() > print("DType before:", repr(df.c1.dtype)) > print("Writeable:", df.c1.cat.categories.values.flags.writeable) > ro = df.c1.cat.reorder_categories([3,2,1]) > print("DType reordered:", repr(ro.dtype), "\n") > tbl = pa.Table.from_pandas(df) > df2 = tbl.to_pandas() > print("DType after:", repr(df2.c1.dtype)) > print("Writeable:", df2.c1.cat.categories.values.flags.writeable) > ro = df2.c1.cat.reorder_categories([3,2,1]) > print("DType reordered:", repr(ro.dtype), "\n") > {code} > > Outputs: > > {code:java} > DType before: CategoricalDtype(categories=[1, 2, 3], ordered=False) > Writeable: True > DType reordered: CategoricalDtype(categories=[3, 2, 1], ordered=False) > DType after: CategoricalDtype(categories=[1, 2, 3], ordered=False) > Writeable: False > --- > ValueError Traceback (most recent call last) > in > 12 print("DType after:", repr(df2.c1.dtype)) > 13 print("Writeable:", df2.c1.cat.categories.values.flags.writeable) > ---> 14 ro = df2.c1.cat.reorder_categories([3,2,1]) > 15 print("DType reordered:", repr(ro.dtype), "\n") > {code} > > > -- This message was sent by Atlassian JIRA 
(v7.6.3#76005)
[jira] [Updated] (ARROW-3806) [Python] When converting nested types to pandas, use tuples
[ https://issues.apache.org/jira/browse/ARROW-3806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3806: Fix Version/s: (was: 0.14.0) > [Python] When converting nested types to pandas, use tuples > --- > > Key: ARROW-3806 > URL: https://issues.apache.org/jira/browse/ARROW-3806 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Affects Versions: 0.11.1 > Environment: Fedora 29, pyarrow installed with conda >Reporter: Suvayu Ali >Priority: Minor > Labels: pandas > > When converting to pandas, convert nested types (e.g. list) to tuples. > Columns with lists are difficult to query. Here are a few unsuccessful > attempts: > {code} > >>> mini > CHROMPOS IDREFALTS QUAL > 80 20 63521 rs191905748 G [A] 100 > 81 20 63541 rs117322527 C [A] 100 > 82 20 63548 rs541129280 G[GT] 100 > 83 20 63553 rs536661806 T [C] 100 > 84 20 63555 rs553463231 T [C] 100 > 85 20 63559 rs138359120 C [A] 100 > 86 20 63586 rs545178789 T [G] 100 > 87 20 63636 rs374311122 G [A] 100 > 88 20 63696 rs149160003 A [G] 100 > 89 20 63698 rs544072005 A [C] 100 > 90 20 63729 rs181483669 G [A] 100 > 91 20 63733 rs75670495 C [T] 100 > 92 20 63799rs1418258 C [T] 100 > 93 20 63808 rs76004960 G [C] 100 > 94 20 63813 rs532151719 G [A] 100 > 95 20 63857 rs543686274 CCTGGAAAGGATT [C] 100 > 96 20 63865 rs551938596 G [A] 100 > 97 20 63902 rs571779099 A [T] 100 > 98 20 63963 rs531152674 G [A] 100 > 99 20 63967 rs116770801 A [G] 100 > 10020 63977 rs199703510 C [G] 100 > 10120 64016 rs143263863 G [A] 100 > 10220 64062 rs148297240 G [A] 100 > 10320 64139 rs186497980 G [A, T] 100 > 10420 64150rs7274499 C [A] 100 > 10520 64151 rs190945171 C [T] 100 > 10620 64154 rs537656456 T [G] 100 > 10720 64175 rs116531220 A [G] 100 > 10820 64186 rs141793347 C [G] 100 > 10920 64210 rs182418654 G [C] 100 > 11020 64303 rs559929739 C [A] 100 > {code} > # I think this one fails because it tries to broadcast the comparison. 
> {code} > >>> mini[mini.ALTS == ["A", "T"]] > Traceback (most recent call last): > File "", line 1, in > File > "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/ops.py", line > 1283, in wrapper > res = na_op(values, other) > File > "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/ops.py", line > 1143, in na_op > result = _comp_method_OBJECT_ARRAY(op, x, y) > File > "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/ops.py", line > 1120, in _comp_method_OBJECT_ARRAY > result = libops.vec_compare(x, y, op) > File "pandas/_libs/ops.pyx", line 128, in pandas._libs.ops.vec_compare > ValueError: Arrays were different lengths: 31 vs 2 > {code} > # I think this fails due to a similar reason, but the broadcasting is > happening at a different place. > {code} > >>> mini[mini.ALTS.apply(lambda x: x == ["A", "T"])] > Traceback (most recent call last): > File "", line 1, in > File > "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/frame.py", > line 2682, in __getitem__ > return self._getitem_array(key) > File > "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/frame.py", > line 2726, in _getitem_array > indexer = self.loc._convert_to_indexer(key, axis=1) > File > "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/indexing.py", > line 1314, in _convert_to_indexer > indexer = check = labels.get_indexer(objarr) > File > "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/indexes/base.py", > line 3259, in get_indexer > indexer = self._engine.get_indexer(target._ndarray_values) > File "pandas/_libs/index.pyx", line 301, in > pandas._libs.index.IndexEngine.get_indexer > File "pandas/_libs/hashtable_class_helper.pxi", line 1544, in > pandas._libs.hashtable.PyObjectHashTable.lookup > TypeError: unhashable type: 'numpy.ndarray' > >>> mini.ALTS.apply(lambda x: x == ["A", "T"]).head() > 80 [True, False] > 81 [True, False] > 82[False, False] > 83
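The failed queries above illustrate the motivation: lists are unhashable and equality against them broadcasts. Once rows are tuples, they behave as hashable scalars and isin() works. A minimal sketch using an illustrative two-row subset of the report's frame:

```python
import pandas as pd

# Illustrative subset; ALTS holds variable-length lists per row.
mini = pd.DataFrame({
    "POS": [64139, 64150],
    "ALTS": [["A", "T"], ["A"]],
})

# Convert each list to a tuple, then membership tests become possible.
alts = mini.ALTS.apply(tuple)
hits = mini[alts.isin([("A", "T")])]
```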
[jira] [Updated] (ARROW-3827) [Rust] Implement UnionArray
[ https://issues.apache.org/jira/browse/ARROW-3827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3827: Fix Version/s: (was: 0.14.0) > [Rust] Implement UnionArray > --- > > Key: ARROW-3827 > URL: https://issues.apache.org/jira/browse/ARROW-3827 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust >Reporter: Paddy Horan >Assignee: Paddy Horan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3789) [Python] Enable calling object in Table.to_pandas to "self-destruct" for improved memory use
[ https://issues.apache.org/jira/browse/ARROW-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3789: Fix Version/s: (was: 0.14.0) > [Python] Enable calling object in Table.to_pandas to "self-destruct" for > improved memory use > > > Key: ARROW-3789 > URL: https://issues.apache.org/jira/browse/ARROW-3789 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Priority: Major > > One issue with using {{Table.to_pandas}} is that it results in a memory > doubling (at least, more if there are a lot of Python objects created). It > would be useful if there was an option to destroy the {{arrow::Column}} > references once they've been transferred into the target data frame. This > would render the {{pyarrow.Table}} object useless afterward -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3764) [C++] Port Python "ParquetDataset" business logic to C++
[ https://issues.apache.org/jira/browse/ARROW-3764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3764: Fix Version/s: (was: 0.14.0) 0.15.0 > [C++] Port Python "ParquetDataset" business logic to C++ > > > Key: ARROW-3764 > URL: https://issues.apache.org/jira/browse/ARROW-3764 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: datasets, parquet > Fix For: 0.15.0 > > > Along with defining appropriate abstractions for dealing with generic > filesystems in C++, we should implement the machinery for reading multiple > Parquet files in C++ so that it can reused in GLib, R, and Ruby. Otherwise > these languages will have to reimplement things, and this would surely result > in inconsistent features, bugs in some implementations but not others -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3759) [R][CI] Build and test on Windows in Appveyor
[ https://issues.apache.org/jira/browse/ARROW-3759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852548#comment-16852548 ] Wes McKinney commented on ARROW-3759: - cc [~npr] > [R][CI] Build and test on Windows in Appveyor > - > > Key: ARROW-3759 > URL: https://issues.apache.org/jira/browse/ARROW-3759 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration, R >Reporter: Wes McKinney >Priority: Major > Fix For: 0.14.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3730) [Python] Output a representation of pyarrow.Schema that can be used to reconstruct a schema in a script
[ https://issues.apache.org/jira/browse/ARROW-3730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852547#comment-16852547 ] Wes McKinney commented on ARROW-3730: - cc [~jorisvandenbossche] > [Python] Output a representation of pyarrow.Schema that can be used to > reconstruct a schema in a script > --- > > Key: ARROW-3730 > URL: https://issues.apache.org/jira/browse/ARROW-3730 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Priority: Major > Fix For: 0.14.0 > > > This would be like what {{__repr__}} is used for in many built-in Python > types, or a schema as a list of tuples that can be passed to > {{pyarrow.schema}} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3758) [R] Build R library on Windows, document build instructions for Windows developers
[ https://issues.apache.org/jira/browse/ARROW-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3758: Fix Version/s: (was: 0.14.0) 0.15.0 > [R] Build R library on Windows, document build instructions for Windows > developers > -- > > Key: ARROW-3758 > URL: https://issues.apache.org/jira/browse/ARROW-3758 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Wes McKinney >Priority: Major > Fix For: 0.15.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)