[jira] [Updated] (ARROW-6314) [C++] Implement changes to ensure flatbuffer alignment.
[ https://issues.apache.org/jira/browse/ARROW-6314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-6314:
----------------------------------
    Labels: pull-request-available  (was: )

> [C++] Implement changes to ensure flatbuffer alignment.
> --------------------------------------------------------
>
> Key: ARROW-6314
> URL: https://issues.apache.org/jira/browse/ARROW-6314
> Project: Apache Arrow
> Issue Type: Sub-task
> Components: C++
> Reporter: Micah Kornfield
> Assignee: Wes McKinney
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 0.15.0

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
[jira] [Created] (ARROW-6372) [Rust][DataFusion] Predicate push down optimization can break query plan
Paddy Horan created ARROW-6372:
-------------------------------
Summary: [Rust][DataFusion] Predicate push down optimization can break query plan
Key: ARROW-6372
URL: https://issues.apache.org/jira/browse/ARROW-6372
Project: Apache Arrow
Issue Type: Bug
Components: Rust - DataFusion
Affects Versions: 0.14.1
Reporter: Paddy Horan
Fix For: 0.15.0

The following code reproduces the issue:
https://gist.github.com/paddyhoran/598db6cbb790fc5497320613e54a02c6
If you disable the predicate push down optimization, it works fine.

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
[jira] [Commented] (ARROW-6371) [Doc] Row to columnar conversion example mentions arrow::Column in comments
[ https://issues.apache.org/jira/browse/ARROW-6371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917300#comment-16917300 ]

Wes McKinney commented on ARROW-6371:
-------------------------------------

Thanks, can you submit a PR to fix?

> [Doc] Row to columnar conversion example mentions arrow::Column in comments
> ---------------------------------------------------------------------------
>
> Key: ARROW-6371
> URL: https://issues.apache.org/jira/browse/ARROW-6371
> Project: Apache Arrow
> Issue Type: Bug
> Components: Documentation
> Reporter: Omer Ozarslan
> Priority: Minor
>
> https://arrow.apache.org/docs/cpp/examples/row_columnar_conversion.html
> {code:cpp}
> // The final representation should be an `arrow::Table` which in turn is made up of
> // an `arrow::Schema` and a list of `arrow::Column`. An `arrow::Column` is again a
> // named collection of one or more `arrow::Array` instances. As the first step, we
> // will iterate over the data and build up the arrays incrementally.
> {code}

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
[jira] [Resolved] (ARROW-3829) [Python] Support protocols to extract Arrow objects from third-party classes
[ https://issues.apache.org/jira/browse/ARROW-3829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney resolved ARROW-3829.
---------------------------------
    Resolution: Fixed

Issue resolved by pull request 5106
https://github.com/apache/arrow/pull/5106

> [Python] Support protocols to extract Arrow objects from third-party classes
> ----------------------------------------------------------------------------
>
> Key: ARROW-3829
> URL: https://issues.apache.org/jira/browse/ARROW-3829
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Uwe L. Korn
> Assignee: Joris Van den Bossche
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.15.0
>
> Time Spent: 5h
> Remaining Estimate: 0h
>
> In the style of NumPy's {{__array__}}, we should be able to ask inputs to
> {{pa.array}}, {{pa.Table.from_X}}, ... whether they can convert themselves to
> Arrow objects. This would allow, for example, objects that hold an Arrow
> object internally to expose it directly instead of going through a conversion
> path.
> My current use case involves pandas {{ExtensionArray}} instances that
> internally hold Arrow objects and should be reused when we pass the whole
> {{DataFrame}} to {{pa.Table.from_pandas}}.

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
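The protocol described in the issue works like NumPy's `__array__`: the converter first asks the input whether it can produce an Arrow object itself. A minimal pure-Python sketch of that dispatch, under the assumption that the hook is named `__arrow_array__` as in the issue; the `to_arrow` function and `WrappedArray` class below are illustrative stand-ins, not pyarrow API:

```python
def to_arrow(obj):
    """Convert obj to an Arrow-style value, preferring the object's own
    __arrow_array__ hook (mirrors NumPy's __array__ protocol)."""
    hook = getattr(obj, "__arrow_array__", None)
    if hook is not None:
        # The object already holds (or can cheaply build) Arrow data:
        # reuse it directly instead of round-tripping through Python values.
        return hook()
    raise TypeError("cannot convert %r to an Arrow array" % (obj,))


class WrappedArray:
    """Hypothetical container that stores an Arrow-style array internally,
    e.g. a pandas ExtensionArray backed by Arrow memory."""

    def __init__(self, values):
        self._values = list(values)

    def __arrow_array__(self, type=None):
        # Hand back the internal data without copying through Python objects.
        return self._values
```

With this shape, `pa.array(WrappedArray([1, 2, 3]))`-style calls can short-circuit to the object's own Arrow data.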
[jira] [Commented] (ARROW-6358) [C++] FileSystem::DeleteDir should make it optional to delete the directory itself
[ https://issues.apache.org/jira/browse/ARROW-6358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917191#comment-16917191 ]

Rok Mihevc commented on ARROW-6358:
-----------------------------------

Got it. Thanks!

> [C++] FileSystem::DeleteDir should make it optional to delete the directory itself
> ----------------------------------------------------------------------------------
>
> Key: ARROW-6358
> URL: https://issues.apache.org/jira/browse/ARROW-6358
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Affects Versions: 0.14.1
> Reporter: Antoine Pitrou
> Priority: Major
>
> In some situations, it can be desirable to delete the entirety of a
> directory's contents, but not the directory itself (e.g. when it's a S3
> bucket). Perhaps we should add an option for that.

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
[jira] [Commented] (ARROW-6358) [C++] FileSystem::DeleteDir should make it optional to delete the directory itself
[ https://issues.apache.org/jira/browse/ARROW-6358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917147#comment-16917147 ]

Antoine Pitrou commented on ARROW-6358:
---------------------------------------

This is doable using {{SubTreeFileSystem}}.

> [C++] FileSystem::DeleteDir should make it optional to delete the directory itself
> ----------------------------------------------------------------------------------
>
> Key: ARROW-6358
> URL: https://issues.apache.org/jira/browse/ARROW-6358
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Affects Versions: 0.14.1
> Reporter: Antoine Pitrou
> Priority: Major

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
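The `SubTreeFileSystem` suggestion scopes every operation under a fixed prefix, so deleting the contents of the sub-tree's root removes everything inside the bucket but never the bucket itself. A rough pure-Python sketch of that idea; the `DictFS` and `SubTree` classes below are illustrative toys, not Arrow's C++ API:

```python
class DictFS:
    """Toy flat filesystem: maps path strings to file contents."""

    def __init__(self):
        self.files = {}

    def delete_children(self, prefix):
        # Delete every entry strictly under `prefix`, but not the
        # prefix entry itself.
        self.files = {p: v for p, v in self.files.items()
                      if not p.startswith(prefix + "/")}


class SubTree:
    """Scope all paths under `base`, mimicking the SubTreeFileSystem idea."""

    def __init__(self, fs, base):
        self.fs = fs
        self.base = base.rstrip("/")

    def delete_dir_contents(self, path=""):
        full = self.base if not path else self.base + "/" + path
        # Only the children of `full` go away; `full` itself
        # (e.g. the S3 bucket) stays in place.
        self.fs.delete_children(full)
```

Treating the bucket as the sub-tree root thus gives "delete contents but keep the directory" without a new option on `DeleteDir`.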
[jira] [Commented] (ARROW-6358) [C++] FileSystem::DeleteDir should make it optional to delete the directory itself
[ https://issues.apache.org/jira/browse/ARROW-6358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917142#comment-16917142 ]

Rok Mihevc commented on ARROW-6358:
-----------------------------------

Can you treat the bucket as the root of the filesystem?

> [C++] FileSystem::DeleteDir should make it optional to delete the directory itself
> ----------------------------------------------------------------------------------
>
> Key: ARROW-6358
> URL: https://issues.apache.org/jira/browse/ARROW-6358
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Affects Versions: 0.14.1
> Reporter: Antoine Pitrou
> Priority: Major

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
[jira] [Created] (ARROW-6371) [Doc] Row to columnar conversion example mentions arrow::Column in comments
Omer Ozarslan created ARROW-6371:
---------------------------------
Summary: [Doc] Row to columnar conversion example mentions arrow::Column in comments
Key: ARROW-6371
URL: https://issues.apache.org/jira/browse/ARROW-6371
Project: Apache Arrow
Issue Type: Bug
Components: Documentation
Reporter: Omer Ozarslan

https://arrow.apache.org/docs/cpp/examples/row_columnar_conversion.html

{code:cpp}
// The final representation should be an `arrow::Table` which in turn is made up of
// an `arrow::Schema` and a list of `arrow::Column`. An `arrow::Column` is again a
// named collection of one or more `arrow::Array` instances. As the first step, we
// will iterate over the data and build up the arrays incrementally.
{code}

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
[jira] [Assigned] (ARROW-5960) [C++] Boost dependencies are specified in wrong order
[ https://issues.apache.org/jira/browse/ARROW-5960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoine Pitrou reassigned ARROW-5960:
-------------------------------------
    Assignee: Ingo Müller

> [C++] Boost dependencies are specified in wrong order
> -----------------------------------------------------
>
> Key: ARROW-5960
> URL: https://issues.apache.org/jira/browse/ARROW-5960
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Affects Versions: 0.14.0
> Reporter: Ingo Müller
> Assignee: Ingo Müller
> Priority: Minor
> Labels: pull-request-available
> Fix For: 0.15.0
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> The Boost dependencies in cpp/CMakeLists.txt are specified in the wrong
> order: the system library currently comes first, followed by the filesystem
> library. They should be specified in the opposite order, as filesystem
> depends on system.
> Whether this problem becomes apparent seems to depend on the Boost version
> or how it was compiled. I am currently configuring the project like this:
> {code:java}
> CXX=clang++-7.0 CC=clang-7.0 \
> cmake \
>   -DCMAKE_CXX_STANDARD=17 \
>   -DCMAKE_INSTALL_PREFIX=/tmp/arrow4/dist \
>   -DCMAKE_INSTALL_LIBDIR=lib \
>   -DARROW_WITH_RAPIDJSON=ON \
>   -DARROW_PARQUET=ON \
>   -DARROW_PYTHON=ON \
>   -DARROW_FLIGHT=OFF \
>   -DARROW_GANDIVA=OFF \
>   -DARROW_BUILD_UTILITIES=OFF \
>   -DARROW_CUDA=OFF \
>   -DARROW_ORC=OFF \
>   -DARROW_JNI=OFF \
>   -DARROW_TENSORFLOW=OFF \
>   -DARROW_HDFS=OFF \
>   -DARROW_BUILD_TESTS=OFF \
>   -DARROW_RPATH_ORIGIN=ON \
>   ..{code}
> After compiling, libarrow.so is missing symbols:
> {code:java}
> nm -C /dist/lib/libarrow.so | grep boost::system::system_c
> U boost::system::system_category(){code}
> This seems to be related to whether or not Boost has been compiled with
> {{BOOST_SYSTEM_NO_DEPRECATED}} (according to [this
> post|https://stackoverflow.com/a/30877725/651937], anyway). I have to say
> that I don't understand why Boost as BUNDLED should be compiled that way...
>
> If I apply the following patch, everything works as expected:
>
> {code:java}
> diff -pur a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt
> --- a/cpp/CMakeLists.txt 2019-06-29 00:26:37.0 +0200
> +++ b/cpp/CMakeLists.txt 2019-07-16 16:36:03.980153919 +0200
> @@ -642,8 +642,8 @@ if(ARROW_STATIC_LINK_LIBS)
>    add_dependencies(arrow_dependencies ${ARROW_STATIC_LINK_LIBS})
>  endif()
> -set(ARROW_SHARED_PRIVATE_LINK_LIBS ${ARROW_STATIC_LINK_LIBS} ${BOOST_SYSTEM_LIBRARY}
> -                                   ${BOOST_FILESYSTEM_LIBRARY} ${BOOST_REGEX_LIBRARY})
> +set(ARROW_SHARED_PRIVATE_LINK_LIBS ${ARROW_STATIC_LINK_LIBS} ${BOOST_FILESYSTEM_LIBRARY}
> +                                   ${BOOST_SYSTEM_LIBRARY} ${BOOST_REGEX_LIBRARY})
>  list(APPEND ARROW_STATIC_LINK_LIBS ${BOOST_SYSTEM_LIBRARY} ${BOOST_FILESYSTEM_LIBRARY}
>              ${BOOST_REGEX_LIBRARY}){code}

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
[jira] [Resolved] (ARROW-5960) [C++] Boost dependencies are specified in wrong order
[ https://issues.apache.org/jira/browse/ARROW-5960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoine Pitrou resolved ARROW-5960.
-----------------------------------
    Fix Version/s: 0.15.0
       Resolution: Fixed

Issue resolved by pull request 5205
https://github.com/apache/arrow/pull/5205

> [C++] Boost dependencies are specified in wrong order
> -----------------------------------------------------
>
> Key: ARROW-5960
> URL: https://issues.apache.org/jira/browse/ARROW-5960
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Affects Versions: 0.14.0
> Reporter: Ingo Müller
> Priority: Minor
> Fix For: 0.15.0

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
[jira] [Updated] (ARROW-5960) [C++] Boost dependencies are specified in wrong order
[ https://issues.apache.org/jira/browse/ARROW-5960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-5960:
----------------------------------
    Labels: pull-request-available  (was: )

> [C++] Boost dependencies are specified in wrong order
> -----------------------------------------------------
>
> Key: ARROW-5960
> URL: https://issues.apache.org/jira/browse/ARROW-5960
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Affects Versions: 0.14.0
> Reporter: Ingo Müller
> Priority: Minor
> Labels: pull-request-available
> Fix For: 0.15.0

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
[jira] [Updated] (ARROW-6242) [C++] Implements basic Dataset/Scanner/ScannerBuilder
[ https://issues.apache.org/jira/browse/ARROW-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-6242:
----------------------------------
    Labels: dataset pull-request-available  (was: dataset)

> [C++] Implements basic Dataset/Scanner/ScannerBuilder
> -----------------------------------------------------
>
> Key: ARROW-6242
> URL: https://issues.apache.org/jira/browse/ARROW-6242
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++
> Reporter: Francois Saint-Jacques
> Assignee: Francois Saint-Jacques
> Priority: Major
> Labels: dataset, pull-request-available
>
> The goal of this would be to iterate over a Dataset and generate a
> "flattened" stream of RecordBatches from the union of data sources and data
> fragments. This should not bother with filtering yet.

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
[jira] [Resolved] (ARROW-6364) [R] Handling unexpected input to time64() et al
[ https://issues.apache.org/jira/browse/ARROW-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoine Pitrou resolved ARROW-6364.
-----------------------------------
    Resolution: Fixed

Issue resolved by pull request 5201
https://github.com/apache/arrow/pull/5201

> [R] Handling unexpected input to time64() et al
> -----------------------------------------------
>
> Key: ARROW-6364
> URL: https://issues.apache.org/jira/browse/ARROW-6364
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Reporter: Neal Richardson
> Assignee: Neal Richardson
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.15.0
>
> Time Spent: 1h 10m
> Remaining Estimate: 0h
>
> {code:r}
> > time64()
> Error in Time64__initialize(unit) :
>   argument "unit" is missing, with no default
> > time64("ms")
> Error in Time64__initialize(unit) :
>   Not compatible with requested type: [type=character; target=integer].
> > time64(1)
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> F0826 11:13:34.657388 162407872 type.cc:234] Check failed: unit == TimeUnit::MICRO || unit == TimeUnit::NANO Must be microseconds or nanoseconds
> *** Check failure stack trace: ***
> Abort trap: 6
> > time64(1L)
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> F0826 11:14:09.445202 251229632 type.cc:234] Check failed: unit == TimeUnit::MICRO || unit == TimeUnit::NANO Must be microseconds or nanoseconds
> *** Check failure stack trace: ***
> Abort trap: 6
> > time64("MILLI")
> Error in Time64__initialize(unit) :
>   Not compatible with requested type: [type=character; target=integer].
> > time64(TimeUnit$MILLI)
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> F0826 11:15:12.047847 361547200 type.cc:234] Check failed: unit == TimeUnit::MICRO || unit == TimeUnit::NANO Must be microseconds or nanoseconds
> *** Check failure stack trace: ***
> Abort trap: 6
> {code}

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
[jira] [Created] (ARROW-6370) [JS] Table.from adds 0 on int columns
Sascha Hofmann created ARROW-6370:
----------------------------------
Summary: [JS] Table.from adds 0 on int columns
Key: ARROW-6370
URL: https://issues.apache.org/jira/browse/ARROW-6370
Project: Apache Arrow
Issue Type: Bug
Components: JavaScript
Affects Versions: 0.14.1
Reporter: Sascha Hofmann

I am generating an Arrow table in pyarrow and sending it via gRPC like this:

{code:java}
sink = pa.BufferOutputStream()
writer = pa.RecordBatchStreamWriter(sink, batch.schema)
writer.write_batch(batch)
writer.close()
yield ds.Response(
    status=200,
    loading=False,
    response=[sink.getvalue().to_pybytes()]
)
{code}

On the JavaScript end, I parse it like this:

{code:java}
Table.from(response.getResponseList()[0])
{code}

That works, but when I look at the actual table, int columns have a 0 in every other row. String columns seem to be parsed just fine. The Python byte array created by to_pybytes() has the same length as the one received in JavaScript. I am also able to recreate the original table from the byte array in Python.

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
[jira] [Commented] (ARROW-6368) [C++] Add RecordBatch projection functionality
[ https://issues.apache.org/jira/browse/ARROW-6368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916880#comment-16916880 ]

Benjamin Kietzman commented on ARROW-6368:
------------------------------------------

If the operation is potentially expensive then we might want to avoid hiding it in the projector; in the case of augmenting as described above, the columns to augment can be generated once and then reused for each yielded batch.

> [C++] Add RecordBatch projection functionality
> ----------------------------------------------
>
> Key: ARROW-6368
> URL: https://issues.apache.org/jira/browse/ARROW-6368
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Benjamin Kietzman
> Assignee: Benjamin Kietzman
> Priority: Minor
> Labels: dataset
>
> Define classes RecordBatchProjector (which projects from one schema to
> another, augmenting with null/constant columns where necessary) and a subtype
> of RecordBatchIterator which projects each batch yielded by a wrapped
> iterator.

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
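The point in the comment is that any constant or null columns used to augment a batch can be built once, when the projector is constructed, and then reused for every batch the wrapped iterator yields. A hypothetical Python sketch of that shape, with plain dicts standing in for RecordBatch and illustrative names (`Projector`, `project_iter`) that are not Arrow API:

```python
class Projector:
    """Project dict-based 'batches' onto a target column list, filling
    missing columns with a value precomputed at construction time."""

    def __init__(self, out_columns, fill=None):
        self.out_columns = list(out_columns)
        # A potentially expensive augmentation value (e.g. a constant or
        # all-null column) is generated once here, not per batch.
        self.fill = fill

    def project(self, batch):
        # Reorder/select the batch's columns; absent ones get the
        # precomputed fill value.
        return {name: batch.get(name, self.fill) for name in self.out_columns}


def project_iter(projector, batches):
    """Wrap an iterator of batches, projecting each yielded batch."""
    for batch in batches:
        yield projector.project(batch)
```

This mirrors the RecordBatchProjector/projecting-iterator split described in the issue: the projector owns the (possibly expensive) schema mapping, and the iterator subtype only applies it per batch.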
[jira] [Commented] (ARROW-5101) [Packaging] Avoid bundling static libraries in Windows conda packages
[ https://issues.apache.org/jira/browse/ARROW-5101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916825#comment-16916825 ]

Wes McKinney commented on ARROW-5101:
-------------------------------------

[~kszucs] can you check? Do you have a Windows VM?

> [Packaging] Avoid bundling static libraries in Windows conda packages
> ---------------------------------------------------------------------
>
> Key: ARROW-5101
> URL: https://issues.apache.org/jira/browse/ARROW-5101
> Project: Apache Arrow
> Issue Type: Wish
> Components: C++, Packaging
> Affects Versions: 0.13.0
> Reporter: Antoine Pitrou
> Priority: Major
> Labels: conda
> Fix For: 0.15.0
>
> We're currently bundling static libraries in Windows conda packages.
> Unfortunately, it causes these to be quite large:
> {code:bash}
> $ ls -la ./Library/lib
> total 507808
> drwxrwxr-x 4 antoine antoine      4096 avril  3 10:28 .
> drwxrwxr-x 5 antoine antoine      4096 avril  3 10:28 ..
> -rw-rw-r-- 1 antoine antoine   1507048 avril  1 20:58 arrow.lib
> -rw-rw-r-- 1 antoine antoine     76184 avril  1 20:59 arrow_python.lib
> -rw-rw-r-- 1 antoine antoine  61323846 avril  1 21:00 arrow_python_static.lib
> -rw-rw-r-- 1 antoine antoine     32809 avril  1 21:02 arrow_static.lib
> drwxrwxr-x 3 antoine antoine      4096 avril  3 10:28 cmake
> -rw-rw-r-- 1 antoine antoine    491292 avril  1 21:02 parquet.lib
> -rw-rw-r-- 1 antoine antoine 128473780 avril  1 21:03 parquet_static.lib
> drwxrwxr-x 2 antoine antoine      4096 avril  3 10:27 pkgconfig
> {code}
> (see files in https://anaconda.org/conda-forge/arrow-cpp/files )
> We should probably only ship dynamic libraries under Windows, as those are
> reasonably small.

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
[jira] [Commented] (ARROW-6358) [C++] FileSystem::DeleteDir should make it optional to delete the directory itself
[ https://issues.apache.org/jira/browse/ARROW-6358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916821#comment-16916821 ]

Antoine Pitrou commented on ARROW-6358:
---------------------------------------

I don't think generic code should have to bother about the notion of "bucket", though. I'll have to think about it more.

> [C++] FileSystem::DeleteDir should make it optional to delete the directory itself
> ----------------------------------------------------------------------------------
>
> Key: ARROW-6358
> URL: https://issues.apache.org/jira/browse/ARROW-6358
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Affects Versions: 0.14.1
> Reporter: Antoine Pitrou
> Priority: Major

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
[jira] [Created] (ARROW-6369) [Python] Support list-of-boolean in Array.to_pandas conversion
Wes McKinney created ARROW-6369:
--------------------------------
Summary: [Python] Support list-of-boolean in Array.to_pandas conversion
Key: ARROW-6369
URL: https://issues.apache.org/jira/browse/ARROW-6369
Project: Apache Arrow
Issue Type: Bug
Components: Python
Reporter: Wes McKinney
Fix For: 0.15.0

See

{code}
In [4]: paste
a = pa.array(np.array([[True, False], [True, True, True]]))
## -- End pasted text --

In [5]: a
Out[5]:
[
  [
    true,
    false
  ],
  [
    true,
    true,
    true
  ]
]

In [6]: a.to_pandas()
---------------------------------------------------------------------------
ArrowNotImplementedError                  Traceback (most recent call last)
in
----> 1 a.to_pandas()

~/code/arrow/python/pyarrow/array.pxi in pyarrow.lib._PandasConvertible.to_pandas()
    439                           deduplicate_objects=deduplicate_objects)
    440
--> 441         return self._to_pandas(options, categories=categories,
    442                                ignore_metadata=ignore_metadata)
    443

~/code/arrow/python/pyarrow/array.pxi in pyarrow.lib.Array._to_pandas()
    815
    816         with nogil:
--> 817             check_status(ConvertArrayToPandas(c_options, self.sp_array,
    818                                               self, ))
    819         return wrap_array_output(out)

~/code/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()
     84         raise ArrowKeyError(message)
     85     elif status.IsNotImplemented():
---> 86         raise ArrowNotImplementedError(message)
     87     elif status.IsTypeError():
     88         raise ArrowTypeError(message)

ArrowNotImplementedError: Not implemented type for lists: bool
In ../src/arrow/python/arrow_to_pandas.cc, line 1910, code: VisitTypeInline(*data_->type(), this)
{code}

as reported in https://github.com/apache/arrow/issues/5203

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
[jira] [Commented] (ARROW-6358) [C++] FileSystem::DeleteDir should make it optional to delete the directory itself
[ https://issues.apache.org/jira/browse/ARROW-6358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916819#comment-16916819 ]

Wes McKinney commented on ARROW-6358:
-------------------------------------

So another way to "skin this cat" is simply to not delete buckets in any of the generic filesystem APIs, and instead add a "DeleteBucket" function that is S3-specific. Not sure how challenging this would be to incorporate into the testing regimen.

> [C++] FileSystem::DeleteDir should make it optional to delete the directory itself
> ----------------------------------------------------------------------------------
>
> Key: ARROW-6358
> URL: https://issues.apache.org/jira/browse/ARROW-6358
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Affects Versions: 0.14.1
> Reporter: Antoine Pitrou
> Priority: Major

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
[jira] [Commented] (ARROW-6301) [Python] atexit: pyarrow.lib.ArrowKeyError: 'No type extension with name arrow.py_extension_type found'
[ https://issues.apache.org/jira/browse/ARROW-6301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916814#comment-16916814 ]

Wes McKinney commented on ARROW-6301:
-------------------------------------

[~klichukb] do you still have the segfault on the master branch?

> [Python] atexit: pyarrow.lib.ArrowKeyError: 'No type extension with name arrow.py_extension_type found'
> -------------------------------------------------------------------------------------------------------
>
> Key: ARROW-6301
> URL: https://issues.apache.org/jira/browse/ARROW-6301
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.14.1
> Environment: linux, virtualenv, uwsgi, cpython 2.7
> Reporter: David Alphus
> Assignee: Wes McKinney
> Priority: Minor
> Labels: pull-request-available
> Fix For: 0.15.0
>
> Time Spent: 40m
> Remaining Estimate: 0h
>
> On interrupt, I am frequently seeing the atexit function failing in pyarrow 0.14.1.
> {code:java}
> ^CSIGINT/SIGQUIT received...killing workers...
> killing the spooler with pid 22640
> Error in atexit._run_exitfuncs:
> Traceback (most recent call last):
>   File "/home/alpha/.virtualenvs/wsgi/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
>     func(*targs, **kargs)
>   File "pyarrow/types.pxi", line 1860, in pyarrow.lib._unregister_py_extension_type
>     check_status(UnregisterPyExtensionType())
>   File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
>     raise ArrowKeyError(message)
> ArrowKeyError: 'No type extension with name arrow.py_extension_type found'
> Error in sys.exitfunc:
> Traceback (most recent call last):
>   File "/home/alpha/.virtualenvs/wsgi/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
>     func(*targs, **kargs)
>   File "pyarrow/types.pxi", line 1860, in pyarrow.lib._unregister_py_extension_type
>   File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
> pyarrow.lib.ArrowKeyError: 'No type extension with name arrow.py_extension_type found'
> spooler (pid: 22640) annihilated
> worker 1 buried after 1 seconds
> goodbye to uWSGI.{code}

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
[jira] [Commented] (ARROW-5131) [Python] Add Azure Datalake Filesystem Gen1 Wrapper for pyarrow
[ https://issues.apache.org/jira/browse/ARROW-5131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916809#comment-16916809 ]

Gregory Hayes commented on ARROW-5131:
--------------------------------------

Thanks. We can close this, knowing that's the strategic direction.

> [Python] Add Azure Datalake Filesystem Gen1 Wrapper for pyarrow
> ---------------------------------------------------------------
>
> Key: ARROW-5131
> URL: https://issues.apache.org/jira/browse/ARROW-5131
> Project: Apache Arrow
> Issue Type: Wish
> Components: Python
> Affects Versions: 0.12.1
> Reporter: Gregory Hayes
> Priority: Minor
> Labels: pull-request-available
>
> Time Spent: 5.5h
> Remaining Estimate: 0h
>
> The current pyarrow package can only read parquet files that have been
> written to Gen1 Azure Datalake using the fastparquet engine. This only works
> if the dask-adlfs package is explicitly installed and imported. I've added a
> method to the dask-adlfs package, found
> [here|https://github.com/dask/dask-adlfs], and issued a PR for that change.
> To support this capability, I added an ADLFSWrapper to the filesystem.py file.

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
[jira] [Commented] (ARROW-6058) [Python][Parquet] Failure when reading Parquet file from S3 with s3fs
[ https://issues.apache.org/jira/browse/ARROW-6058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916807#comment-16916807 ]

Wes McKinney commented on ARROW-6058:
-------------------------------------

Seems this can be fixed now by upgrading to {{fsspec=0.4.2}}

> [Python][Parquet] Failure when reading Parquet file from S3 with s3fs
> ---------------------------------------------------------------------
>
> Key: ARROW-6058
> URL: https://issues.apache.org/jira/browse/ARROW-6058
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Affects Versions: 0.14.1
> Reporter: Siddharth
> Assignee: Wes McKinney
> Priority: Major
> Labels: parquet, pull-request-available
> Fix For: 0.15.0
>
> Time Spent: 1h
> Remaining Estimate: 0h
>
> I am reading parquet data from S3 and get an ArrowIOError.
> Size of the data: 32 part files, 90 MB each (3 GB approx)
> Number of records: approx 100M
> Code snippet:
> {code:java}
> from s3fs import S3FileSystem
> import pyarrow.parquet as pq
> s3 = S3FileSystem()
> dataset = pq.ParquetDataset("s3://location", filesystem=s3)
> df = dataset.read_pandas().to_pandas()
> {code}
> Stack trace:
> {code:java}
> df = dataset.read_pandas().to_pandas()
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 1113, in read_pandas
>   return self.read(use_pandas_metadata=True, **kwargs)
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 1085, in read
>   use_pandas_metadata=use_pandas_metadata)
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 583, in read
>   table = reader.read(**options)
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 216, in read
>   use_threads=use_threads)
> File "pyarrow/_parquet.pyx", line 1086, in pyarrow._parquet.ParquetReader.read_all
> File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: Unexpected end of stream: Page was smaller (197092) than expected (263929)
> {code}
>
> *Note: the same code works on a relatively smaller dataset (approx < 50M records)*

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
[jira] [Comment Edited] (ARROW-6058) [Python][Parquet] Failure when reading Parquet file from S3 with s3fs
[ https://issues.apache.org/jira/browse/ARROW-6058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916807#comment-16916807 ]

Wes McKinney edited comment on ARROW-6058 at 8/27/19 3:28 PM:
--------------------------------------------------------------

Seems this can be fixed now by upgrading to {{fsspec==0.4.2}}

was (Author: wesmckinn):
Seems this can be fixed now by upgrading to {{fsspec=0.4.2}}

> [Python][Parquet] Failure when reading Parquet file from S3 with s3fs
> ---------------------------------------------------------------------
>
> Key: ARROW-6058
> URL: https://issues.apache.org/jira/browse/ARROW-6058
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Affects Versions: 0.14.1
> Reporter: Siddharth
> Assignee: Wes McKinney
> Priority: Major
> Labels: parquet, pull-request-available
> Fix For: 0.15.0

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
[jira] [Commented] (ARROW-6301) [Python] atexit: pyarrow.lib.ArrowKeyError: 'No type extension with name arrow.py_extension_type found'
[ https://issues.apache.org/jira/browse/ARROW-6301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916799#comment-16916799 ] Antoine Pitrou commented on ARROW-6301: --- [~klichukb] not to my knowledge, can you open a new issue? > [Python] atexit: pyarrow.lib.ArrowKeyError: 'No type extension with name > arrow.py_extension_type found' > --- > > Key: ARROW-6301 > URL: https://issues.apache.org/jira/browse/ARROW-6301 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.14.1 > Environment: linux, virtualenv, uwsgi, cpython 2.7 >Reporter: David Alphus >Assignee: Wes McKinney >Priority: Minor > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 40m > Remaining Estimate: 0h > > On interrupt, I am frequently seeing the atexit function failing in pyarrow > 0.14.1. > {code:java} > ^CSIGINT/SIGQUIT received...killing workers... > killing the spooler with pid 22640 > Error in atexit._run_exitfuncs: > Traceback (most recent call last): > File "/home/alpha/.virtualenvs/wsgi/lib/python2.7/atexit.py", line 24, in > _run_exitfuncs > func(*targs, **kargs) > File "pyarrow/types.pxi", line 1860, in > pyarrow.lib._unregister_py_extension_type > check_status(UnregisterPyExtensionType()) > File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status > raise ArrowKeyError(message) > ArrowKeyError: 'No type extension with name arrow.py_extension_type found' > Error in sys.exitfunc: > Traceback (most recent call last): > File "/home/alpha/.virtualenvs/wsgi/lib/python2.7/atexit.py", line 24, in > _run_exitfuncs > func(*targs, **kargs) > File "pyarrow/types.pxi", line 1860, in > pyarrow.lib._unregister_py_extension_type > File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status > pyarrow.lib.ArrowKeyError: 'No type extension with name > arrow.py_extension_type found' > spooler (pid: 22640) annihilated > worker 1 buried after 1 seconds > goodbye to uWSGI.{code} -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (ARROW-6301) [Python] atexit: pyarrow.lib.ArrowKeyError: 'No type extension with name arrow.py_extension_type found'
[ https://issues.apache.org/jira/browse/ARROW-6301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916798#comment-16916798 ] Bogdan Klichuk commented on ARROW-6301: --- Bumping this thread with a related segfault to the one [~david.alphus] saw during uWSGI atexit. I have a custom atexit handler for uWSGI graceful shutdown which uses pyarrow code, and it segfaults. Has an issue been created for this? > [Python] atexit: pyarrow.lib.ArrowKeyError: 'No type extension with name > arrow.py_extension_type found' > --- > > Key: ARROW-6301 > URL: https://issues.apache.org/jira/browse/ARROW-6301 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.14.1 > Environment: linux, virtualenv, uwsgi, cpython 2.7 >Reporter: David Alphus >Assignee: Wes McKinney >Priority: Minor > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 40m > Remaining Estimate: 0h > > On interrupt, I am frequently seeing the atexit function failing in pyarrow > 0.14.1. > {code:java} > ^CSIGINT/SIGQUIT received...killing workers...
> killing the spooler with pid 22640 > Error in atexit._run_exitfuncs: > Traceback (most recent call last): > File "/home/alpha/.virtualenvs/wsgi/lib/python2.7/atexit.py", line 24, in > _run_exitfuncs > func(*targs, **kargs) > File "pyarrow/types.pxi", line 1860, in > pyarrow.lib._unregister_py_extension_type > check_status(UnregisterPyExtensionType()) > File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status > raise ArrowKeyError(message) > ArrowKeyError: 'No type extension with name arrow.py_extension_type found' > Error in sys.exitfunc: > Traceback (most recent call last): > File "/home/alpha/.virtualenvs/wsgi/lib/python2.7/atexit.py", line 24, in > _run_exitfuncs > func(*targs, **kargs) > File "pyarrow/types.pxi", line 1860, in > pyarrow.lib._unregister_py_extension_type > File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status > pyarrow.lib.ArrowKeyError: 'No type extension with name > arrow.py_extension_type found' > spooler (pid: 22640) annihilated > worker 1 buried after 1 seconds > goodbye to uWSGI.{code} -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (ARROW-6368) [C++] Add RecordBatch projection functionality
[ https://issues.apache.org/jira/browse/ARROW-6368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916790#comment-16916790 ] Wes McKinney commented on ARROW-6368: - You might consider making this general enough to handle type alterations or other operations, too. This could also be addressed later. > [C++] Add RecordBatch projection functionality > -- > > Key: ARROW-6368 > URL: https://issues.apache.org/jira/browse/ARROW-6368 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Benjamin Kietzman >Assignee: Benjamin Kietzman >Priority: Minor > Labels: dataset > > Define classes RecordBatchProjector (which projects from one schema to another, augmenting with null/constant columns where necessary) and a subtype of RecordBatchIterator which projects each batch yielded by a wrapped iterator. -- This message was sent by Atlassian Jira (v8.3.2#803003)
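The projection behavior described above can be pictured with a small Python sketch (names and the dict-of-lists "batch" representation are illustrative only; the real RecordBatchProjector is C++ and operates on Arrow arrays):

```python
def project_batch(batch, target_fields):
    """Project a columnar batch (mapping of column name -> list of values)
    onto target_fields, filling columns absent from the batch with nulls."""
    num_rows = len(next(iter(batch.values()), []))
    return {name: batch.get(name, [None] * num_rows) for name in target_fields}
```

For example, projecting `{"a": [1, 2]}` onto fields `["a", "b"]` yields column `a` unchanged and an all-null column `b` of the same length.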
[jira] [Created] (ARROW-6368) [C++] Add RecordBatch projection functionality
Benjamin Kietzman created ARROW-6368: Summary: [C++] Add RecordBatch projection functionality Key: ARROW-6368 URL: https://issues.apache.org/jira/browse/ARROW-6368 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Benjamin Kietzman Assignee: Benjamin Kietzman Define classes RecordBatchProjector (which projects from one schema to another, augmenting with null/constant columns where necessary) and a subtype of RecordBatchIterator which projects each batch yielded by a wrapped iterator. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Assigned] (ARROW-2769) [Python] Deprecate and rename add_metadata methods
[ https://issues.apache.org/jira/browse/ARROW-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs reassigned ARROW-2769: -- Assignee: Krisztian Szucs > [Python] Deprecate and rename add_metadata methods > -- > > Key: ARROW-2769 > URL: https://issues.apache.org/jira/browse/ARROW-2769 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Minor > Fix For: 0.15.0 > > > Deprecate and replace `pyarrow.Field.add_metadata` (and other likely named > methods) with replace_metadata, set_metadata or with_metadata. Knowing > Spark's immutable API, I would have chosen with_metadata but I guess this is > probably not what the average Python user would expect as naming. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (ARROW-6324) [C++] File system API should expand paths
[ https://issues.apache.org/jira/browse/ARROW-6324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916747#comment-16916747 ] Antoine Pitrou commented on ARROW-6324: --- Some possibilities: * do it implicitly in LocalFileSystem * do it explicitly in a dedicated convenience API * do it explicitly in a generic path conversion layer (that could also do other things, e.g. strip trailing slashes on S3) > [C++] File system API should expand paths > - > > Key: ARROW-6324 > URL: https://issues.apache.org/jira/browse/ARROW-6324 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Neal Richardson >Priority: Minor > Labels: filesystem > > See ARROW-6323 -- This message was sent by Atlassian Jira (v8.3.2#803003)
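For reference, the "explicit convenience API" option resembles what Python's standard library already provides; a minimal sketch of such a path-normalization layer (function name is illustrative, not an Arrow API):

```python
import os.path

def normalize_local_path(path):
    # Expand "~" / "~user" to the home directory, collapse "." and ".."
    # segments, and strip trailing slashes -- the kind of work a generic
    # path conversion layer could do before handing paths to a filesystem.
    return os.path.normpath(os.path.expanduser(path))
```

A call like `normalize_local_path("~/data/")` returns an absolute, slash-free path, so the underlying filesystem implementation never sees `~` or redundant separators.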
[jira] [Commented] (ARROW-5101) [Packaging] Avoid bundling static libraries in Windows conda packages
[ https://issues.apache.org/jira/browse/ARROW-5101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916745#comment-16916745 ] Krisztian Szucs commented on ARROW-5101: This should have been resolved by https://github.com/conda-forge/arrow-cpp-feedstock/commit/d6e21db3f1f1da713194c305a91eb6e4b3b3a1d4 [~pitrou] could you check with version 0.14? > [Packaging] Avoid bundling static libraries in Windows conda packages > - > > Key: ARROW-5101 > URL: https://issues.apache.org/jira/browse/ARROW-5101 > Project: Apache Arrow > Issue Type: Wish > Components: C++, Packaging >Affects Versions: 0.13.0 >Reporter: Antoine Pitrou >Priority: Major > Labels: conda > Fix For: 0.15.0 > > > We're currently bundling static libraries in Windows conda packages. > Unfortunately, it causes these to be quite large: > {code:bash} > $ ls -la ./Library/lib > total 507808 > drwxrwxr-x 4 antoine antoine 4096 avril 3 10:28 . > drwxrwxr-x 5 antoine antoine 4096 avril 3 10:28 .. > -rw-rw-r-- 1 antoine antoine 1507048 avril 1 20:58 arrow.lib > -rw-rw-r-- 1 antoine antoine 76184 avril 1 20:59 arrow_python.lib > -rw-rw-r-- 1 antoine antoine 61323846 avril 1 21:00 arrow_python_static.lib > -rw-rw-r-- 1 antoine antoine 32809 avril 1 21:02 arrow_static.lib > drwxrwxr-x 3 antoine antoine 4096 avril 3 10:28 cmake > -rw-rw-r-- 1 antoine antoine491292 avril 1 21:02 parquet.lib > -rw-rw-r-- 1 antoine antoine 128473780 avril 1 21:03 parquet_static.lib > drwxrwxr-x 2 antoine antoine 4096 avril 3 10:27 pkgconfig > {code} > (see files in https://anaconda.org/conda-forge/arrow-cpp/files ) > We should probably only ship dynamic libraries under Windows, as those are > reasonably small. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Comment Edited] (ARROW-5101) [Packaging] Avoid bundling static libraries in Windows conda packages
[ https://issues.apache.org/jira/browse/ARROW-5101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916745#comment-16916745 ] Krisztian Szucs edited comment on ARROW-5101 at 8/27/19 1:59 PM: - This should have been resolved by https://github.com/conda-forge/arrow-cpp-feedstock/commit/d6e21db3f1f1da713194c305a91eb6e4b3b3a1d4 already [~pitrou] could you check with version 0.14? was (Author: kszucs): This should have been resolved by https://github.com/conda-forge/arrow-cpp-feedstock/commit/d6e21db3f1f1da713194c305a91eb6e4b3b3a1d4 [~pitrou] could you check with version 0.14? > [Packaging] Avoid bundling static libraries in Windows conda packages > - > > Key: ARROW-5101 > URL: https://issues.apache.org/jira/browse/ARROW-5101 > Project: Apache Arrow > Issue Type: Wish > Components: C++, Packaging >Affects Versions: 0.13.0 >Reporter: Antoine Pitrou >Priority: Major > Labels: conda > Fix For: 0.15.0 > > > We're currently bundling static libraries in Windows conda packages. > Unfortunately, it causes these to be quite large: > {code:bash} > $ ls -la ./Library/lib > total 507808 > drwxrwxr-x 4 antoine antoine 4096 avril 3 10:28 . > drwxrwxr-x 5 antoine antoine 4096 avril 3 10:28 .. > -rw-rw-r-- 1 antoine antoine 1507048 avril 1 20:58 arrow.lib > -rw-rw-r-- 1 antoine antoine 76184 avril 1 20:59 arrow_python.lib > -rw-rw-r-- 1 antoine antoine 61323846 avril 1 21:00 arrow_python_static.lib > -rw-rw-r-- 1 antoine antoine 32809 avril 1 21:02 arrow_static.lib > drwxrwxr-x 3 antoine antoine 4096 avril 3 10:28 cmake > -rw-rw-r-- 1 antoine antoine491292 avril 1 21:02 parquet.lib > -rw-rw-r-- 1 antoine antoine 128473780 avril 1 21:03 parquet_static.lib > drwxrwxr-x 2 antoine antoine 4096 avril 3 10:27 pkgconfig > {code} > (see files in https://anaconda.org/conda-forge/arrow-cpp/files ) > We should probably only ship dynamic libraries under Windows, as those are > reasonably small. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Updated] (ARROW-6231) [Python] Consider assigning default column names when reading CSV file and header_rows=0
[ https://issues.apache.org/jira/browse/ARROW-6231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6231: -- Labels: csv pull-request-available (was: csv) > [Python] Consider assigning default column names when reading CSV file and > header_rows=0 > > > Key: ARROW-6231 > URL: https://issues.apache.org/jira/browse/ARROW-6231 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Antoine Pitrou >Priority: Major > Labels: csv, pull-request-available > Fix For: 0.15.0 > > > This is a slight usability rough edge. Assigning default names (like "f0, f1, > ...") would probably be better since then at least you can see how many > columns there are and what is in them. > {code} > In [10]: parse_options = csv.ParseOptions(delimiter='|', header_rows=0) > > > In [11]: %time table = csv.read_csv('Performance_2016Q4.txt', > parse_options=parse_options) > > --- > ArrowInvalid Traceback (most recent call last) > in > ~/miniconda/envs/pyarrow-14-1/lib/python3.7/site-packages/pyarrow/_csv.pyx in > pyarrow._csv.read_csv() > ~/miniconda/envs/pyarrow-14-1/lib/python3.7/site-packages/pyarrow/error.pxi > in pyarrow.lib.check_status() > ArrowInvalid: header_rows == 0 needs explicit column names > {code} > In pandas integers are used, so some kind of default string would have to be > defined > {code} > In [18]: df = pd.read_csv('Performance_2016Q4.txt', sep='|', header=None, > low_memory=False) > > In [19]: df.columns > > > Out[19]: > Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, > 16, > 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30], >dtype='int64') > {code} -- This message was sent by Atlassian Jira (v8.3.2#803003)
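The "f0, f1, ..." naming scheme proposed in the issue is straightforward to sketch (function name is illustrative; the eventual pyarrow option may be spelled differently):

```python
def autogenerate_column_names(num_columns):
    """Produce default string column names in the "f0, f1, ..." style,
    so a headerless CSV still gets named columns rather than an error."""
    return [f"f{i}" for i in range(num_columns)]
```

With four columns this yields `["f0", "f1", "f2", "f3"]`, mirroring how pandas falls back to integer labels when `header=None`.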
[jira] [Assigned] (ARROW-6231) [Python] Consider assigning default column names when reading CSV file and header_rows=0
[ https://issues.apache.org/jira/browse/ARROW-6231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou reassigned ARROW-6231: - Assignee: Antoine Pitrou > [Python] Consider assigning default column names when reading CSV file and > header_rows=0 > > > Key: ARROW-6231 > URL: https://issues.apache.org/jira/browse/ARROW-6231 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Antoine Pitrou >Priority: Major > Labels: csv > Fix For: 0.15.0 > > > This is a slight usability rough edge. Assigning default names (like "f0, f1, > ...") would probably be better since then at least you can see how many > columns there are and what is in them. > {code} > In [10]: parse_options = csv.ParseOptions(delimiter='|', header_rows=0) > > > In [11]: %time table = csv.read_csv('Performance_2016Q4.txt', > parse_options=parse_options) > > --- > ArrowInvalid Traceback (most recent call last) > in > ~/miniconda/envs/pyarrow-14-1/lib/python3.7/site-packages/pyarrow/_csv.pyx in > pyarrow._csv.read_csv() > ~/miniconda/envs/pyarrow-14-1/lib/python3.7/site-packages/pyarrow/error.pxi > in pyarrow.lib.check_status() > ArrowInvalid: header_rows == 0 needs explicit column names > {code} > In pandas integers are used, so some kind of default string would have to be > defined > {code} > In [18]: df = pd.read_csv('Performance_2016Q4.txt', sep='|', header=None, > low_memory=False) > > In [19]: df.columns > > > Out[19]: > Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, > 16, > 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30], >dtype='int64') > {code} -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (ARROW-5960) [C++] Boost dependencies are specified in wrong order
[ https://issues.apache.org/jira/browse/ARROW-5960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916711#comment-16916711 ] Ingo Müller commented on ARROW-5960: OK, here it is: https://github.com/apache/arrow/pull/5205 > [C++] Boost dependencies are specified in wrong order > - > > Key: ARROW-5960 > URL: https://issues.apache.org/jira/browse/ARROW-5960 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.14.0 >Reporter: Ingo Müller >Priority: Minor > > The boost dependencies in cpp/CMakeLists.txt are specified in the wrong > order: the system library currently comes first, followed by the filesystem > library. They should be specified in the opposite order, as filesystem > depends on system. > It seems to depend on the version of boost or how it is compiled whether this > problem becomes apparent. I am currently setting up the project like this: > {code:java} > CXX=clang++-7.0 CC=clang-7.0 \ > cmake \ > -DCMAKE_CXX_STANDARD=17 \ > -DCMAKE_INSTALL_PREFIX=/tmp/arrow4/dist \ > -DCMAKE_INSTALL_LIBDIR=lib \ > -DARROW_WITH_RAPIDJSON=ON \ > -DARROW_PARQUET=ON \ > -DARROW_PYTHON=ON \ > -DARROW_FLIGHT=OFF \ > -DARROW_GANDIVA=OFF \ > -DARROW_BUILD_UTILITIES=OFF \ > -DARROW_CUDA=OFF \ > -DARROW_ORC=OFF \ > -DARROW_JNI=OFF \ > -DARROW_TENSORFLOW=OFF \ > -DARROW_HDFS=OFF \ > -DARROW_BUILD_TESTS=OFF \ > -DARROW_RPATH_ORIGIN=ON \ > ..{code} > After compiling, libarrow.so is missing symbols: > {code:java} > nm -C /dist/lib/libarrow.so | grep boost::system::system_c > U boost::system::system_category(){code} > It seems like this is related to whether or not boost has been compiled with > {{BOOST_SYSTEM_NO_DEPRECATED}} (according to [this > post|https://stackoverflow.com/a/30877725/651937], anyway). I have to say > that I don't understand why boost as BUNDLED should be compiled that way...
> If I apply the following patch, everything works as expected: > > {code:java} > diff -pur a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt > --- a/cpp/CMakeLists.txt 2019-06-29 00:26:37.0 +0200 > +++ b/cpp/CMakeLists.txt 2019-07-16 16:36:03.980153919 +0200 > @@ -642,8 +642,8 @@ if(ARROW_STATIC_LINK_LIBS) > add_dependencies(arrow_dependencies ${ARROW_STATIC_LINK_LIBS}) > endif() > -set(ARROW_SHARED_PRIVATE_LINK_LIBS ${ARROW_STATIC_LINK_LIBS} > ${BOOST_SYSTEM_LIBRARY} > - ${BOOST_FILESYSTEM_LIBRARY} > ${BOOST_REGEX_LIBRARY}) > +set(ARROW_SHARED_PRIVATE_LINK_LIBS ${ARROW_STATIC_LINK_LIBS} > ${BOOST_FILESYSTEM_LIBRARY} > + ${BOOST_SYSTEM_LIBRARY} > ${BOOST_REGEX_LIBRARY}) > list(APPEND ARROW_STATIC_LINK_LIBS ${BOOST_SYSTEM_LIBRARY} > ${BOOST_FILESYSTEM_LIBRARY} > ${BOOST_REGEX_LIBRARY}){code} -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (ARROW-6353) [Python] Allow user to select compression level in pyarrow.parquet.write_table
[ https://issues.apache.org/jira/browse/ARROW-6353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916694#comment-16916694 ] Igor Yastrebov commented on ARROW-6353: --- [~martinradev] You are free to work on it if you want. I'd love to see this feature in 0.15.0 but since I won't do it myself I'm in no position to ask for it. As far as I'm concerned, there are only two levels of priority - blocker and non-blocker - but jira admins can correct it if it is a problem. > [Python] Allow user to select compression level in pyarrow.parquet.write_table > -- > > Key: ARROW-6353 > URL: https://issues.apache.org/jira/browse/ARROW-6353 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Igor Yastrebov >Priority: Major > > This feature was introduced for C++ in > [ARROW-6216|https://issues.apache.org/jira/browse/ARROW-6216]. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Assigned] (ARROW-5830) [C++] Stop using memcmp in TensorEquals
[ https://issues.apache.org/jira/browse/ARROW-5830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou reassigned ARROW-5830: - Assignee: Kenta Murata > [C++] Stop using memcmp in TensorEquals > --- > > Key: ARROW-5830 > URL: https://issues.apache.org/jira/browse/ARROW-5830 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Kenta Murata >Assignee: Kenta Murata >Priority: Major > Labels: beginner, pull-request-available > Fix For: 0.15.0 > > Time Spent: 1h 50m > Remaining Estimate: 0h > > Because memcmp is problematic for comparing floating-point values, such as NaNs. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Resolved] (ARROW-5830) [C++] Stop using memcmp in TensorEquals
[ https://issues.apache.org/jira/browse/ARROW-5830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-5830. --- Fix Version/s: 0.15.0 Resolution: Fixed Issue resolved by pull request 5166 [https://github.com/apache/arrow/pull/5166] > [C++] Stop using memcmp in TensorEquals > --- > > Key: ARROW-5830 > URL: https://issues.apache.org/jira/browse/ARROW-5830 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Kenta Murata >Priority: Major > Labels: beginner, pull-request-available > Fix For: 0.15.0 > > Time Spent: 1h 40m > Remaining Estimate: 0h > > Because memcmp is problematic for comparing floating-point values, such as NaNs. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (ARROW-6358) [C++] FileSystem::DeleteDir should make it optional to delete the directory itself
[ https://issues.apache.org/jira/browse/ARROW-6358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916616#comment-16916616 ] Rok Mihevc commented on ARROW-6358: --- Ah yes, sorry, I missed the bucket case. As an occasional S3 user I would be surprised if Arrow deleted a bucket rather than only its contents. But I can imagine it would be useful to have that option sometimes. > [C++] FileSystem::DeleteDir should make it optional to delete the directory > itself > -- > > Key: ARROW-6358 > URL: https://issues.apache.org/jira/browse/ARROW-6358 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 0.14.1 >Reporter: Antoine Pitrou >Priority: Major > > In some situations, it can be desirable to delete the entirety of a > directory's contents, but not the directory itself (e.g. when it's an S3 > bucket). Perhaps we should add an option for that. -- This message was sent by Atlassian Jira (v8.3.2#803003)
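A rough Python sketch of the "delete the contents but keep the container" behavior discussed here (illustrative only; the actual option would live in the C++ FileSystem API):

```python
import shutil
from pathlib import Path

def delete_dir_contents(directory):
    """Remove every entry inside `directory` without removing the
    directory (or, by analogy, the S3 bucket) itself."""
    for child in Path(directory).iterdir():
        if child.is_dir() and not child.is_symlink():
            shutil.rmtree(child)   # recurse into real subdirectories
        else:
            child.unlink()         # files and symlinks
```

After the call the directory still exists but is empty, which is the surprise-free default the comment argues for.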
[jira] [Updated] (ARROW-6366) [Java] Make field vectors final explicitly
[ https://issues.apache.org/jira/browse/ARROW-6366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6366: -- Labels: pull-request-available (was: ) > [Java] Make field vectors final explicitly > -- > > Key: ARROW-6366 > URL: https://issues.apache.org/jira/browse/ARROW-6366 > Project: Apache Arrow > Issue Type: Improvement > Components: Java >Reporter: Liya Fan >Assignee: Liya Fan >Priority: Major > Labels: pull-request-available > > According to the discussion in > [https://lists.apache.org/thread.html/836d3b87ccb6e65e9edf0f220829a29edfa394fc2cd1e0866007d86e@%3Cdev.arrow.apache.org%3E,] > field vectors should not be extended, so they should be made final > explicitly. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Resolved] (ARROW-6229) [C++] Add a DataSource implementation which scans a directory
[ https://issues.apache.org/jira/browse/ARROW-6229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-6229. --- Fix Version/s: 0.15.0 Resolution: Fixed Issue resolved by pull request 5139 [https://github.com/apache/arrow/pull/5139] > [C++] Add a DataSource implementation which scans a directory > - > > Key: ARROW-6229 > URL: https://issues.apache.org/jira/browse/ARROW-6229 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Benjamin Kietzman >Assignee: Benjamin Kietzman >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 5.5h > Remaining Estimate: 0h > > DirectoryBasedDataSource should scan a directory (optionally recursively) on > construction, yielding FileBasedDataFragments -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Resolved] (ARROW-6363) [R] segfault in Table__from_dots with unexpected schema
[ https://issues.apache.org/jira/browse/ARROW-6363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-6363. --- Resolution: Fixed Issue resolved by pull request 5199 [https://github.com/apache/arrow/pull/5199] > [R] segfault in Table__from_dots with unexpected schema > --- > > Key: ARROW-6363 > URL: https://issues.apache.org/jira/browse/ARROW-6363 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 10m > Remaining Estimate: 0h > > {code:r} > > table(b=1L, schema=c(b = int16())) > *** caught segfault *** > address 0x7fada725aed0, cause 'memory not mapped' > {code} -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Resolved] (ARROW-6338) [R] Type function names don't match type names
[ https://issues.apache.org/jira/browse/ARROW-6338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-6338. --- Resolution: Fixed Issue resolved by pull request 5198 [https://github.com/apache/arrow/pull/5198] > [R] Type function names don't match type names > -- > > Key: ARROW-6338 > URL: https://issues.apache.org/jira/browse/ARROW-6338 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 10m > Remaining Estimate: 0h > > I noticed this while working on documentation for ARROW-5505, trying to show > how you could pass an explicit schema definition to make a table. For a few > types, the name of the type that gets printed (and comes from the C++ > library) doesn't match the name of the function you use to specify the type > in a schema: > {code:r} > > tab <- to_arrow(data.frame( > + a = 1:10, > + b = as.numeric(1:10), > + c = sample(c(TRUE, FALSE, NA), 10, replace = TRUE), > + d = letters[1:10], > + stringsAsFactors = FALSE > + )) > > tab$schema > arrow::Schema > a: int32 > b: double > c: bool > d: string > # Alright, let's make that schema > > schema(a = int32(), b = double(), c = bool(), d = string()) > Error in bool() : could not find function "bool" > # Hmm, ok, so bool --> boolean() > > schema(a = int32(), b = double(), c = boolean(), d = string()) > Error in string() : could not find function "string" > # string --> utf8() > > schema(a = int32(), b = double(), c = boolean(), d = utf8()) > Error: type does not inherit from class arrow::DataType > # Wha? > > double() > numeric(0) > # Oh. double is a base R function. 
> > schema(a = int32(), b = float64(), c = boolean(), d = utf8()) > arrow::Schema > a: int32 > b: double > c: bool > d: string > {code} > If you believe this switch statement is correct, these three, along with > float and half_float, are the only mismatches: > [https://github.com/apache/arrow/blob/master/r/R/R6.R#L81-L109] > {code:r} > > schema(b = float64(), c = boolean(), d = utf8(), e = float32(), f = > > float16()) > arrow::Schema > b: double > c: bool > d: string > e: float > f: halffloat > {code} > I can add aliases (i.e. another function that does the same thing) for bool, > string, float, and halffloat, and I can add some magic so that double() (and > even integer()) work inside the schema() function. But in looking into the > C++ side to confirm where these alternate type names were coming from, I saw > some inconsistencies. For example, > https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.h#L773-L788 > suggests that the StringType should report its name as "utf8". But the > ToString method here > https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.cc#L191 has it > report as "string". It's unclear why those should report differently. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Resolved] (ARROW-6323) [R] Expand file paths when passing to readers
[ https://issues.apache.org/jira/browse/ARROW-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-6323. --- Resolution: Fixed Issue resolved by pull request 5169 [https://github.com/apache/arrow/pull/5169] > [R] Expand file paths when passing to readers > - > > Key: ARROW-6323 > URL: https://issues.apache.org/jira/browse/ARROW-6323 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 1h > Remaining Estimate: 0h > > All file paths in R are wrapped in {{fs::path_abs()}}, which handles relative > paths, but it doesn't expand {{~}}, so this fails: > {code:java} > > df <- read_parquet("~/Downloads/demofile.parquet") > Error in io___MemoryMappedFile__Open(fs::path_abs(path), mode) : > IOError: Failed to open local file '~/Downloads/demofile.parquet', error: > No such file or directory > {code} > This is fixed by using {{fs::path_real()}} instead. > Should this be properly handled in C++ though? cc [~pitrou] -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Updated] (ARROW-6367) [C++][Gandiva] Implement string reverse
[ https://issues.apache.org/jira/browse/ARROW-6367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prudhvi Porandla updated ARROW-6367: Description: Add a {{utf8 reverse(utf8)}} function to Gandiva > [C++][Gandiva] Implement string reverse > --- > > Key: ARROW-6367 > URL: https://issues.apache.org/jira/browse/ARROW-6367 > Project: Apache Arrow > Issue Type: Task >Reporter: Prudhvi Porandla >Assignee: Prudhvi Porandla >Priority: Minor > > Add a {{utf8 reverse(utf8)}} function to Gandiva -- This message was sent by Atlassian Jira (v8.3.2#803003)
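In Python terms the requested behavior amounts to a code-point reversal (a sketch, not the Gandiva implementation, which works on raw UTF-8 buffers where naive byte reversal would corrupt multi-byte sequences):

```python
def utf8_reverse(s):
    # Reverse by code point; reversing the encoded bytes instead would
    # break multi-byte UTF-8 sequences. Note this still ignores grapheme
    # clusters (e.g. combining accents), a common caveat for string reverse.
    return s[::-1]
```

For example, `utf8_reverse("héllo")` gives `"olléh"`, whereas reversing the UTF-8 bytes of `"é"` would produce an invalid byte sequence.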
[jira] [Created] (ARROW-6367) [C++][Gandiva] Implement string reverse
Prudhvi Porandla created ARROW-6367: --- Summary: [C++][Gandiva] Implement string reverse Key: ARROW-6367 URL: https://issues.apache.org/jira/browse/ARROW-6367 Project: Apache Arrow Issue Type: Task Reporter: Prudhvi Porandla Assignee: Prudhvi Porandla -- This message was sent by Atlassian Jira (v8.3.2#803003)