[jira] [Updated] (ARROW-6314) [C++] Implement changes to ensure flatbuffer alignment.
[ https://issues.apache.org/jira/browse/ARROW-6314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-6314:
----------------------------------
    Labels: pull-request-available  (was: )

> [C++] Implement changes to ensure flatbuffer alignment.
> --------------------------------------------------------
>
> Key: ARROW-6314
> URL: https://issues.apache.org/jira/browse/ARROW-6314
> Project: Apache Arrow
> Issue Type: Sub-task
> Components: C++
> Reporter: Micah Kornfield
> Assignee: Wes McKinney
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 0.15.0

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
[jira] [Created] (ARROW-6372) [Rust][DataFusion] Predicate push down optimization can break query plan
Paddy Horan created ARROW-6372:
-------------------------------
Summary: [Rust][DataFusion] Predicate push down optimization can break query plan
Key: ARROW-6372
URL: https://issues.apache.org/jira/browse/ARROW-6372
Project: Apache Arrow
Issue Type: Bug
Components: Rust - DataFusion
Affects Versions: 0.14.1
Reporter: Paddy Horan
Fix For: 0.15.0

The following code reproduces the issue:
https://gist.github.com/paddyhoran/598db6cbb790fc5497320613e54a02c6
If you disable the predicate push down optimization, it works fine.

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
[jira] [Commented] (ARROW-6371) [Doc] Row to columnar conversion example mentions arrow::Column in comments
[ https://issues.apache.org/jira/browse/ARROW-6371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917300#comment-16917300 ]

Wes McKinney commented on ARROW-6371:
-------------------------------------

Thanks, can you submit a PR to fix?

> [Doc] Row to columnar conversion example mentions arrow::Column in comments
> ---------------------------------------------------------------------------
>
> Key: ARROW-6371
> URL: https://issues.apache.org/jira/browse/ARROW-6371
> Project: Apache Arrow
> Issue Type: Bug
> Components: Documentation
> Reporter: Omer Ozarslan
> Priority: Minor
>
> https://arrow.apache.org/docs/cpp/examples/row_columnar_conversion.html
> {code:cpp}
> // The final representation should be an `arrow::Table` which in turn is made up of
> // an `arrow::Schema` and a list of `arrow::Column`. An `arrow::Column` is again a
> // named collection of one or more `arrow::Array` instances. As the first step, we
> // will iterate over the data and build up the arrays incrementally.
> {code}

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
[jira] [Resolved] (ARROW-3829) [Python] Support protocols to extract Arrow objects from third-party classes
[ https://issues.apache.org/jira/browse/ARROW-3829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney resolved ARROW-3829.
---------------------------------
    Resolution: Fixed

Issue resolved by pull request 5106
https://github.com/apache/arrow/pull/5106

> [Python] Support protocols to extract Arrow objects from third-party classes
> ----------------------------------------------------------------------------
>
> Key: ARROW-3829
> URL: https://issues.apache.org/jira/browse/ARROW-3829
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Uwe L. Korn
> Assignee: Joris Van den Bossche
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.15.0
>
> Time Spent: 5h
> Remaining Estimate: 0h
>
> In the style of NumPy's {{__array__}}, we should be able to ask inputs to
> {{pa.array}}, {{pa.Table.from_X}}, ... whether they can convert themselves to
> Arrow objects. This would allow, for example, objects that hold an Arrow
> object internally to expose it directly instead of going through a conversion
> path.
> My current use case involves pandas {{ExtensionArray}} instances that
> internally hold Arrow objects and should be reused when we pass the whole
> {{DataFrame}} to {{pa.Table.from_pandas}}.

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
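The protocol described in the issue works like NumPy's `__array__`: the converter first asks the input whether it can produce an Arrow object itself. A minimal pure-Python sketch of that dispatch, under the assumption that the hook is named `__arrow_array__` as in the issue; the `to_arrow` function and `WrappedArray` class below are illustrative stand-ins, not pyarrow API:

```python
def to_arrow(obj):
    """Convert obj to an Arrow-style value, preferring the object's own
    __arrow_array__ hook (mirrors NumPy's __array__ protocol)."""
    hook = getattr(obj, "__arrow_array__", None)
    if hook is not None:
        # The object already holds (or can cheaply build) Arrow data:
        # reuse it directly instead of round-tripping through Python values.
        return hook()
    raise TypeError("cannot convert %r to an Arrow array" % (obj,))


class WrappedArray:
    """Hypothetical container that stores an Arrow-style array internally,
    e.g. a pandas ExtensionArray backed by Arrow memory."""

    def __init__(self, values):
        self._values = list(values)

    def __arrow_array__(self, type=None):
        # Hand back the internal data without copying through Python objects.
        return self._values
```

With this shape, `pa.array(WrappedArray([1, 2, 3]))`-style calls can short-circuit to the object's own Arrow data.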
[jira] [Commented] (ARROW-6358) [C++] FileSystem::DeleteDir should make it optional to delete the directory itself
[ https://issues.apache.org/jira/browse/ARROW-6358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917191#comment-16917191 ]

Rok Mihevc commented on ARROW-6358:
-----------------------------------

Got it. Thanks!

> [C++] FileSystem::DeleteDir should make it optional to delete the directory itself
> ----------------------------------------------------------------------------------
>
> Key: ARROW-6358
> URL: https://issues.apache.org/jira/browse/ARROW-6358
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Affects Versions: 0.14.1
> Reporter: Antoine Pitrou
> Priority: Major
>
> In some situations, it can be desirable to delete the entirety of a
> directory's contents, but not the directory itself (e.g. when it's a S3
> bucket). Perhaps we should add an option for that.

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
[jira] [Commented] (ARROW-6358) [C++] FileSystem::DeleteDir should make it optional to delete the directory itself
[ https://issues.apache.org/jira/browse/ARROW-6358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917147#comment-16917147 ]

Antoine Pitrou commented on ARROW-6358:
---------------------------------------

This is doable using {{SubTreeFileSystem}}.

> [C++] FileSystem::DeleteDir should make it optional to delete the directory itself
> ----------------------------------------------------------------------------------
>
> Key: ARROW-6358
> URL: https://issues.apache.org/jira/browse/ARROW-6358
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Affects Versions: 0.14.1
> Reporter: Antoine Pitrou
> Priority: Major

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
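The `SubTreeFileSystem` suggestion scopes every operation under a fixed prefix, so deleting the contents of the sub-tree's root removes everything inside the bucket but never the bucket itself. A rough pure-Python sketch of that idea; the `DictFS` and `SubTree` classes below are illustrative toys, not Arrow's C++ API:

```python
class DictFS:
    """Toy flat filesystem: maps path strings to file contents."""

    def __init__(self):
        self.files = {}

    def delete_children(self, prefix):
        # Delete every entry strictly under `prefix`, but not the
        # prefix entry itself.
        self.files = {p: v for p, v in self.files.items()
                      if not p.startswith(prefix + "/")}


class SubTree:
    """Scope all paths under `base`, mimicking the SubTreeFileSystem idea."""

    def __init__(self, fs, base):
        self.fs = fs
        self.base = base.rstrip("/")

    def delete_dir_contents(self, path=""):
        full = self.base if not path else self.base + "/" + path
        # Only the children of `full` go away; `full` itself
        # (e.g. the S3 bucket) stays in place.
        self.fs.delete_children(full)
```

Treating the bucket as the sub-tree root thus gives "delete contents but keep the directory" without a new option on `DeleteDir`.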
[jira] [Commented] (ARROW-6358) [C++] FileSystem::DeleteDir should make it optional to delete the directory itself
[ https://issues.apache.org/jira/browse/ARROW-6358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917142#comment-16917142 ]

Rok Mihevc commented on ARROW-6358:
-----------------------------------

Can you treat the bucket as the root of the filesystem?

> [C++] FileSystem::DeleteDir should make it optional to delete the directory itself
> ----------------------------------------------------------------------------------
>
> Key: ARROW-6358
> URL: https://issues.apache.org/jira/browse/ARROW-6358
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Affects Versions: 0.14.1
> Reporter: Antoine Pitrou
> Priority: Major

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
[jira] [Created] (ARROW-6371) [Doc] Row to columnar conversion example mentions arrow::Column in comments
Omer Ozarslan created ARROW-6371:
---------------------------------
Summary: [Doc] Row to columnar conversion example mentions arrow::Column in comments
Key: ARROW-6371
URL: https://issues.apache.org/jira/browse/ARROW-6371
Project: Apache Arrow
Issue Type: Bug
Components: Documentation
Reporter: Omer Ozarslan

https://arrow.apache.org/docs/cpp/examples/row_columnar_conversion.html

{code:cpp}
// The final representation should be an `arrow::Table` which in turn is made up of
// an `arrow::Schema` and a list of `arrow::Column`. An `arrow::Column` is again a
// named collection of one or more `arrow::Array` instances. As the first step, we
// will iterate over the data and build up the arrays incrementally.
{code}

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
[jira] [Assigned] (ARROW-5960) [C++] Boost dependencies are specified in wrong order
[ https://issues.apache.org/jira/browse/ARROW-5960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoine Pitrou reassigned ARROW-5960:
-------------------------------------
    Assignee: Ingo Müller

> [C++] Boost dependencies are specified in wrong order
> -----------------------------------------------------
>
> Key: ARROW-5960
> URL: https://issues.apache.org/jira/browse/ARROW-5960
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Affects Versions: 0.14.0
> Reporter: Ingo Müller
> Assignee: Ingo Müller
> Priority: Minor
> Labels: pull-request-available
> Fix For: 0.15.0
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> The Boost dependencies in cpp/CMakeLists.txt are specified in the wrong
> order: the system library currently comes first, followed by the filesystem
> library. They should be specified in the opposite order, as filesystem
> depends on system.
> Whether this problem becomes apparent seems to depend on the Boost version
> or how it was compiled. I am currently configuring the project like this:
> {code:java}
> CXX=clang++-7.0 CC=clang-7.0 \
> cmake \
>   -DCMAKE_CXX_STANDARD=17 \
>   -DCMAKE_INSTALL_PREFIX=/tmp/arrow4/dist \
>   -DCMAKE_INSTALL_LIBDIR=lib \
>   -DARROW_WITH_RAPIDJSON=ON \
>   -DARROW_PARQUET=ON \
>   -DARROW_PYTHON=ON \
>   -DARROW_FLIGHT=OFF \
>   -DARROW_GANDIVA=OFF \
>   -DARROW_BUILD_UTILITIES=OFF \
>   -DARROW_CUDA=OFF \
>   -DARROW_ORC=OFF \
>   -DARROW_JNI=OFF \
>   -DARROW_TENSORFLOW=OFF \
>   -DARROW_HDFS=OFF \
>   -DARROW_BUILD_TESTS=OFF \
>   -DARROW_RPATH_ORIGIN=ON \
>   ..{code}
> After compiling, libarrow.so is missing symbols:
> {code:java}
> nm -C /dist/lib/libarrow.so | grep boost::system::system_c
> U boost::system::system_category(){code}
> This seems to be related to whether or not Boost has been compiled with
> {{BOOST_SYSTEM_NO_DEPRECATED}} (according to [this
> post|https://stackoverflow.com/a/30877725/651937], anyway). I have to say
> that I don't understand why Boost as BUNDLED should be compiled that way...
>
> If I apply the following patch, everything works as expected:
>
> {code:java}
> diff -pur a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt
> --- a/cpp/CMakeLists.txt 2019-06-29 00:26:37.0 +0200
> +++ b/cpp/CMakeLists.txt 2019-07-16 16:36:03.980153919 +0200
> @@ -642,8 +642,8 @@ if(ARROW_STATIC_LINK_LIBS)
>    add_dependencies(arrow_dependencies ${ARROW_STATIC_LINK_LIBS})
>  endif()
> -set(ARROW_SHARED_PRIVATE_LINK_LIBS ${ARROW_STATIC_LINK_LIBS} ${BOOST_SYSTEM_LIBRARY}
> -                                   ${BOOST_FILESYSTEM_LIBRARY} ${BOOST_REGEX_LIBRARY})
> +set(ARROW_SHARED_PRIVATE_LINK_LIBS ${ARROW_STATIC_LINK_LIBS} ${BOOST_FILESYSTEM_LIBRARY}
> +                                   ${BOOST_SYSTEM_LIBRARY} ${BOOST_REGEX_LIBRARY})
>  list(APPEND ARROW_STATIC_LINK_LIBS ${BOOST_SYSTEM_LIBRARY} ${BOOST_FILESYSTEM_LIBRARY}
>              ${BOOST_REGEX_LIBRARY}){code}

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
[jira] [Resolved] (ARROW-5960) [C++] Boost dependencies are specified in wrong order
[ https://issues.apache.org/jira/browse/ARROW-5960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoine Pitrou resolved ARROW-5960.
-----------------------------------
    Fix Version/s: 0.15.0
       Resolution: Fixed

Issue resolved by pull request 5205
https://github.com/apache/arrow/pull/5205

> [C++] Boost dependencies are specified in wrong order
> -----------------------------------------------------
>
> Key: ARROW-5960
> URL: https://issues.apache.org/jira/browse/ARROW-5960
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Affects Versions: 0.14.0
> Reporter: Ingo Müller
> Priority: Minor
> Fix For: 0.15.0

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
[jira] [Updated] (ARROW-5960) [C++] Boost dependencies are specified in wrong order
[ https://issues.apache.org/jira/browse/ARROW-5960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-5960:
----------------------------------
    Labels: pull-request-available  (was: )

> [C++] Boost dependencies are specified in wrong order
> -----------------------------------------------------
>
> Key: ARROW-5960
> URL: https://issues.apache.org/jira/browse/ARROW-5960
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Affects Versions: 0.14.0
> Reporter: Ingo Müller
> Priority: Minor
> Labels: pull-request-available
> Fix For: 0.15.0

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
[jira] [Updated] (ARROW-6242) [C++] Implements basic Dataset/Scanner/ScannerBuilder
[ https://issues.apache.org/jira/browse/ARROW-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-6242:
----------------------------------
    Labels: dataset pull-request-available  (was: dataset)

> [C++] Implements basic Dataset/Scanner/ScannerBuilder
> -----------------------------------------------------
>
> Key: ARROW-6242
> URL: https://issues.apache.org/jira/browse/ARROW-6242
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++
> Reporter: Francois Saint-Jacques
> Assignee: Francois Saint-Jacques
> Priority: Major
> Labels: dataset, pull-request-available
>
> The goal of this would be to iterate over a Dataset and generate a
> "flattened" stream of RecordBatches from the union of data sources and data
> fragments. This should not bother with filtering yet.

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
[jira] [Resolved] (ARROW-6364) [R] Handling unexpected input to time64() et al
[ https://issues.apache.org/jira/browse/ARROW-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoine Pitrou resolved ARROW-6364.
-----------------------------------
    Resolution: Fixed

Issue resolved by pull request 5201
https://github.com/apache/arrow/pull/5201

> [R] Handling unexpected input to time64() et al
> -----------------------------------------------
>
> Key: ARROW-6364
> URL: https://issues.apache.org/jira/browse/ARROW-6364
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Reporter: Neal Richardson
> Assignee: Neal Richardson
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.15.0
>
> Time Spent: 1h 10m
> Remaining Estimate: 0h
>
> {code:r}
> > time64()
> Error in Time64__initialize(unit) :
>   argument "unit" is missing, with no default
> > time64("ms")
> Error in Time64__initialize(unit) :
>   Not compatible with requested type: [type=character; target=integer].
> > time64(1)
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> F0826 11:13:34.657388 162407872 type.cc:234] Check failed: unit == TimeUnit::MICRO || unit == TimeUnit::NANO Must be microseconds or nanoseconds
> *** Check failure stack trace: ***
> Abort trap: 6
> > time64(1L)
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> F0826 11:14:09.445202 251229632 type.cc:234] Check failed: unit == TimeUnit::MICRO || unit == TimeUnit::NANO Must be microseconds or nanoseconds
> *** Check failure stack trace: ***
> Abort trap: 6
> > time64("MILLI")
> Error in Time64__initialize(unit) :
>   Not compatible with requested type: [type=character; target=integer].
> > time64(TimeUnit$MILLI)
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> F0826 11:15:12.047847 361547200 type.cc:234] Check failed: unit == TimeUnit::MICRO || unit == TimeUnit::NANO Must be microseconds or nanoseconds
> *** Check failure stack trace: ***
> Abort trap: 6
> {code}

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
[jira] [Created] (ARROW-6370) [JS] Table.from adds 0 on int columns
Sascha Hofmann created ARROW-6370:
----------------------------------
Summary: [JS] Table.from adds 0 on int columns
Key: ARROW-6370
URL: https://issues.apache.org/jira/browse/ARROW-6370
Project: Apache Arrow
Issue Type: Bug
Components: JavaScript
Affects Versions: 0.14.1
Reporter: Sascha Hofmann

I am generating an Arrow table in pyarrow and sending it via gRPC like this:

{code:java}
sink = pa.BufferOutputStream()
writer = pa.RecordBatchStreamWriter(sink, batch.schema)
writer.write_batch(batch)
writer.close()
yield ds.Response(
    status=200,
    loading=False,
    response=[sink.getvalue().to_pybytes()]
)
{code}

On the JavaScript end, I parse it like this:

{code:java}
Table.from(response.getResponseList()[0])
{code}

That works, but when I look at the actual table, int columns have a 0 in every other row. String columns seem to be parsed just fine. The Python byte array created by to_pybytes() has the same length as the one received in JavaScript. I am also able to recreate the original table from the byte array in Python.

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
[jira] [Commented] (ARROW-6368) [C++] Add RecordBatch projection functionality
[ https://issues.apache.org/jira/browse/ARROW-6368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916880#comment-16916880 ]

Benjamin Kietzman commented on ARROW-6368:
------------------------------------------

If the operation is potentially expensive then we might want to avoid hiding it in the projector; in the case of augmenting as described above, the columns to augment can be generated once and then reused for each yielded batch.

> [C++] Add RecordBatch projection functionality
> ----------------------------------------------
>
> Key: ARROW-6368
> URL: https://issues.apache.org/jira/browse/ARROW-6368
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Benjamin Kietzman
> Assignee: Benjamin Kietzman
> Priority: Minor
> Labels: dataset
>
> Define classes RecordBatchProjector (which projects from one schema to
> another, augmenting with null/constant columns where necessary) and a subtype
> of RecordBatchIterator which projects each batch yielded by a wrapped
> iterator.

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
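The point in the comment is that any constant or null columns used to augment a batch can be built once, when the projector is constructed, and then reused for every batch the wrapped iterator yields. A hypothetical Python sketch of that shape, with plain dicts standing in for RecordBatch and illustrative names (`Projector`, `project_iter`) that are not Arrow API:

```python
class Projector:
    """Project dict-based 'batches' onto a target column list, filling
    missing columns with a value precomputed at construction time."""

    def __init__(self, out_columns, fill=None):
        self.out_columns = list(out_columns)
        # A potentially expensive augmentation value (e.g. a constant or
        # all-null column) is generated once here, not per batch.
        self.fill = fill

    def project(self, batch):
        # Reorder/select the batch's columns; absent ones get the
        # precomputed fill value.
        return {name: batch.get(name, self.fill) for name in self.out_columns}


def project_iter(projector, batches):
    """Wrap an iterator of batches, projecting each yielded batch."""
    for batch in batches:
        yield projector.project(batch)
```

This mirrors the RecordBatchProjector/projecting-iterator split described in the issue: the projector owns the (possibly expensive) schema mapping, and the iterator subtype only applies it per batch.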
[jira] [Commented] (ARROW-5101) [Packaging] Avoid bundling static libraries in Windows conda packages
[ https://issues.apache.org/jira/browse/ARROW-5101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916825#comment-16916825 ]

Wes McKinney commented on ARROW-5101:
-------------------------------------

[~kszucs] can you check? Do you have a Windows VM?

> [Packaging] Avoid bundling static libraries in Windows conda packages
> ---------------------------------------------------------------------
>
> Key: ARROW-5101
> URL: https://issues.apache.org/jira/browse/ARROW-5101
> Project: Apache Arrow
> Issue Type: Wish
> Components: C++, Packaging
> Affects Versions: 0.13.0
> Reporter: Antoine Pitrou
> Priority: Major
> Labels: conda
> Fix For: 0.15.0
>
> We're currently bundling static libraries in Windows conda packages.
> Unfortunately, it causes these to be quite large:
> {code:bash}
> $ ls -la ./Library/lib
> total 507808
> drwxrwxr-x 4 antoine antoine      4096 avril  3 10:28 .
> drwxrwxr-x 5 antoine antoine      4096 avril  3 10:28 ..
> -rw-rw-r-- 1 antoine antoine   1507048 avril  1 20:58 arrow.lib
> -rw-rw-r-- 1 antoine antoine     76184 avril  1 20:59 arrow_python.lib
> -rw-rw-r-- 1 antoine antoine  61323846 avril  1 21:00 arrow_python_static.lib
> -rw-rw-r-- 1 antoine antoine     32809 avril  1 21:02 arrow_static.lib
> drwxrwxr-x 3 antoine antoine      4096 avril  3 10:28 cmake
> -rw-rw-r-- 1 antoine antoine    491292 avril  1 21:02 parquet.lib
> -rw-rw-r-- 1 antoine antoine 128473780 avril  1 21:03 parquet_static.lib
> drwxrwxr-x 2 antoine antoine      4096 avril  3 10:27 pkgconfig
> {code}
> (see files in https://anaconda.org/conda-forge/arrow-cpp/files )
> We should probably only ship dynamic libraries under Windows, as those are
> reasonably small.

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
[jira] [Commented] (ARROW-6358) [C++] FileSystem::DeleteDir should make it optional to delete the directory itself
[ https://issues.apache.org/jira/browse/ARROW-6358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916821#comment-16916821 ]

Antoine Pitrou commented on ARROW-6358:
---------------------------------------

I don't think generic code should have to bother about the notion of "bucket", though. I'll have to think about it more.

> [C++] FileSystem::DeleteDir should make it optional to delete the directory itself
> ----------------------------------------------------------------------------------
>
> Key: ARROW-6358
> URL: https://issues.apache.org/jira/browse/ARROW-6358
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Affects Versions: 0.14.1
> Reporter: Antoine Pitrou
> Priority: Major

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
[jira] [Created] (ARROW-6369) [Python] Support list-of-boolean in Array.to_pandas conversion
Wes McKinney created ARROW-6369:
--------------------------------
Summary: [Python] Support list-of-boolean in Array.to_pandas conversion
Key: ARROW-6369
URL: https://issues.apache.org/jira/browse/ARROW-6369
Project: Apache Arrow
Issue Type: Bug
Components: Python
Reporter: Wes McKinney
Fix For: 0.15.0

See

{code}
In [4]: paste
a = pa.array(np.array([[True, False], [True, True, True]]))
## -- End pasted text --

In [5]: a
Out[5]:
[
  [
    true,
    false
  ],
  [
    true,
    true,
    true
  ]
]

In [6]: a.to_pandas()
---------------------------------------------------------------------------
ArrowNotImplementedError                  Traceback (most recent call last)
in
----> 1 a.to_pandas()

~/code/arrow/python/pyarrow/array.pxi in pyarrow.lib._PandasConvertible.to_pandas()
    439                           deduplicate_objects=deduplicate_objects)
    440
--> 441         return self._to_pandas(options, categories=categories,
    442                                ignore_metadata=ignore_metadata)
    443

~/code/arrow/python/pyarrow/array.pxi in pyarrow.lib.Array._to_pandas()
    815
    816         with nogil:
--> 817             check_status(ConvertArrayToPandas(c_options, self.sp_array,
    818                                               self, ))
    819         return wrap_array_output(out)

~/code/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()
     84         raise ArrowKeyError(message)
     85     elif status.IsNotImplemented():
---> 86         raise ArrowNotImplementedError(message)
     87     elif status.IsTypeError():
     88         raise ArrowTypeError(message)

ArrowNotImplementedError: Not implemented type for lists: bool
In ../src/arrow/python/arrow_to_pandas.cc, line 1910, code: VisitTypeInline(*data_->type(), this)
{code}

as reported in https://github.com/apache/arrow/issues/5203

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
[jira] [Commented] (ARROW-6358) [C++] FileSystem::DeleteDir should make it optional to delete the directory itself
[ https://issues.apache.org/jira/browse/ARROW-6358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916819#comment-16916819 ]

Wes McKinney commented on ARROW-6358:
-------------------------------------

So another way to "skin this cat" is simply to not delete buckets in any of the generic filesystem APIs, and instead add a "DeleteBucket" function that is S3-specific. Not sure how challenging this would be to incorporate into the testing regimen.

> [C++] FileSystem::DeleteDir should make it optional to delete the directory itself
> ----------------------------------------------------------------------------------
>
> Key: ARROW-6358
> URL: https://issues.apache.org/jira/browse/ARROW-6358
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Affects Versions: 0.14.1
> Reporter: Antoine Pitrou
> Priority: Major

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
[jira] [Commented] (ARROW-6301) [Python] atexit: pyarrow.lib.ArrowKeyError: 'No type extension with name arrow.py_extension_type found'
[ https://issues.apache.org/jira/browse/ARROW-6301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916814#comment-16916814 ]

Wes McKinney commented on ARROW-6301:
-------------------------------------

[~klichukb] do you still have the segfault on the master branch?

> [Python] atexit: pyarrow.lib.ArrowKeyError: 'No type extension with name arrow.py_extension_type found'
> -------------------------------------------------------------------------------------------------------
>
> Key: ARROW-6301
> URL: https://issues.apache.org/jira/browse/ARROW-6301
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.14.1
> Environment: linux, virtualenv, uwsgi, cpython 2.7
> Reporter: David Alphus
> Assignee: Wes McKinney
> Priority: Minor
> Labels: pull-request-available
> Fix For: 0.15.0
>
> Time Spent: 40m
> Remaining Estimate: 0h
>
> On interrupt, I am frequently seeing the atexit function failing in pyarrow 0.14.1.
> {code:java}
> ^CSIGINT/SIGQUIT received...killing workers...
> killing the spooler with pid 22640
> Error in atexit._run_exitfuncs:
> Traceback (most recent call last):
>   File "/home/alpha/.virtualenvs/wsgi/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
>     func(*targs, **kargs)
>   File "pyarrow/types.pxi", line 1860, in pyarrow.lib._unregister_py_extension_type
>     check_status(UnregisterPyExtensionType())
>   File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
>     raise ArrowKeyError(message)
> ArrowKeyError: 'No type extension with name arrow.py_extension_type found'
> Error in sys.exitfunc:
> Traceback (most recent call last):
>   File "/home/alpha/.virtualenvs/wsgi/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
>     func(*targs, **kargs)
>   File "pyarrow/types.pxi", line 1860, in pyarrow.lib._unregister_py_extension_type
>   File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
> pyarrow.lib.ArrowKeyError: 'No type extension with name arrow.py_extension_type found'
> spooler (pid: 22640) annihilated
> worker 1 buried after 1 seconds
> goodbye to uWSGI.{code}

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
[jira] [Commented] (ARROW-5131) [Python] Add Azure Datalake Filesystem Gen1 Wrapper for pyarrow
[ https://issues.apache.org/jira/browse/ARROW-5131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916809#comment-16916809 ]

Gregory Hayes commented on ARROW-5131:
--------------------------------------

Thanks. We can close this, knowing that's the strategic direction.

> [Python] Add Azure Datalake Filesystem Gen1 Wrapper for pyarrow
> ---------------------------------------------------------------
>
> Key: ARROW-5131
> URL: https://issues.apache.org/jira/browse/ARROW-5131
> Project: Apache Arrow
> Issue Type: Wish
> Components: Python
> Affects Versions: 0.12.1
> Reporter: Gregory Hayes
> Priority: Minor
> Labels: pull-request-available
>
> Time Spent: 5.5h
> Remaining Estimate: 0h
>
> The current pyarrow package can only read parquet files that have been
> written to Gen1 Azure Datalake using the fastparquet engine. This only works
> if the dask-adlfs package is explicitly installed and imported. I've added a
> method to the dask-adlfs package, found
> [here|https://github.com/dask/dask-adlfs], and issued a PR for that change.
> To support this capability, I added an ADLFSWrapper to the filesystem.py file.

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
[jira] [Commented] (ARROW-6058) [Python][Parquet] Failure when reading Parquet file from S3 with s3fs
[ https://issues.apache.org/jira/browse/ARROW-6058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916807#comment-16916807 ]

Wes McKinney commented on ARROW-6058:
-------------------------------------

Seems this can be fixed now by upgrading to {{fsspec=0.4.2}}

> [Python][Parquet] Failure when reading Parquet file from S3 with s3fs
> ---------------------------------------------------------------------
>
> Key: ARROW-6058
> URL: https://issues.apache.org/jira/browse/ARROW-6058
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Affects Versions: 0.14.1
> Reporter: Siddharth
> Assignee: Wes McKinney
> Priority: Major
> Labels: parquet, pull-request-available
> Fix For: 0.15.0
>
> Time Spent: 1h
> Remaining Estimate: 0h
>
> I am reading parquet data from S3 and get an ArrowIOError.
> Size of the data: 32 part files, 90 MB each (3 GB approx)
> Number of records: approx 100M
> Code snippet:
> {code:java}
> from s3fs import S3FileSystem
> import pyarrow.parquet as pq
> s3 = S3FileSystem()
> dataset = pq.ParquetDataset("s3://location", filesystem=s3)
> df = dataset.read_pandas().to_pandas()
> {code}
> Stack trace:
> {code:java}
> df = dataset.read_pandas().to_pandas()
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 1113, in read_pandas
>   return self.read(use_pandas_metadata=True, **kwargs)
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 1085, in read
>   use_pandas_metadata=use_pandas_metadata)
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 583, in read
>   table = reader.read(**options)
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 216, in read
>   use_threads=use_threads)
> File "pyarrow/_parquet.pyx", line 1086, in pyarrow._parquet.ParquetReader.read_all
> File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: Unexpected end of stream: Page was smaller (197092) than expected (263929)
> {code}
>
> *Note: the same code works on a relatively smaller dataset (approx < 50M records)*

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
[jira] [Comment Edited] (ARROW-6058) [Python][Parquet] Failure when reading Parquet file from S3 with s3fs
[ https://issues.apache.org/jira/browse/ARROW-6058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916807#comment-16916807 ]

Wes McKinney edited comment on ARROW-6058 at 8/27/19 3:28 PM:
--------------------------------------------------------------

Seems this can be fixed now by upgrading to {{fsspec==0.4.2}}

was (Author: wesmckinn):
Seems this can be fixed now by upgrading to {{fsspec=0.4.2}}

> [Python][Parquet] Failure when reading Parquet file from S3 with s3fs
> ---------------------------------------------------------------------
>
> Key: ARROW-6058
> URL: https://issues.apache.org/jira/browse/ARROW-6058
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Affects Versions: 0.14.1
> Reporter: Siddharth
> Assignee: Wes McKinney
> Priority: Major
> Labels: parquet, pull-request-available
> Fix For: 0.15.0

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
[jira] [Commented] (ARROW-6301) [Python] atexit: pyarrow.lib.ArrowKeyError: 'No type extension with name arrow.py_extension_type found'
[ https://issues.apache.org/jira/browse/ARROW-6301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916799#comment-16916799 ] Antoine Pitrou commented on ARROW-6301: --- [~klichukb] not to my knowledge, can you open a new issue? > [Python] atexit: pyarrow.lib.ArrowKeyError: 'No type extension with name > arrow.py_extension_type found' > --- > > Key: ARROW-6301 > URL: https://issues.apache.org/jira/browse/ARROW-6301 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.14.1 > Environment: linux, virtualenv, uwsgi, cpython 2.7 >Reporter: David Alphus >Assignee: Wes McKinney >Priority: Minor > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 40m > Remaining Estimate: 0h > > On interrupt, I am frequently seeing the atexit function failing in pyarrow > 0.14.1. > {code:java} > ^CSIGINT/SIGQUIT received...killing workers... > killing the spooler with pid 22640 > Error in atexit._run_exitfuncs: > Traceback (most recent call last): > File "/home/alpha/.virtualenvs/wsgi/lib/python2.7/atexit.py", line 24, in > _run_exitfuncs > func(*targs, **kargs) > File "pyarrow/types.pxi", line 1860, in > pyarrow.lib._unregister_py_extension_type > check_status(UnregisterPyExtensionType()) > File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status > raise ArrowKeyError(message) > ArrowKeyError: 'No type extension with name arrow.py_extension_type found' > Error in sys.exitfunc: > Traceback (most recent call last): > File "/home/alpha/.virtualenvs/wsgi/lib/python2.7/atexit.py", line 24, in > _run_exitfuncs > func(*targs, **kargs) > File "pyarrow/types.pxi", line 1860, in > pyarrow.lib._unregister_py_extension_type > File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status > pyarrow.lib.ArrowKeyError: 'No type extension with name > arrow.py_extension_type found' > spooler (pid: 22640) annihilated > worker 1 buried after 1 seconds > goodbye to uWSGI.{code} -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (ARROW-6301) [Python] atexit: pyarrow.lib.ArrowKeyError: 'No type extension with name arrow.py_extension_type found'
[ https://issues.apache.org/jira/browse/ARROW-6301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916798#comment-16916798 ] Bogdan Klichuk commented on ARROW-6301: --- Bumping this thread with a related segfault to the one [~david.alphus] saw during uWSGI atexit. I have a custom atexit handler for uWSGI graceful shutdown which uses pyarrow code, and it segfaults. Has an issue been created for this? > [Python] atexit: pyarrow.lib.ArrowKeyError: 'No type extension with name > arrow.py_extension_type found' > --- > > Key: ARROW-6301 > URL: https://issues.apache.org/jira/browse/ARROW-6301 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.14.1 > Environment: linux, virtualenv, uwsgi, cpython 2.7 >Reporter: David Alphus >Assignee: Wes McKinney >Priority: Minor > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 40m > Remaining Estimate: 0h > > On interrupt, I am frequently seeing the atexit function failing in pyarrow > 0.14.1. > {code:java} > ^CSIGINT/SIGQUIT received...killing workers...
> killing the spooler with pid 22640 > Error in atexit._run_exitfuncs: > Traceback (most recent call last): > File "/home/alpha/.virtualenvs/wsgi/lib/python2.7/atexit.py", line 24, in > _run_exitfuncs > func(*targs, **kargs) > File "pyarrow/types.pxi", line 1860, in > pyarrow.lib._unregister_py_extension_type > check_status(UnregisterPyExtensionType()) > File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status > raise ArrowKeyError(message) > ArrowKeyError: 'No type extension with name arrow.py_extension_type found' > Error in sys.exitfunc: > Traceback (most recent call last): > File "/home/alpha/.virtualenvs/wsgi/lib/python2.7/atexit.py", line 24, in > _run_exitfuncs > func(*targs, **kargs) > File "pyarrow/types.pxi", line 1860, in > pyarrow.lib._unregister_py_extension_type > File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status > pyarrow.lib.ArrowKeyError: 'No type extension with name > arrow.py_extension_type found' > spooler (pid: 22640) annihilated > worker 1 buried after 1 seconds > goodbye to uWSGI.{code} -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (ARROW-6368) [C++] Add RecordBatch projection functionality
[ https://issues.apache.org/jira/browse/ARROW-6368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916790#comment-16916790 ] Wes McKinney commented on ARROW-6368: - You might consider making this general enough to handle type alterations or other operations, too. This could also be addressed later. > [C++] Add RecordBatch projection functionality > -- > > Key: ARROW-6368 > URL: https://issues.apache.org/jira/browse/ARROW-6368 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Benjamin Kietzman >Assignee: Benjamin Kietzman >Priority: Minor > Labels: dataset > > Define classes RecordBatchProjector (which projects from one schema to another, augmenting with null/constant columns where necessary) and a subtype of RecordBatchIterator which projects each batch yielded by a wrapped iterator. -- This message was sent by Atlassian Jira (v8.3.2#803003)
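The projection behavior described above can be pictured with a small Python sketch (names and the dict-of-lists "batch" representation are illustrative only; the real RecordBatchProjector is C++ and operates on Arrow arrays):

```python
def project_batch(batch, target_fields):
    """Project a columnar batch (mapping of column name -> list of values)
    onto target_fields, filling columns absent from the batch with nulls."""
    num_rows = len(next(iter(batch.values()), []))
    return {name: batch.get(name, [None] * num_rows) for name in target_fields}
```

For example, projecting `{"a": [1, 2]}` onto fields `["a", "b"]` yields column `a` unchanged and an all-null column `b` of the same length.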
[jira] [Created] (ARROW-6368) [C++] Add RecordBatch projection functionality
Benjamin Kietzman created ARROW-6368: Summary: [C++] Add RecordBatch projection functionality Key: ARROW-6368 URL: https://issues.apache.org/jira/browse/ARROW-6368 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Benjamin Kietzman Assignee: Benjamin Kietzman Define classes RecordBatchProjector (which projects from one schema to another, augmenting with null/constant columns where necessary) and a subtype of RecordBatchIterator which projects each batch yielded by a wrapped iterator. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Assigned] (ARROW-2769) [Python] Deprecate and rename add_metadata methods
[ https://issues.apache.org/jira/browse/ARROW-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs reassigned ARROW-2769: -- Assignee: Krisztian Szucs > [Python] Deprecate and rename add_metadata methods > -- > > Key: ARROW-2769 > URL: https://issues.apache.org/jira/browse/ARROW-2769 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Minor > Fix For: 0.15.0 > > > Deprecate and replace `pyarrow.Field.add_metadata` (and other likely named > methods) with replace_metadata, set_metadata or with_metadata. Knowing > Spark's immutable API, I would have chosen with_metadata but I guess this is > probably not what the average Python user would expect as naming. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (ARROW-6324) [C++] File system API should expand paths
[ https://issues.apache.org/jira/browse/ARROW-6324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916747#comment-16916747 ] Antoine Pitrou commented on ARROW-6324: --- Some possibilities: * do it implicitly in LocalFileSystem * do it explicitly in a dedicated convenience API * do it explicitly in a generic path conversion layer (that could also do other things, e.g. strip trailing slashes on S3) > [C++] File system API should expand paths > - > > Key: ARROW-6324 > URL: https://issues.apache.org/jira/browse/ARROW-6324 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Neal Richardson >Priority: Minor > Labels: filesystem > > See ARROW-6323 -- This message was sent by Atlassian Jira (v8.3.2#803003)
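For reference, the "explicit convenience API" option resembles what Python's standard library already provides; a minimal sketch of such a path-normalization layer (function name is illustrative, not an Arrow API):

```python
import os.path

def normalize_local_path(path):
    # Expand "~" / "~user" to the home directory, collapse "." and ".."
    # segments, and strip trailing slashes -- the kind of work a generic
    # path conversion layer could do before handing paths to a filesystem.
    return os.path.normpath(os.path.expanduser(path))
```

A call like `normalize_local_path("~/data/")` returns an absolute, slash-free path, so the underlying filesystem implementation never sees `~` or redundant separators.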
[jira] [Commented] (ARROW-5101) [Packaging] Avoid bundling static libraries in Windows conda packages
[ https://issues.apache.org/jira/browse/ARROW-5101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916745#comment-16916745 ] Krisztian Szucs commented on ARROW-5101: This should have been resolved by https://github.com/conda-forge/arrow-cpp-feedstock/commit/d6e21db3f1f1da713194c305a91eb6e4b3b3a1d4 [~pitrou] could you check with version 0.14? > [Packaging] Avoid bundling static libraries in Windows conda packages > - > > Key: ARROW-5101 > URL: https://issues.apache.org/jira/browse/ARROW-5101 > Project: Apache Arrow > Issue Type: Wish > Components: C++, Packaging >Affects Versions: 0.13.0 >Reporter: Antoine Pitrou >Priority: Major > Labels: conda > Fix For: 0.15.0 > > > We're currently bundling static libraries in Windows conda packages. > Unfortunately, it causes these to be quite large: > {code:bash} > $ ls -la ./Library/lib > total 507808 > drwxrwxr-x 4 antoine antoine 4096 avril 3 10:28 . > drwxrwxr-x 5 antoine antoine 4096 avril 3 10:28 .. > -rw-rw-r-- 1 antoine antoine 1507048 avril 1 20:58 arrow.lib > -rw-rw-r-- 1 antoine antoine 76184 avril 1 20:59 arrow_python.lib > -rw-rw-r-- 1 antoine antoine 61323846 avril 1 21:00 arrow_python_static.lib > -rw-rw-r-- 1 antoine antoine 32809 avril 1 21:02 arrow_static.lib > drwxrwxr-x 3 antoine antoine 4096 avril 3 10:28 cmake > -rw-rw-r-- 1 antoine antoine491292 avril 1 21:02 parquet.lib > -rw-rw-r-- 1 antoine antoine 128473780 avril 1 21:03 parquet_static.lib > drwxrwxr-x 2 antoine antoine 4096 avril 3 10:27 pkgconfig > {code} > (see files in https://anaconda.org/conda-forge/arrow-cpp/files ) > We should probably only ship dynamic libraries under Windows, as those are > reasonably small. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Comment Edited] (ARROW-5101) [Packaging] Avoid bundling static libraries in Windows conda packages
[ https://issues.apache.org/jira/browse/ARROW-5101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916745#comment-16916745 ] Krisztian Szucs edited comment on ARROW-5101 at 8/27/19 1:59 PM: - This should have been resolved by https://github.com/conda-forge/arrow-cpp-feedstock/commit/d6e21db3f1f1da713194c305a91eb6e4b3b3a1d4 already [~pitrou] could you check with version 0.14? was (Author: kszucs): This should have been resolved by https://github.com/conda-forge/arrow-cpp-feedstock/commit/d6e21db3f1f1da713194c305a91eb6e4b3b3a1d4 [~pitrou] could you check with version 0.14? > [Packaging] Avoid bundling static libraries in Windows conda packages > - > > Key: ARROW-5101 > URL: https://issues.apache.org/jira/browse/ARROW-5101 > Project: Apache Arrow > Issue Type: Wish > Components: C++, Packaging >Affects Versions: 0.13.0 >Reporter: Antoine Pitrou >Priority: Major > Labels: conda > Fix For: 0.15.0 > > > We're currently bundling static libraries in Windows conda packages. > Unfortunately, it causes these to be quite large: > {code:bash} > $ ls -la ./Library/lib > total 507808 > drwxrwxr-x 4 antoine antoine 4096 avril 3 10:28 . > drwxrwxr-x 5 antoine antoine 4096 avril 3 10:28 .. > -rw-rw-r-- 1 antoine antoine 1507048 avril 1 20:58 arrow.lib > -rw-rw-r-- 1 antoine antoine 76184 avril 1 20:59 arrow_python.lib > -rw-rw-r-- 1 antoine antoine 61323846 avril 1 21:00 arrow_python_static.lib > -rw-rw-r-- 1 antoine antoine 32809 avril 1 21:02 arrow_static.lib > drwxrwxr-x 3 antoine antoine 4096 avril 3 10:28 cmake > -rw-rw-r-- 1 antoine antoine491292 avril 1 21:02 parquet.lib > -rw-rw-r-- 1 antoine antoine 128473780 avril 1 21:03 parquet_static.lib > drwxrwxr-x 2 antoine antoine 4096 avril 3 10:27 pkgconfig > {code} > (see files in https://anaconda.org/conda-forge/arrow-cpp/files ) > We should probably only ship dynamic libraries under Windows, as those are > reasonably small. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Updated] (ARROW-6231) [Python] Consider assigning default column names when reading CSV file and header_rows=0
[ https://issues.apache.org/jira/browse/ARROW-6231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6231: -- Labels: csv pull-request-available (was: csv) > [Python] Consider assigning default column names when reading CSV file and > header_rows=0 > > > Key: ARROW-6231 > URL: https://issues.apache.org/jira/browse/ARROW-6231 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Antoine Pitrou >Priority: Major > Labels: csv, pull-request-available > Fix For: 0.15.0 > > > This is a slight usability rough edge. Assigning default names (like "f0, f1, > ...") would probably be better since then at least you can see how many > columns there are and what is in them. > {code} > In [10]: parse_options = csv.ParseOptions(delimiter='|', header_rows=0) > > > In [11]: %time table = csv.read_csv('Performance_2016Q4.txt', > parse_options=parse_options) > > --- > ArrowInvalid Traceback (most recent call last) > in > ~/miniconda/envs/pyarrow-14-1/lib/python3.7/site-packages/pyarrow/_csv.pyx in > pyarrow._csv.read_csv() > ~/miniconda/envs/pyarrow-14-1/lib/python3.7/site-packages/pyarrow/error.pxi > in pyarrow.lib.check_status() > ArrowInvalid: header_rows == 0 needs explicit column names > {code} > In pandas integers are used, so some kind of default string would have to be > defined > {code} > In [18]: df = pd.read_csv('Performance_2016Q4.txt', sep='|', header=None, > low_memory=False) > > In [19]: df.columns > > > Out[19]: > Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, > 16, > 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30], >dtype='int64') > {code} -- This message was sent by Atlassian Jira (v8.3.2#803003)
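The "f0, f1, ..." naming scheme proposed in the issue is straightforward to sketch (function name is illustrative; the eventual pyarrow option may be spelled differently):

```python
def autogenerate_column_names(num_columns):
    """Produce default string column names in the "f0, f1, ..." style,
    so a headerless CSV still gets named columns rather than an error."""
    return [f"f{i}" for i in range(num_columns)]
```

With four columns this yields `["f0", "f1", "f2", "f3"]`, mirroring how pandas falls back to integer labels when `header=None`.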
[jira] [Assigned] (ARROW-6231) [Python] Consider assigning default column names when reading CSV file and header_rows=0
[ https://issues.apache.org/jira/browse/ARROW-6231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou reassigned ARROW-6231: - Assignee: Antoine Pitrou > [Python] Consider assigning default column names when reading CSV file and > header_rows=0 > > > Key: ARROW-6231 > URL: https://issues.apache.org/jira/browse/ARROW-6231 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Antoine Pitrou >Priority: Major > Labels: csv > Fix For: 0.15.0 > > > This is a slight usability rough edge. Assigning default names (like "f0, f1, > ...") would probably be better since then at least you can see how many > columns there are and what is in them. > {code} > In [10]: parse_options = csv.ParseOptions(delimiter='|', header_rows=0) > > > In [11]: %time table = csv.read_csv('Performance_2016Q4.txt', > parse_options=parse_options) > > --- > ArrowInvalid Traceback (most recent call last) > in > ~/miniconda/envs/pyarrow-14-1/lib/python3.7/site-packages/pyarrow/_csv.pyx in > pyarrow._csv.read_csv() > ~/miniconda/envs/pyarrow-14-1/lib/python3.7/site-packages/pyarrow/error.pxi > in pyarrow.lib.check_status() > ArrowInvalid: header_rows == 0 needs explicit column names > {code} > In pandas integers are used, so some kind of default string would have to be > defined > {code} > In [18]: df = pd.read_csv('Performance_2016Q4.txt', sep='|', header=None, > low_memory=False) > > In [19]: df.columns > > > Out[19]: > Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, > 16, > 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30], >dtype='int64') > {code} -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (ARROW-5960) [C++] Boost dependencies are specified in wrong order
[ https://issues.apache.org/jira/browse/ARROW-5960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916711#comment-16916711 ] Ingo Müller commented on ARROW-5960: OK, here it is: https://github.com/apache/arrow/pull/5205 > [C++] Boost dependencies are specified in wrong order > - > > Key: ARROW-5960 > URL: https://issues.apache.org/jira/browse/ARROW-5960 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.14.0 >Reporter: Ingo Müller >Priority: Minor > > The boost dependencies in cpp/CMakeLists.txt are specified in the wrong > order: the system library currently comes first, followed by the filesystem > library. They should be specified in the opposite order, as filesystem > depends on system. > It seems to depend on the version of boost or how it is compiled whether this > problem becomes apparent. I am currently setting up the project like this: > {code:java} > CXX=clang++-7.0 CC=clang-7.0 \ > cmake \ > -DCMAKE_CXX_STANDARD=17 \ > -DCMAKE_INSTALL_PREFIX=/tmp/arrow4/dist \ > -DCMAKE_INSTALL_LIBDIR=lib \ > -DARROW_WITH_RAPIDJSON=ON \ > -DARROW_PARQUET=ON \ > -DARROW_PYTHON=ON \ > -DARROW_FLIGHT=OFF \ > -DARROW_GANDIVA=OFF \ > -DARROW_BUILD_UTILITIES=OFF \ > -DARROW_CUDA=OFF \ > -DARROW_ORC=OFF \ > -DARROW_JNI=OFF \ > -DARROW_TENSORFLOW=OFF \ > -DARROW_HDFS=OFF \ > -DARROW_BUILD_TESTS=OFF \ > -DARROW_RPATH_ORIGIN=ON \ > ..{code} > After compiling, libarrow.so is missing symbols: > {code:java} > nm -C /dist/lib/libarrow.so | grep boost::system::system_c > U boost::system::system_category(){code} > It seems like this is related to whether or not boost has been compiled with > {{BOOST_SYSTEM_NO_DEPRECATED}} (according to [this > post|https://stackoverflow.com/a/30877725/651937], anyway). I have to say > that I don't understand why boost as BUNDLED should be compiled that way...
> If I apply the following patch, everything works as expected: > > {code:java} > diff -pur a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt > --- a/cpp/CMakeLists.txt 2019-06-29 00:26:37.0 +0200 > +++ b/cpp/CMakeLists.txt 2019-07-16 16:36:03.980153919 +0200 > @@ -642,8 +642,8 @@ if(ARROW_STATIC_LINK_LIBS) > add_dependencies(arrow_dependencies ${ARROW_STATIC_LINK_LIBS}) > endif() > -set(ARROW_SHARED_PRIVATE_LINK_LIBS ${ARROW_STATIC_LINK_LIBS} > ${BOOST_SYSTEM_LIBRARY} > - ${BOOST_FILESYSTEM_LIBRARY} > ${BOOST_REGEX_LIBRARY}) > +set(ARROW_SHARED_PRIVATE_LINK_LIBS ${ARROW_STATIC_LINK_LIBS} > ${BOOST_FILESYSTEM_LIBRARY} > + ${BOOST_SYSTEM_LIBRARY} > ${BOOST_REGEX_LIBRARY}) > list(APPEND ARROW_STATIC_LINK_LIBS ${BOOST_SYSTEM_LIBRARY} > ${BOOST_FILESYSTEM_LIBRARY} > ${BOOST_REGEX_LIBRARY}){code} -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (ARROW-6353) [Python] Allow user to select compression level in pyarrow.parquet.write_table
[ https://issues.apache.org/jira/browse/ARROW-6353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916694#comment-16916694 ] Igor Yastrebov commented on ARROW-6353: --- [~martinradev] You are free to work on it if you want. I'd love to see this feature in 0.15.0 but since I won't do it myself I'm in no position to ask for it. As far as I'm concerned, there are only two levels of priority - blocker and non-blocker - but jira admins can correct it if it is a problem. > [Python] Allow user to select compression level in pyarrow.parquet.write_table > -- > > Key: ARROW-6353 > URL: https://issues.apache.org/jira/browse/ARROW-6353 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Igor Yastrebov >Priority: Major > > This feature was introduced for C++ in > [ARROW-6216|https://issues.apache.org/jira/browse/ARROW-6216]. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Assigned] (ARROW-5830) [C++] Stop using memcmp in TensorEquals
[ https://issues.apache.org/jira/browse/ARROW-5830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou reassigned ARROW-5830: - Assignee: Kenta Murata > [C++] Stop using memcmp in TensorEquals > --- > > Key: ARROW-5830 > URL: https://issues.apache.org/jira/browse/ARROW-5830 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Kenta Murata >Assignee: Kenta Murata >Priority: Major > Labels: beginner, pull-request-available > Fix For: 0.15.0 > > Time Spent: 1h 50m > Remaining Estimate: 0h > > Because memcmp is problematic for comparing floating-point values, such as NaNs. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Resolved] (ARROW-5830) [C++] Stop using memcmp in TensorEquals
[ https://issues.apache.org/jira/browse/ARROW-5830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-5830. --- Fix Version/s: 0.15.0 Resolution: Fixed Issue resolved by pull request 5166 [https://github.com/apache/arrow/pull/5166] > [C++] Stop using memcmp in TensorEquals > --- > > Key: ARROW-5830 > URL: https://issues.apache.org/jira/browse/ARROW-5830 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Kenta Murata >Priority: Major > Labels: beginner, pull-request-available > Fix For: 0.15.0 > > Time Spent: 1h 40m > Remaining Estimate: 0h > > Because memcmp is problematic for comparing floating-point values, such as NaNs. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (ARROW-6358) [C++] FileSystem::DeleteDir should make it optional to delete the directory itself
[ https://issues.apache.org/jira/browse/ARROW-6358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916616#comment-16916616 ] Rok Mihevc commented on ARROW-6358: --- Ah yes, sorry, I missed the bucket case. As an occasional S3 user I would be surprised if Arrow deleted a bucket rather than only its contents. But I can imagine it would be useful to have that option sometimes. > [C++] FileSystem::DeleteDir should make it optional to delete the directory > itself > -- > > Key: ARROW-6358 > URL: https://issues.apache.org/jira/browse/ARROW-6358 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 0.14.1 >Reporter: Antoine Pitrou >Priority: Major > > In some situations, it can be desirable to delete the entirety of a > directory's contents, but not the directory itself (e.g. when it's an S3 > bucket). Perhaps we should add an option for that. -- This message was sent by Atlassian Jira (v8.3.2#803003)
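A rough Python sketch of the "delete the contents but keep the container" behavior discussed here (illustrative only; the actual option would live in the C++ FileSystem API):

```python
import shutil
from pathlib import Path

def delete_dir_contents(directory):
    """Remove every entry inside `directory` without removing the
    directory (or, by analogy, the S3 bucket) itself."""
    for child in Path(directory).iterdir():
        if child.is_dir() and not child.is_symlink():
            shutil.rmtree(child)   # recurse into real subdirectories
        else:
            child.unlink()         # files and symlinks
```

After the call the directory still exists but is empty, which is the surprise-free default the comment argues for.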
[jira] [Updated] (ARROW-6366) [Java] Make field vectors final explicitly
[ https://issues.apache.org/jira/browse/ARROW-6366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6366: -- Labels: pull-request-available (was: ) > [Java] Make field vectors final explicitly > -- > > Key: ARROW-6366 > URL: https://issues.apache.org/jira/browse/ARROW-6366 > Project: Apache Arrow > Issue Type: Improvement > Components: Java >Reporter: Liya Fan >Assignee: Liya Fan >Priority: Major > Labels: pull-request-available > > According to the discussion in > [https://lists.apache.org/thread.html/836d3b87ccb6e65e9edf0f220829a29edfa394fc2cd1e0866007d86e@%3Cdev.arrow.apache.org%3E,] > field vectors should not be extended, so they should be made final > explicitly. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Resolved] (ARROW-6229) [C++] Add a DataSource implementation which scans a directory
[ https://issues.apache.org/jira/browse/ARROW-6229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-6229. --- Fix Version/s: 0.15.0 Resolution: Fixed Issue resolved by pull request 5139 [https://github.com/apache/arrow/pull/5139] > [C++] Add a DataSource implementation which scans a directory > - > > Key: ARROW-6229 > URL: https://issues.apache.org/jira/browse/ARROW-6229 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Benjamin Kietzman >Assignee: Benjamin Kietzman >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 5.5h > Remaining Estimate: 0h > > DirectoryBasedDataSource should scan a directory (optionally recursively) on > construction, yielding FileBasedDataFragments -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Resolved] (ARROW-6363) [R] segfault in Table__from_dots with unexpected schema
[ https://issues.apache.org/jira/browse/ARROW-6363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-6363. --- Resolution: Fixed Issue resolved by pull request 5199 [https://github.com/apache/arrow/pull/5199] > [R] segfault in Table__from_dots with unexpected schema > --- > > Key: ARROW-6363 > URL: https://issues.apache.org/jira/browse/ARROW-6363 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 10m > Remaining Estimate: 0h > > {code:r} > > table(b=1L, schema=c(b = int16())) > *** caught segfault *** > address 0x7fada725aed0, cause 'memory not mapped' > {code} -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Resolved] (ARROW-6338) [R] Type function names don't match type names
[ https://issues.apache.org/jira/browse/ARROW-6338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-6338. --- Resolution: Fixed Issue resolved by pull request 5198 [https://github.com/apache/arrow/pull/5198] > [R] Type function names don't match type names > -- > > Key: ARROW-6338 > URL: https://issues.apache.org/jira/browse/ARROW-6338 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 10m > Remaining Estimate: 0h > > I noticed this while working on documentation for ARROW-5505, trying to show > how you could pass an explicit schema definition to make a table. For a few > types, the name of the type that gets printed (and comes from the C++ > library) doesn't match the name of the function you use to specify the type > in a schema: > {code:r} > > tab <- to_arrow(data.frame( > + a = 1:10, > + b = as.numeric(1:10), > + c = sample(c(TRUE, FALSE, NA), 10, replace = TRUE), > + d = letters[1:10], > + stringsAsFactors = FALSE > + )) > > tab$schema > arrow::Schema > a: int32 > b: double > c: bool > d: string > # Alright, let's make that schema > > schema(a = int32(), b = double(), c = bool(), d = string()) > Error in bool() : could not find function "bool" > # Hmm, ok, so bool --> boolean() > > schema(a = int32(), b = double(), c = boolean(), d = string()) > Error in string() : could not find function "string" > # string --> utf8() > > schema(a = int32(), b = double(), c = boolean(), d = utf8()) > Error: type does not inherit from class arrow::DataType > # Wha? > > double() > numeric(0) > # Oh. double is a base R function. 
> > schema(a = int32(), b = float64(), c = boolean(), d = utf8()) > arrow::Schema > a: int32 > b: double > c: bool > d: string > {code} > If you believe this switch statement is correct, these three, along with > float and half_float, are the only mismatches: > [https://github.com/apache/arrow/blob/master/r/R/R6.R#L81-L109] > {code:r} > > schema(b = float64(), c = boolean(), d = utf8(), e = float32(), f = > > float16()) > arrow::Schema > b: double > c: bool > d: string > e: float > f: halffloat > {code} > I can add aliases (i.e. another function that does the same thing) for bool, > string, float, and halffloat, and I can add some magic so that double() (and > even integer()) work inside the schema() function. But in looking into the > C++ side to confirm where these alternate type names were coming from, I saw > some inconsistencies. For example, > https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.h#L773-L788 > suggests that the StringType should report its name as "utf8". But the > ToString method here > https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.cc#L191 has it > report as "string". It's unclear why those should report differently. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Resolved] (ARROW-6323) [R] Expand file paths when passing to readers
[ https://issues.apache.org/jira/browse/ARROW-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-6323. --- Resolution: Fixed Issue resolved by pull request 5169 [https://github.com/apache/arrow/pull/5169] > [R] Expand file paths when passing to readers > - > > Key: ARROW-6323 > URL: https://issues.apache.org/jira/browse/ARROW-6323 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 1h > Remaining Estimate: 0h > > All file paths in R are wrapped in {{fs::path_abs()}}, which handles relative > paths, but it doesn't expand {{~}}, so this fails: > {code:java} > > df <- read_parquet("~/Downloads/demofile.parquet") > Error in io___MemoryMappedFile__Open(fs::path_abs(path), mode) : > IOError: Failed to open local file '~/Downloads/demofile.parquet', error: > No such file or directory > {code} > This is fixed by using {{fs::path_real()}} instead. > Should this be properly handled in C++ though? cc [~pitrou] -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Updated] (ARROW-6367) [C++][Gandiva] Implement string reverse
[ https://issues.apache.org/jira/browse/ARROW-6367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prudhvi Porandla updated ARROW-6367: Description: Add a {{utf8 reverse(utf8)}} function to Gandiva > [C++][Gandiva] Implement string reverse > --- > > Key: ARROW-6367 > URL: https://issues.apache.org/jira/browse/ARROW-6367 > Project: Apache Arrow > Issue Type: Task >Reporter: Prudhvi Porandla >Assignee: Prudhvi Porandla >Priority: Minor > > Add a {{utf8 reverse(utf8)}} function to Gandiva -- This message was sent by Atlassian Jira (v8.3.2#803003)
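In Python terms the requested behavior amounts to a code-point reversal (a sketch, not the Gandiva implementation, which works on raw UTF-8 buffers where naive byte reversal would corrupt multi-byte sequences):

```python
def utf8_reverse(s):
    # Reverse by code point; reversing the encoded bytes instead would
    # break multi-byte UTF-8 sequences. Note this still ignores grapheme
    # clusters (e.g. combining accents), a common caveat for string reverse.
    return s[::-1]
```

For example, `utf8_reverse("héllo")` gives `"olléh"`, whereas reversing the UTF-8 bytes of `"é"` would produce an invalid byte sequence.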
[jira] [Created] (ARROW-6367) [C++][Gandiva] Implement string reverse
Prudhvi Porandla created ARROW-6367: --- Summary: [C++][Gandiva] Implement string reverse Key: ARROW-6367 URL: https://issues.apache.org/jira/browse/ARROW-6367 Project: Apache Arrow Issue Type: Task Reporter: Prudhvi Porandla Assignee: Prudhvi Porandla -- This message was sent by Atlassian Jira (v8.3.2#803003)