[jira] [Created] (ARROW-6790) [Release] Automatically disable integration test cases in release verification

2019-10-03 Thread Bryan Cutler (Jira)
Bryan Cutler created ARROW-6790:
---

 Summary: [Release] Automatically disable integration test cases in 
release verification
 Key: ARROW-6790
 URL: https://issues.apache.org/jira/browse/ARROW-6790
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Developer Tools
Reporter: Bryan Cutler
Assignee: Bryan Cutler


If dev/release/verify-release-candidate.sh is run with selective testing and 
includes integration tests, the selected implementations should be the only 
ones enabled when running the integration test portion. For example:

TEST_DEFAULT=0 \
TEST_CPP=1 \
TEST_JAVA=1 \
TEST_INTEGRATION=1 \
dev/release/verify-release-candidate.sh source 0.15.0 2

This should run the integration tests only for C++ and Java.
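
As a rough illustration of the selection logic requested here (a sketch only, not the script's actual implementation): the mapping table, its TEST_JS and TEST_GO entries, and the function name are assumptions made up for this example. Only TEST_DEFAULT, TEST_CPP, TEST_JAVA and TEST_INTEGRATION come from the issue text.

```python
# Hypothetical mapping from verification flags to integration implementations.
# Only TEST_CPP and TEST_JAVA appear in the issue; the others are assumptions.
IMPLEMENTATION_FLAGS = {
    "TEST_CPP": "cpp",
    "TEST_JAVA": "java",
    "TEST_JS": "js",
    "TEST_GO": "go",
}


def selected_implementations(env):
    """Return the implementations whose TEST_* flag resolves to "1".

    A flag that is not set falls back to TEST_DEFAULT, which is itself
    assumed to default to "1" (everything enabled).
    """
    default = env.get("TEST_DEFAULT", "1")
    return [
        name
        for flag, name in IMPLEMENTATION_FLAGS.items()
        if env.get(flag, default) == "1"
    ]


# The environment from the example in this issue.
example_env = {
    "TEST_DEFAULT": "0",
    "TEST_CPP": "1",
    "TEST_JAVA": "1",
    "TEST_INTEGRATION": "1",
}
print(selected_implementations(example_env))
```

With the environment from the example, only cpp and java are selected, which is the list the integration step would then be restricted to.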



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Docker organization for development images

2019-10-03 Thread Sutou Kouhei
https://hub.docker.com/u/ktou

In "Docker organization for development images" on Thu, 3 Oct 2019 15:10:25 +0200,
  Krisztián Szűcs wrote:

> Hi,
> 
> We've created a docker hub organisation called "arrowdev"
> to host the images defined in the docker-compose.yml, see
> the following commit [1].
> So now it is possible to speed up the image builds by pulling
> the layers first, I suggest to use the --pull flag for building
> images: `docker-compose build --pull cpp`
> 
> We need to manually grant write access for committers and
> PMC members, so please send me your dockerhub username.
> 
> Thanks, Krisztian
> 
> P.S. Github has recently introduced its packaging feature, so
> we'll be able to experiment with hosting docker images on
> GitHub directly which would handle our permission settings
> out of the box. IMO we should try it once it is enabled for
> the apache/arrow repository.
> 
> [1]:
> https://github.com/apache/arrow/commit/1165cdb85b92cefcf59ac39d35f42d168cc64517


[jira] [Created] (ARROW-6789) [Python] Automatically box bytes/buffer-like values yielded from `FlightServerBase.do_action` in Result values

2019-10-03 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6789:
---

 Summary: [Python] Automatically box bytes/buffer-like values 
yielded from `FlightServerBase.do_action` in Result values
 Key: ARROW-6789
 URL: https://issues.apache.org/jira/browse/ARROW-6789
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney
 Fix For: 1.0.0


This will reduce boilerplate in server implementations.
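
A minimal sketch of what such boxing could look like. The Result class below is a hypothetical stand-in for pyarrow.flight.Result so that the example runs without pyarrow, and box_results is a made-up helper name; neither is the actual implementation.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """Hypothetical stand-in for pyarrow.flight.Result (not the real class)."""
    body: bytes


def box_results(values):
    """Wrap bytes-like values in Result; pass existing Result values through."""
    for value in values:
        if isinstance(value, Result):
            yield value
        elif isinstance(value, (bytes, bytearray, memoryview)):
            yield Result(bytes(value))
        else:
            raise TypeError("cannot box %r into a Result" % type(value).__name__)


def do_action_body():
    """What a server-side do_action could yield if boxing were automatic."""
    yield b"first"
    yield bytearray(b"second")
    yield Result(b"third")


boxed = list(box_results(do_action_body()))
```

The server code then yields plain bytes, and the wrapping happens in one place instead of in every do_action implementation.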





Re: uncertain about JIRA issue granularity

2019-10-03 Thread Micah Kornfield
Hi John,
It depends on what the change encompasses.  If it affects the format, then
it would be nice to have tracking bugs in all languages to implement the
feature (e.g. adding data to the footer).

If it is an implementation-specific feature, then issues are only needed for
the target languages (e.g. file descriptors).

Thanks,
Micah

On Thu, Oct 3, 2019 at 4:42 PM John Muehlhausen  wrote:

> I thought I should open all of the issues for tracking even if I don't
> implement all of them right away?
>
> On Thu, Oct 3, 2019 at 5:46 PM Antoine Pitrou  wrote:
>
> >
> > Le 04/10/2019 à 00:18, John Muehlhausen a écrit :
> > > I need to create two (or more) issues for
> > >   custom_metadata in Footer ...
> > > https://lists.apache.org/thread.html/c3b3d1456b7062a435f6795c0308ccb7c8fe55c818cfed2cf55f76c5@%3Cdev.arrow.apache.org%3E
> > >
> > > and
> > >   memory map based on fd ...
> > > https://lists.apache.org/thread.html/83373ab00f552ee8afd2bac2b2721468b3f28fe283490e379998453a@%3Cdev.arrow.apache.org%3E
> > >
> > > For the first one, is this five separate JIRA issues?
> > > - [Format] add a custom_metadata:[KeyValue] field to the Footer table in
> > > File.fbs
> > > - [C++] access File Footer custom_metadata
> > > - [Python] access File Footer custom_metadata
> > > - [JS] access File Footer custom_metadata
> > > - [Java] access File Footer custom_metadata
> >
> > One JIRA per independent implementation, at least.  Python and C++ can
> > be in the same issue, since PyArrow is a set of wrappers around the C++
> > implementation.
> >
> > > For the second, is it four? (One per language?)
> > > - [C++] retrieve fd of open memory mapped file and Open() memory mapped
> > > file by fd
> >
> > Same.  But why do you want to open this issue for every language?  Do
> > you really need this in all implementations?
> >
> > Regards
> >
> > Antoine.
> >
>


Re: uncertain about JIRA issue granularity

2019-10-03 Thread John Muehlhausen
I thought I should open all of the issues for tracking even if I don't
implement all of them right away?

On Thu, Oct 3, 2019 at 5:46 PM Antoine Pitrou  wrote:

>
> Le 04/10/2019 à 00:18, John Muehlhausen a écrit :
> > I need to create two (or more) issues for
> >   custom_metadata in Footer ...
> > https://lists.apache.org/thread.html/c3b3d1456b7062a435f6795c0308ccb7c8fe55c818cfed2cf55f76c5@%3Cdev.arrow.apache.org%3E
> >
> > and
> >   memory map based on fd ...
> > https://lists.apache.org/thread.html/83373ab00f552ee8afd2bac2b2721468b3f28fe283490e379998453a@%3Cdev.arrow.apache.org%3E
> >
> > For the first one, is this five separate JIRA issues?
> > - [Format] add a custom_metadata:[KeyValue] field to the Footer table in
> > File.fbs
> > - [C++] access File Footer custom_metadata
> > - [Python] access File Footer custom_metadata
> > - [JS] access File Footer custom_metadata
> > - [Java] access File Footer custom_metadata
>
> One JIRA per independent implementation, at least.  Python and C++ can
> be in the same issue, since PyArrow is a set of wrappers around the C++
> implementation.
>
> > For the second, is it four? (One per language?)
> > - [C++] retrieve fd of open memory mapped file and Open() memory mapped
> > file by fd
>
> Same.  But why do you want to open this issue for every language?  Do
> you really need this in all implementations?
>
> Regards
>
> Antoine.
>


Re: arrow::io::MemoryMappedFile from fd rather than path

2019-10-03 Thread Antoine Pitrou


Le 04/10/2019 à 00:31, John Muehlhausen a écrit :
> http://lackingrhoticity.blogspot.com/2015/05/passing-fds-handles-between-processes.html
> 
> If I'm reading this correctly, it doesn't affect our Open(fd) API on
> Windows, but only how descriptors are communicated between processes that
> want to make use of it.

Yeah, well, that part will be completely different :-)  But it's not
part of Arrow currently (Plasma has it, but it's POSIX-only precisely).

Regards

Antoine.


Re: uncertain about JIRA issue granularity

2019-10-03 Thread Antoine Pitrou


Le 04/10/2019 à 00:18, John Muehlhausen a écrit :
> I need to create two (or more) issues for
>   custom_metadata in Footer ...
> https://lists.apache.org/thread.html/c3b3d1456b7062a435f6795c0308ccb7c8fe55c818cfed2cf55f76c5@%3Cdev.arrow.apache.org%3E
> 
> and
>   memory map based on fd ...
> https://lists.apache.org/thread.html/83373ab00f552ee8afd2bac2b2721468b3f28fe283490e379998453a@%3Cdev.arrow.apache.org%3E
> 
> For the first one, is this five separate JIRA issues?
> - [Format] add a custom_metadata:[KeyValue] field to the Footer table in
> File.fbs
> - [C++] access File Footer custom_metadata
> - [Python] access File Footer custom_metadata
> - [JS] access File Footer custom_metadata
> - [Java] access File Footer custom_metadata

One JIRA per independent implementation, at least.  Python and C++ can
be in the same issue, since PyArrow is a set of wrappers around the C++
implementation.

> For the second, is it four? (One per language?)
> - [C++] retrieve fd of open memory mapped file and Open() memory mapped
> file by fd

Same.  But why do you want to open this issue for every language?  Do
you really need this in all implementations?

Regards

Antoine.


Re: arrow::io::MemoryMappedFile from fd rather than path

2019-10-03 Thread John Muehlhausen
http://lackingrhoticity.blogspot.com/2015/05/passing-fds-handles-between-processes.html

If I'm reading this correctly, it doesn't affect our Open(fd) API on
Windows, but only how descriptors are communicated between processes that
want to make use of it.

On Thu, Oct 3, 2019 at 4:24 PM Antoine Pitrou  wrote:

>
> Le 03/10/2019 à 23:21, John Muehlhausen a écrit :
> >
> > Would we just make a variant of Open() that takes a fd rather than a path?
>
> That sounds like a good idea.  Would you like to open a JIRA and a PR?
>
> > Would this API have any analogy on Windows?  Do we have platform-specific
> > functionality?
>
> File descriptors exist on Windows, so it should be fine there as well.
>
> Regards
>
> Antoine.
>


uncertain about JIRA issue granularity

2019-10-03 Thread John Muehlhausen
I need to create two (or more) issues for
  custom_metadata in Footer ...
https://lists.apache.org/thread.html/c3b3d1456b7062a435f6795c0308ccb7c8fe55c818cfed2cf55f76c5@%3Cdev.arrow.apache.org%3E

and
  memory map based on fd ...
https://lists.apache.org/thread.html/83373ab00f552ee8afd2bac2b2721468b3f28fe283490e379998453a@%3Cdev.arrow.apache.org%3E

For the first one, is this five separate JIRA issues?
- [Format] add a custom_metadata:[KeyValue] field to the Footer table in
File.fbs
- [C++] access File Footer custom_metadata
- [Python] access File Footer custom_metadata
- [JS] access File Footer custom_metadata
- [Java] access File Footer custom_metadata

For the second, is it four? (One per language?)
- [C++] retrieve fd of open memory mapped file and Open() memory mapped
file by fd
- [Python] retrieve fd of open memory mapped file and Open() memory mapped
file by fd
- [JS] retrieve fd of open memory mapped file and Open() memory mapped file
by fd
- [Java] retrieve fd of open memory mapped file and Open() memory mapped
file by fd

(I will also work on ARROW-5916 if I can carve out the time)


[jira] [Created] (ARROW-6788) [CI] Migrate Travis CI lint job to GitHub Actions

2019-10-03 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6788:
---

 Summary: [CI] Migrate Travis CI lint job to GitHub Actions
 Key: ARROW-6788
 URL: https://issues.apache.org/jira/browse/ARROW-6788
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration
Reporter: Wes McKinney
 Fix For: 1.0.0


Depends on ARROW-5802. As far as I can tell, GitHub Actions jobs run more or 
less immediately, so this will give more prompt feedback to contributors.





Re: arrow::io::MemoryMappedFile from fd rather than path

2019-10-03 Thread Antoine Pitrou


Le 03/10/2019 à 23:21, John Muehlhausen a écrit :
> 
> Would we just make a variant of Open() that takes a fd rather than a path?

That sounds like a good idea.  Would you like to open a JIRA and a PR?

> Would this API have any analogy on Windows?  Do we have platform-specific
> functionality?

File descriptors exist on Windows, so it should be fine there as well.

Regards

Antoine.


[jira] [Created] (ARROW-6787) [CI] Decommission "C++ with clang 7 and system packages" Travis CI job

2019-10-03 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6787:
---

 Summary: [CI] Decommission "C++ with clang 7 and system packages" 
Travis CI job
 Key: ARROW-6787
 URL: https://issues.apache.org/jira/browse/ARROW-6787
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


Now that this is running in GitHub Actions, we can probably skip it in Travis 
CI?

Any other barriers to turning this off and saving the CI build time?





arrow::io::MemoryMappedFile from fd rather than path

2019-10-03 Thread John Muehlhausen
I have a situation where multiple processes need to access a memory mapped
file.

However, between the time the first process maps the file and the time a
subsequent process in the group maps the file, the file may have been
removed from the filesystem.  (I.e. has no "path")  Coordinating the cache
pruner (which would remove the file) to not affect the overall "atomicity"
of the process group would be a real chore.

Therefore I need to communicate and use the file descriptor rather than the
path name when subsequent processes map the file.  (Using SCM_RIGHTS on a
unix socket or /proc/.../fd are a couple of ways that come to mind.  The
processes cannot simply inherit the fd, since the parent proc is often the
late joiner.)

Would we just make a variant of Open() that takes a fd rather than a path?

Related to this, I need to be able to discover the fd of a mapped file, and
I need these APIs in Python as well.

Would this API have any analogue on Windows?  Do we have platform-specific
functionality?

Thoughts?

-John
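
To make the scenario above concrete, here is a minimal POSIX-only sketch using just the Python standard library (not Arrow's API): a file is unlinked, its descriptor is passed over a Unix socket with SCM_RIGHTS, and the receiver maps it by fd. The helper names are made up for this example, and a socketpair within one process stands in for two cooperating processes.

```python
import array
import mmap
import os
import socket
import tempfile


def send_fd(sock, fd):
    """Send one file descriptor over a Unix socket using SCM_RIGHTS."""
    fds = array.array("i", [fd])
    sock.sendmsg([b"F"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, fds)])


def recv_fd(sock):
    """Receive one file descriptor sent with send_fd()."""
    fds = array.array("i")
    _msg, ancdata, _flags, _addr = sock.recvmsg(1, socket.CMSG_LEN(fds.itemsize))
    for level, kind, data in ancdata:
        if level == socket.SOL_SOCKET and kind == socket.SCM_RIGHTS:
            usable = len(data) - (len(data) % fds.itemsize)
            fds.frombytes(data[:usable])
    if not fds:
        raise RuntimeError("no file descriptor received")
    return fds[0]


def map_unlinked_file(payload):
    """Map a file by descriptor after its path has been removed."""
    fd, path = tempfile.mkstemp()
    sender, receiver = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        os.write(fd, payload)
        os.unlink(path)  # the file no longer has a path, only open descriptors
        send_fd(sender, fd)
        received = recv_fd(receiver)
        try:
            size = os.fstat(received).st_size
            with mmap.mmap(received, size, access=mmap.ACCESS_READ) as mapped:
                return bytes(mapped[:size])
        finally:
            os.close(received)
    finally:
        sender.close()
        receiver.close()
        os.close(fd)
```

The point of the sketch is that the mapping step only needs the descriptor, so an Open() variant taking a fd would cover the case where the path is already gone.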


[jira] [Created] (ARROW-6786) [C++] arrow-dataset-file-parquet-test is slow

2019-10-03 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-6786:
-

 Summary: [C++] arrow-dataset-file-parquet-test is slow
 Key: ARROW-6786
 URL: https://issues.apache.org/jira/browse/ARROW-6786
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Antoine Pitrou


It takes 15 seconds in debug mode (probably more with ASAN / UBSAN / etc.) to 
run 2 tests that simply iterate through a generated in-memory dataset:
{code}
$ ./build-test/debug/arrow-dataset-file-parquet-test 
Running main() from 
/home/conda/feedstock_root/build_artifacts/gtest_1551008230529/work/googletest/src/gtest_main.cc
[==] Running 2 tests from 1 test case.
[--] Global test environment set-up.
[--] 2 tests from TestParquetFileFormat
[ RUN  ] TestParquetFileFormat.ScanRecordBatchReader
[   OK ] TestParquetFileFormat.ScanRecordBatchReader (7338 ms)
[ RUN  ] TestParquetFileFormat.Inspect
[   OK ] TestParquetFileFormat.Inspect (6222 ms)
[--] 2 tests from TestParquetFileFormat (13560 ms total)

[--] Global test environment tear-down
[==] 2 tests from 1 test case ran. (13560 ms total)
[  PASSED  ] 2 tests.
{code}

Unless it is stressing something in particular, the number of repetitions or 
the batch size can probably be reduced dramatically.





Re: Collecting Arrow critique and our roadmap on that

2019-10-03 Thread Bryan Cutler
A lot of good info here, I added a point that has come up often for me.

On Thu, Oct 3, 2019 at 10:03 AM Wes McKinney  wrote:

> I read through and left some comments.
>
> Would be great to turn into an FAQ section in the docs and add a link
> to the navigation on the front page of the website.
>
> On Mon, Sep 23, 2019 at 1:22 PM Uwe L. Korn  wrote:
> >
> > Thanks to all the contributions that already came in. I made some more
> > additions and hope to turn this into a PR to the site soon.
> >
> > Uwe
> >
> > On Fri, Sep 20, 2019, at 10:46 AM, Micah Kornfield wrote:
> > > I think this is a good idea, as well.  I added comments and additions on
> > > the document.
> > >
> > > On Thu, Sep 19, 2019 at 11:47 AM Neal Richardson <
> > > neal.p.richard...@gmail.com> wrote:
> > >
> > > > Uwe, I think this is an excellent idea. I've started
> > > >
> > > > https://docs.google.com/document/d/1cgN7mYzH30URDTaioHsCP2d80wKKHDNs9f5s7vdb2mA/edit?usp=sharing
> > > > to collect some ideas and notes. Once we have gathered our thoughts
> > > > there, we can put them in the appropriate places.
> > > >
> > > > I think that some of the result will go into the FAQ, some into
> > > > documentation (maybe more "how-to" and "getting started" guides in the
> > > > respective language docs, as well as some "how to share Arrow data
> > > > from X to Y"), and other things that we haven't yet done should go
> > > > into a sort of Roadmap document on the main website. We have some very
> > > > outdated content related to a roadmap on the confluence wiki that
> > > > should be folded in as appropriate too.
> > > >
> > > > Neal
> > > >
> > > > On Thu, Sep 19, 2019 at 10:26 AM Uwe L. Korn wrote:
> > > > >
> > > > > Hello,
> > > > >
> > > > > there have been a lot of public discussions lately with some
> > > > > mentions of actually informed, valid critique of things in the
> > > > > Arrow project. From my perspective, these things include "there is
> > > > > no STL-native C++ Arrow API", "the base build requires too many
> > > > > dependencies", "the pyarrow package is really huge and you cannot
> > > > > select single components". These are things we cannot tackle at
> > > > > the moment due to the lack of contributors to the project. But we
> > > > > can use this as a basis to point out to people who critique the
> > > > > project on these points that this is not intentional but due to a
> > > > > lack of resources, and it also provides another point of entry for
> > > > > new contributors looking for work.
> > > > >
> > > > > Thus I would like to start a document (possibly on the website)
> > > > > where we list the major critiques of Arrow, mention our long-term
> > > > > solution to each, and what JIRAs need to be done for that.
> > > > >
> > > > > Would that be something others would also see as valuable?
> > > > >
> > > > > There has also been a lot of uninformed criticism; I think that
> > > > > can be best combated by documentation, blog posts and public
> > > > > appearances at conferences, and it is not covered by this proposal.
> > > > >
> > > > > Uwe
> > > >
> > >
>


[jira] [Created] (ARROW-6785) [JS] Remove superfluous child assignment

2019-10-03 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6785:
---

 Summary: [JS] Remove superfluous child assignment
 Key: ARROW-6785
 URL: https://issues.apache.org/jira/browse/ARROW-6785
 Project: Apache Arrow
  Issue Type: Bug
  Components: JavaScript
Reporter: Wes McKinney
 Fix For: 1.0.0


Per PR





[jira] [Created] (ARROW-6784) [C++][R] Move filter, take, select C++ code from Rcpp to C++ library

2019-10-03 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-6784:
--

 Summary: [C++][R] Move filter, take, select C++ code from Rcpp to 
C++ library
 Key: ARROW-6784
 URL: https://issues.apache.org/jira/browse/ARROW-6784
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Neal Richardson
 Fix For: 1.0.0


Followup to ARROW-3808 and some other previous work. Of particular interest:
 * Filter and Take methods for ChunkedArray, in r/src/compute.cpp
 * Methods for that and some other things that apply Array and ChunkedArray 
methods across the columns of a RecordBatch or Table, respectively
 * RecordBatch__select and Table__select to take columns





Re: Docker organization for development images

2019-10-03 Thread Bryan Cutler
Sounds good, thanks Krisztian!

On Thu, Oct 3, 2019 at 6:10 AM Krisztián Szűcs 
wrote:

> Hi,
>
> We've created a docker hub organisation called "arrowdev"
> to host the images defined in the docker-compose.yml, see
> the following commit [1].
> So now it is possible to speed up the image builds by pulling
> the layers first, I suggest to use the --pull flag for building
> images: `docker-compose build --pull cpp`
>
> We need to manually grant write access for committers and
> PMC members, so please send me your dockerhub username.
>
> Thanks, Krisztian
>
> P.S. Github has recently introduced its packaging feature, so
> we'll be able to experiment with hosting docker images on
> GitHub directly which would handle our permission settings
> out of the box. IMO we should try it once it is enabled for
> the apache/arrow repository.
>
> [1]:
>
> https://github.com/apache/arrow/commit/1165cdb85b92cefcf59ac39d35f42d168cc64517
>


Re: Collecting Arrow critique and our roadmap on that

2019-10-03 Thread Wes McKinney
I read through and left some comments.

Would be great to turn into an FAQ section in the docs and add a link
to the navigation on the front page of the website.

On Mon, Sep 23, 2019 at 1:22 PM Uwe L. Korn  wrote:
>
> Thanks to all the contributions that already came in. I made some more
> additions and hope to turn this into a PR to the site soon.
>
> Uwe
>
> On Fri, Sep 20, 2019, at 10:46 AM, Micah Kornfield wrote:
> > I think this is a good idea, as well.  I added comments and additions on
> > the document.
> >
> > On Thu, Sep 19, 2019 at 11:47 AM Neal Richardson <
> > neal.p.richard...@gmail.com> wrote:
> >
> > > Uwe, I think this is an excellent idea. I've started
> > >
> > > https://docs.google.com/document/d/1cgN7mYzH30URDTaioHsCP2d80wKKHDNs9f5s7vdb2mA/edit?usp=sharing
> > > to collect some ideas and notes. Once we have gathered our thoughts
> > > there, we can put them in the appropriate places.
> > >
> > > I think that some of the result will go into the FAQ, some into
> > > documentation (maybe more "how-to" and "getting started" guides in the
> > > respective language docs, as well as some "how to share Arrow data
> > > from X to Y"), and other things that we haven't yet done should go
> > > into a sort of Roadmap document on the main website. We have some very
> > > outdated content related to a roadmap on the confluence wiki that
> > > should be folded in as appropriate too.
> > >
> > > Neal
> > >
> > > On Thu, Sep 19, 2019 at 10:26 AM Uwe L. Korn  wrote:
> > > >
> > > > Hello,
> > > >
> > > > there have been a lot of public discussions lately with some mentions
> > > > of actually informed, valid critique of things in the Arrow project.
> > > > From my perspective, these things include "there is no STL-native C++
> > > > Arrow API", "the base build requires too many dependencies", "the
> > > > pyarrow package is really huge and you cannot select single
> > > > components". These are things we cannot tackle at the moment due to
> > > > the lack of contributors to the project. But we can use this as a
> > > > basis to point out to people who critique the project on these points
> > > > that this is not intentional but due to a lack of resources, and it
> > > > also provides another point of entry for new contributors looking for
> > > > work.
> > > >
> > > > Thus I would like to start a document (possibly on the website) where
> > > > we list the major critiques of Arrow, mention our long-term solution
> > > > to each, and what JIRAs need to be done for that.
> > > >
> > > > Would that be something others would also see as valuable?
> > > >
> > > > There has also been a lot of uninformed criticism; I think that can be
> > > > best combated by documentation, blog posts and public appearances at
> > > > conferences, and it is not covered by this proposal.
> > > >
> > > > Uwe
> > >
> >


[jira] [Created] (ARROW-6783) [C++] Provide API for reconstruction of RecordBatch from Flatbuffer containing process memory addresses instead of relative offsets into an IPC message

2019-10-03 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6783:
---

 Summary: [C++] Provide API for reconstruction of RecordBatch from 
Flatbuffer containing process memory addresses instead of relative offsets into 
an IPC message
 Key: ARROW-6783
 URL: https://issues.apache.org/jira/browse/ARROW-6783
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


A lot of our development has focused on _inter_process communication rather 
than _in_process communication. We should start by making sure we have 
disassembly and reassembly implemented where the Buffer Flatbuffers values 
contain process memory addresses rather than offsets. This may require a bit 
of refactoring so we can use the same reassembly code path for both use cases.
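
To illustrate the difference between describing a buffer by offset and by process memory address, here is a toy sketch in pure Python with ctypes. It is not Arrow's API; the disassemble and reassemble names are borrowed from the wording above purely for illustration, and the caller is assumed to keep the original buffer alive and unresized.

```python
import ctypes


def disassemble(buf):
    """Describe a writable buffer as (address, length), with no copy."""
    c_view = (ctypes.c_ubyte * len(buf)).from_buffer(buf)
    return ctypes.addressof(c_view), len(buf)


def reassemble(address, length):
    """Rebuild a zero-copy view over `length` bytes starting at `address`.

    The caller must keep the original buffer alive and unresized for as
    long as the view is in use; nothing here enforces that.
    """
    return (ctypes.c_ubyte * length).from_address(address)


data = bytearray(b"arrow")
address, length = disassemble(data)
view = reassemble(address, length)
data[0] = ord("A")  # a write through the original is visible in the view
print(bytes(view))
```

Because the view aliases the same memory, no bytes are moved between the two steps; the price is that lifetime management becomes the caller's responsibility.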





[jira] [Created] (ARROW-6782) [C++] Build minimal core Arrow libraries without any Boost headers

2019-10-03 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6782:
---

 Summary: [C++] Build minimal core Arrow libraries without any 
Boost headers
 Key: ARROW-6782
 URL: https://issues.apache.org/jira/browse/ARROW-6782
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


We have a couple of places where these are used. It would be good to be able to 
build without any Boost headers available





Re: [DISCUSS] raw pointers and FFI (C-level in-process array protocol)

2019-10-03 Thread Wes McKinney
Related: Gandiva invented its own particular way of passing memory
addresses through the JNI boundary rather than using Flatbuffers
messages

https://github.com/apache/arrow/blob/master/cpp/src/gandiva/jni/jni_common.cc#L505

I'm all for language-agnostic in-memory data passing, but there is a
use case for a C API to pass pointers at call sites while avoiding
flattening (disassembly) and unflattening (reassembly) steps.

On Thu, Oct 3, 2019 at 4:34 AM Antoine Pitrou  wrote:
>
>
> Hi Jacques,
>
> Le 03/10/2019 à 02:46, Jacques Nadeau a écrit :
> >
> > I think it is reasonable to argue that keeping any ABI (or header/struct
> > pattern) as narrow as possible would allow us to minimize overlap with the
> > existing in-memory specification. In Arrow's case, this could be as simple
> > as a single memory pointer for schema (backed by flatbuffers) and a single
> > memory location for data (that references the record batch header, which in
> > turn provides pointers into the actual arrow data). [...]
> >
> > [...] (For example, in a JVM
> > view of the world, working with a plain struct in java rather than a set of
> > memory pointers against our existing IPC formats would be quite painful and
> > we'd definitely need to create some glue code for users. I worry the same
> > pattern would occur in many other languages.)
>
> I'm trying to understand the point you're making.  Here you say that it
> was difficult for the JVM to deal with raw pointers.  But above you seem
> to argue for a flatbuffers-based serialization containing raw pointers.
>
> Here's another way to frame the question: how do you propose to do
> zero-copy between different languages if not by passing raw pointers to
> the Arrow data?  And if passing raw pointers is acceptable, what is
> wrong with the spec as proposed?
>
>
> As for creating glue code: yes, of course, that would be needed in most
> languages that want to provide this interface (including C++).  You do
> need a C FFI for that.  I'm quite sure it would be possible to implement
> this proposal in pure Python with ctypes / cffi, for example (as a toy
> example, since PyArrow exists :-)).  When writing the spec, I also took
> a look at the Go and Rust FFIs, and they seem good enough to interact
> with it.  I tried to take a look at JNI, but of course I got lost in the
> documentation :-)
>
> If you are worried that people start thinking that this proposal is part
> of the Arrow specification, perhaps we can make it clear that exposing
> this interface is optional for implementations.
>
> Regards
>
> Antoine.


[jira] [Created] (ARROW-6781) [C++] Improve and consolidate ARROW_CHECK, DCHECK macros

2019-10-03 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-6781:
---

 Summary: [C++] Improve and consolidate ARROW_CHECK, DCHECK macros
 Key: ARROW-6781
 URL: https://issues.apache.org/jira/browse/ARROW-6781
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Ben Kietzman
Assignee: Ben Kietzman


Currently we have multiple macros like {{DCHECK_EQ}} and {{DCHECK_LT}} which 
check various comparisons but don't report anything about their operands. 
Furthermore, the "stream to assertion" pattern for appending extra info has 
proven fragile. I propose a new unified macro which can capture operands of 
comparisons and report them:

{code:cpp}
  int three = 3;
  int five = 5;
  DCHECK(three == five, "extra: ", 1, 2, five);
{code}

Results in check failure messages like:
{code}
F1003 11:12:46.174767  4166 logging_test.cc:141]  Check failed: three == five
  LHS: 3
  RHS: 5
extra: 125
{code}






[NIGHTLY] Arrow Build Report for Job nightly-2019-10-03-0

2019-10-03 Thread Crossbow


Arrow Build Report for Job nightly-2019-10-03-0

All tasks: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0

Failed Tasks:
- wheel-manylinux1-cp37m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-travis-wheel-manylinux1-cp37m
- wheel-manylinux1-cp36m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-travis-wheel-manylinux1-cp36m
- wheel-win-cp36m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-appveyor-wheel-win-cp36m
- wheel-manylinux1-cp27mu:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-travis-wheel-manylinux1-cp27mu
- conda-win-vs2015-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-azure-conda-win-vs2015-py37
- wheel-win-cp37m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-appveyor-wheel-win-cp37m
- conda-osx-clang-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-azure-conda-osx-clang-py36
- wheel-manylinux2010-cp35m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-travis-wheel-manylinux2010-cp35m
- wheel-osx-cp37m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-travis-wheel-osx-cp37m
- conda-linux-gcc-py27:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-azure-conda-linux-gcc-py27
- conda-linux-gcc-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-azure-conda-linux-gcc-py37
- wheel-osx-cp27m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-travis-wheel-osx-cp27m
- wheel-win-cp35m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-appveyor-wheel-win-cp35m
- conda-linux-gcc-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-azure-conda-linux-gcc-py36
- docker-spark-integration:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-circle-docker-spark-integration
- wheel-manylinux2010-cp27mu:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-travis-wheel-manylinux2010-cp27mu
- conda-win-vs2015-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-azure-conda-win-vs2015-py36
- conda-osx-clang-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-azure-conda-osx-clang-py37
- conda-osx-clang-py27:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-azure-conda-osx-clang-py27

Succeeded Tasks:
- docker-js:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-circle-docker-js
- docker-lint:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-circle-docker-lint
- docker-c_glib:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-circle-docker-c_glib
- docker-python-3.6:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-circle-docker-python-3.6
- ubuntu-disco-arm64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-azure-ubuntu-disco-arm64
- docker-iwyu:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-circle-docker-iwyu
- docker-rust:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-circle-docker-rust
- wheel-osx-cp36m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-travis-wheel-osx-cp36m
- gandiva-jar-osx:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-travis-gandiva-jar-osx
- docker-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-circle-docker-cpp
- ubuntu-bionic-arm64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-azure-ubuntu-bionic-arm64
- docker-r:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-circle-docker-r
- docker-pandas-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-circle-docker-pandas-master
- docker-dask-integration:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-circle-docker-dask-integration
- docker-cpp-cmake32:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-circle-docker-cpp-cmake32
- docker-r-conda:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-circle-docker-r-conda
- centos-6:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-azure-centos-6
- docker-clang-format:
  URL: 
https://github.com/ursa-labs/cros

[jira] [Created] (ARROW-6780) [C++][Parquet] Support DurationType in writing/reading parquet

2019-10-03 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-6780:


 Summary: [C++][Parquet] Support DurationType in writing/reading 
parquet
 Key: ARROW-6780
 URL: https://issues.apache.org/jira/browse/ARROW-6780
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Joris Van den Bossche


Currently this is not supported:

{code}
In [37]: table = pa.table({'a': pa.array([1, 2], pa.duration('s'))}) 

In [39]: table
Out[39]: 
pyarrow.Table
a: duration[s]

In [41]: pq.write_table(table, 'test_duration.parquet')
...
ArrowNotImplementedError: Unhandled type for Arrow to Parquet schema 
conversion: duration[s]
{code}

There is no direct mapping to Parquet logical types. There is an INTERVAL type, 
but this more closely matches Arrow's (YEAR_MONTH or DAY_TIME) interval type.

But, those duration values could be stored as just integers, and based on the 
serialized arrow schema, it could be restored when reading back in.
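
A minimal sketch of that round trip, using only the Python standard library 
as a stand-in for pyarrow and Parquet. The JSON payload, the "schema" key and 
the function names are hypothetical and only illustrate the idea: store plain 
integers, keep the original type next to them, and rebuild durations on read.

```python
import datetime
import json

# Map of unit name to the matching datetime.timedelta keyword.
# "ns" is left out because timedelta has no nanosecond resolution.
_TIMEDELTA_KEYWORD = {"s": "seconds", "ms": "milliseconds", "us": "microseconds"}


def write_durations(values, unit):
    """Store durations as plain integers plus a serialized type description."""
    if unit not in _TIMEDELTA_KEYWORD:
        raise ValueError("unsupported unit: %r" % (unit,))
    return json.dumps({"schema": {"type": "duration", "unit": unit},
                       "values": list(values)})


def read_durations(payload):
    """Read the integers back and restore durations if the schema says so."""
    doc = json.loads(payload)
    schema = doc["schema"]
    if schema.get("type") != "duration":
        return doc["values"]  # no stored type information: keep plain integers
    keyword = _TIMEDELTA_KEYWORD[schema["unit"]]
    return [datetime.timedelta(**{keyword: v}) for v in doc["values"]]


restored = read_durations(write_durations([1, 2], "s"))
```

A reader that ignores the stored schema still gets valid integers, which is 
the main appeal of this approach.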



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6779) [Python] Conversion from datetime.datetime to timestamp('ns') can overflow

2019-10-03 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-6779:


 Summary: [Python] Conversion from datetime.datetime to 
timestamp('ns') can overflow
 Key: ARROW-6779
 URL: https://issues.apache.org/jira/browse/ARROW-6779
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


In the python conversion of datetime scalars, there is no check for integer 
overflow:

{code}
In [32]: pa.array([datetime.datetime(3000, 1, 1)], pa.timestamp('ns'))
Out[32]: 
[
  1830-11-23 00:50:52.580896768
]
{code}

So when the target type has nanosecond unit, this can silently give wrong 
results (I don't think the other resolutions can overflow, given the limited 
range of years of datetime.datetime).

We should probably check for this case and raise an error.
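
A minimal sketch of such a check in pure Python (not the actual pyarrow 
conversion code; the function names are made up for illustration). It computes 
the nanosecond value with arbitrary-precision integers, raises if it does not 
fit in a signed 64-bit integer, and also shows how two's-complement wraparound 
produces the 1830 value above.

```python
import datetime

INT64_MIN = -(2 ** 63)
INT64_MAX = 2 ** 63 - 1
EPOCH = datetime.datetime(1970, 1, 1)


def datetime_to_ns(dt):
    """Exact nanoseconds since the epoch for a naive datetime, as a Python int."""
    delta = dt - EPOCH
    micros = (delta.days * 86400 + delta.seconds) * 1_000_000 + delta.microseconds
    return micros * 1000


def checked_datetime_to_ns(dt):
    """Like datetime_to_ns, but refuse values that do not fit in int64."""
    ns = datetime_to_ns(dt)
    if not INT64_MIN <= ns <= INT64_MAX:
        raise OverflowError("%r does not fit in timestamp('ns')" % (dt,))
    return ns


def wrap_int64(value):
    """Simulate the silent two's-complement wraparound of an int64."""
    return (value - INT64_MIN) % 2 ** 64 + INT64_MIN


wrapped = wrap_int64(datetime_to_ns(datetime.datetime(3000, 1, 1)))
# Floor division keeps the sub-second part non-negative.
wrapped_datetime = EPOCH + datetime.timedelta(seconds=wrapped // 10 ** 9)
wrapped_fraction_ns = wrapped % 10 ** 9
print(wrapped_datetime, wrapped_fraction_ns)  # 1830-11-23 00:50:52 580896768
```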





[jira] [Created] (ARROW-6778) [C++] Support DurationType in Cast kernel

2019-10-03 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-6778:


 Summary: [C++] Support DurationType in Cast kernel
 Key: ARROW-6778
 URL: https://issues.apache.org/jira/browse/ARROW-6778
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Joris Van den Bossche








Docker organization for development images

2019-10-03 Thread Krisztián Szűcs
Hi,

We've created a docker hub organisation called "arrowdev"
to host the images defined in the docker-compose.yml, see
the following commit [1].
So it is now possible to speed up image builds by pulling
the layers first. I suggest using the --pull flag when building
images: `docker-compose build --pull cpp`

We need to manually grant write access for committers and
PMC members, so please send me your dockerhub username.

Thanks, Krisztian

P.S. GitHub has recently introduced its packaging feature, so
we'll be able to experiment with hosting docker images on
GitHub directly which would handle our permission settings
out of the box. IMO we should try it once it is enabled for
the apache/arrow repository.

[1]:
https://github.com/apache/arrow/commit/1165cdb85b92cefcf59ac39d35f42d168cc64517


Re: Clarifying interpretation of Buffer "length" field in Arrow protocol

2019-10-03 Thread Wes McKinney
On Thu, Oct 3, 2019 at 7:33 AM Antoine Pitrou  wrote:
>
>
> Le 03/10/2019 à 14:22, Wes McKinney a écrit :
> > On Thu, Oct 3, 2019 at 4:26 AM Antoine Pitrou  wrote:
> >>
> >>
> >> Yeah, I think the spec should be strict.  And for convenience, I'd say
> >> it should probably be the padded length (though I don't have a strong
> >> opinion).
> >
> > The reason I'm against this is that it makes it impossible for a
> > producer to preserve the exact state of its buffers for a consumer.
> >
> > For example, if you have a 1-byte validity bitmap, and you do not have
> > the flexibility to indicate in the metadata that the length is either
> > 1 (unpadded) or 8 (padded), then the producer only will ever see 8
> > bytes.
>
> I see.  Then we should mandate the non-padded length, IMHO.

I think all that needs to be said is that an unpadded size is not
invalid. If a consumer is passed a buffer that is larger than it needs
to be, there is no harm done. Perhaps I can tweak the language so that
there is less uncertainty.
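
To make the consumer side concrete, here is a rough sketch in plain Python
(the helper names are hypothetical, not part of any Arrow library). A
validity bitmap for N values needs ceil(N / 8) bytes, so a consumer only has
to check that the reported length is at least that much; whether the
metadata says 1 or 8 makes no difference to how the bits are read.

```python
def validity_bytes_needed(num_values):
    """Minimum number of bytes a validity bitmap needs for num_values slots."""
    return (num_values + 7) // 8


def check_validity_buffer(meta_length, num_values):
    """Accept any reported length that is large enough; reject shorter ones."""
    needed = validity_bytes_needed(num_values)
    if meta_length < needed:
        raise ValueError("validity buffer too short: %d < %d" % (meta_length, needed))
    return needed


def is_valid(bitmap, i):
    """Arrow validity bitmaps use least-significant-bit ordering within a byte."""
    return bool((bitmap[i // 8] >> (i % 8)) & 1)


# A 1-byte bitmap (0b00000101) followed by 7 bytes of padding on the wire.
on_wire = b"\x05" + b"\x00" * 7
for reported_length in (1, 8):
    check_validity_buffer(reported_length, num_values=5)
```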

> Regards
>
> Antoine.


Re: Clarifying interpretation of Buffer "length" field in Arrow protocol

2019-10-03 Thread Antoine Pitrou


Le 03/10/2019 à 14:22, Wes McKinney a écrit :
> On Thu, Oct 3, 2019 at 4:26 AM Antoine Pitrou  wrote:
>>
>>
>> Yeah, I think the spec should be strict.  And for convenience, I'd say
>> it should probably be the padded length (though I don't have a strong
>> opinion).
> 
> The reason I'm against this is that it makes it impossible for a
> producer to preserve the exact state of its buffers for a consumer.
> 
> For example, if you have a 1-byte validity bitmap, and you do not have
> the flexibility to indicate in the metadata that the length is either
> 1 (unpadded) or 8 (padded), then the producer only will ever see 8
> bytes.

I see.  Then we should mandate the non-padded length, IMHO.

Regards

Antoine.


Re: Clarifying interpretation of Buffer "length" field in Arrow protocol

2019-10-03 Thread Wes McKinney
On Thu, Oct 3, 2019 at 4:26 AM Antoine Pitrou  wrote:
>
>
> Yeah, I think the spec should be strict.  And for convenience, I'd say
> it should probably be the padded length (though I don't have a strong
> opinion).

The reason I'm against this is that it makes it impossible for a
producer to preserve the exact state of its buffers for a consumer.

For example, if you have a 1-byte validity bitmap, and you do not have
the flexibility to indicate in the metadata that the length is either
1 (unpadded) or 8 (padded), then the consumer will only ever see 8
bytes.

Note that padding is only performed in the context of the encapsulated IPC
format. If the metadata is used to communicate in-memory pointers then
it is not appropriate to pad lengths if they are not already padded.
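
As a rough illustration of the two conventions for the encapsulated format
(plain Python, with hypothetical helper names rather than any real Arrow
API): the bytes written are the same either way, and only the length
recorded in the metadata differs.

```python
def padded_length(nbytes, alignment=8):
    """Round nbytes up to the next multiple of alignment."""
    return (nbytes + alignment - 1) // alignment * alignment


def encode_buffer(data, report_padded, alignment=8):
    """Return (length recorded in the metadata, bytes written to the stream)."""
    padding = padded_length(len(data), alignment) - len(data)
    on_wire = data + b"\x00" * padding
    meta_length = len(on_wire) if report_padded else len(data)
    return meta_length, on_wire


bitmap = b"\x05"  # a 1-byte validity bitmap
unpadded = encode_buffer(bitmap, report_padded=False)  # metadata length 1
padded = encode_buffer(bitmap, report_padded=True)     # metadata length 8
```

Passing alignment=64 gives the 512-bit padding mentioned further down the
thread for AVX512.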

> Regards
>
> Antoine.
>
>
> Le 03/10/2019 à 06:23, Micah Kornfield a écrit :
> > Hi Wes,
> > It seems fine to be flexible here.  However:
> >
> >
> >> This could have implications for hashing or
> >> comparisons, for example, so I think that having the flexibility to do
> >> either is a good idea.
> >
> > This statement of use-cases makes me a little nervous.  It seems like it
> > could lead to bugs if a consumer is reading from two producers that use
> > different alternatives?
> >
> > Thanks,
> > Micah
> >
> > On Mon, Sep 30, 2019 at 5:24 PM Wes McKinney  wrote:
> >
> >> I just updated my pull request from May adding language to clarify
> >> what protocol writers are expected to set when producing the Arrow
> >> binary protocol
> >>
> >> https://github.com/apache/arrow/pull/4370
> >>
> >> Implementations may allocate small buffers, or use memory which does
> >> not meet the 8-byte minimal padding requirements of the Arrow
> >> protocol. It becomes a question, then, whether to set the in-memory
> >> buffer size or the padded size when producing the protocol.
> >>
> >> This PR states that either is acceptable. As an example, a 1-byte
> >> validity buffer could have Buffer metadata stating that the size
> >> either is 1 byte or 8 bytes. Either way, 7 bytes of padding must be
> >> written to conform to the protocol. The metadata, therefore, reflects
> >> the "intent" of the protocol writer for the protocol reader. If the
> >> writer says the length is 1, then the protocol reader understands that
> >> the writer does not expect the reader to concern itself with the 7
> >> bytes of padding. This could have implications for hashing or
> >> comparisons, for example, so I think that having the flexibility to do
> >> either is a good idea.
> >>
> >> For an application that wants to guarantee that AVX512 instructions
> >> can be used on all buffers on the receiver side, it would be
> >> appropriate to include 512-bit padding in the accounting.
> >>
> >> Let me know if others think differently so we can have this properly
> >> documented for the 1.0.0 Format release.
> >>
> >> Thanks,
> >> Wes
> >>
> >


Re: [DISCUSS] Result vs Status

2019-10-03 Thread Antoine Pitrou


Le 03/10/2019 à 06:13, Micah Kornfield a écrit :
> 
>  It was my impression that we had workable solutions for using Result in at
> least Python and Glib/Ruby (I don't know about R).

In Python we do (though it needed a C++-side helper).

Regards

Antoine.


Re: [DISCUSS] raw pointers and FFI (C-level in-process array protocol)

2019-10-03 Thread Antoine Pitrou


Hi Jacques,

Le 03/10/2019 à 02:46, Jacques Nadeau a écrit :
> 
> I think it is reasonable to argue that keeping any ABI (or header/struct
> pattern) as narrow as possible would allow us to minimize overlap with the
> existing in-memory specification. In Arrow's case, this could be as simple
> as a single memory pointer for schema (backed by flatbuffers) and a single
> memory location for data (that references the record batch header, which in
> turn provides pointers into the actual arrow data). [...]
> 
> [...] (For example, in a JVM
> view of the world, working with a plain struct in java rather than a set of
> memory pointers against our existing IPC formats would be quite painful and
> we'd definitely need to create some glue code for users. I worry the same
> pattern would occur in many other languages.)

I'm trying to understand the point you're making.  Here you say that it
was difficult for the JVM to deal with raw pointers.  But above you seem
to argue for a flatbuffers-based serialization containing raw pointers.

Here's another way to frame the question: how do you propose to do
zero-copy between different languages if not by passing raw pointers to
the Arrow data?  And if passing raw pointers is acceptable, what is
wrong with the spec as proposed?


As for creating glue code: yes, of course, that would be needed in most
languages that want to provide this interface (including C++).  You do
need a C FFI for that.  I'm quite sure it would be possible to implement
this proposal in pure Python with ctypes / cffi, for example (as a toy
example, since PyArrow exists :-)).  When writing the spec, I also took
a look at the Go and Rust FFIs, and they seem good enough to interact
with it.  I tried to take a look at JNI, but of course I got lost in the
documentation :-)
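
To give a flavour of what I mean with ctypes, here is a toy sketch. The
struct below is NOT the one in the proposal: ToyInt32Array and its two
fields are made up purely to show a raw pointer being handed over without
copying the data.

```python
import ctypes


class ToyInt32Array(ctypes.Structure):
    """Hypothetical C struct: a length and a raw pointer to the values."""
    _fields_ = [
        ("length", ctypes.c_int64),
        ("data", ctypes.POINTER(ctypes.c_int32)),
    ]


# Producer side: owns the memory and must keep `storage` alive for as long
# as the exported struct is in use.
storage = (ctypes.c_int32 * 4)(10, 20, 30, 40)
exported = ToyInt32Array(
    length=len(storage),
    data=ctypes.cast(storage, ctypes.POINTER(ctypes.c_int32)),
)


def consume(array):
    """Consumer side: read the values through the raw pointer."""
    return [array.data[i] for i in range(array.length)]


values = consume(exported)
```

Because only the pointer is passed, a later write to `storage` is visible
through `exported.data`, which is the zero-copy property in question.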

If you are worried that people will start thinking that this proposal is part
of the Arrow specification, perhaps we can make it clear that exposing
this interface is optional for implementations.

Regards

Antoine.


Re: Clarifying interpretation of Buffer "length" field in Arrow protocol

2019-10-03 Thread Antoine Pitrou


Yeah, I think the spec should be strict.  And for convenience, I'd say
it should probably be the padded length (though I don't have a strong
opinion).

Regards

Antoine.


Le 03/10/2019 à 06:23, Micah Kornfield a écrit :
> Hi Wes,
> It seems fine to be flexible here.  However:
> 
> 
>> This could have implications for hashing or
>> comparisons, for example, so I think that having the flexibility to do
>> either is a good idea.
> 
> This statement of use-cases makes me a little nervous.  It seems like it
> could lead to bugs if a consumer is reading from two producers that use
> different alternatives?
> 
> Thanks,
> Micah
> 
> On Mon, Sep 30, 2019 at 5:24 PM Wes McKinney  wrote:
> 
>> I just updated my pull request from May adding language to clarify
>> what protocol writers are expected to set when producing the Arrow
>> binary protocol
>>
>> https://github.com/apache/arrow/pull/4370
>>
>> Implementations may allocate small buffers, or use memory which does
>> not meet the 8-byte minimal padding requirements of the Arrow
>> protocol. It becomes a question, then, whether to set the in-memory
>> buffer size or the padded size when producing the protocol.
>>
>> This PR states that either is acceptable. As an example, a 1-byte
>> validity buffer could have Buffer metadata stating that the size
>> either is 1 byte or 8 bytes. Either way, 7 bytes of padding must be
>> written to conform to the protocol. The metadata, therefore, reflects
>> the "intent" of the protocol writer for the protocol reader. If the
>> writer says the length is 1, then the protocol reader understands that
>> the writer does not expect the reader to concern itself with the 7
>> bytes of padding. This could have implications for hashing or
>> comparisons, for example, so I think that having the flexibility to do
>> either is a good idea.
>>
>> For an application that wants to guarantee that AVX512 instructions
>> can be used on all buffers on the receiver side, it would be
>> appropriate to include 512-bit padding in the accounting.
>>
>> Let me know if others think differently so we can have this properly
>> documented for the 1.0.0 Format release.
>>
>> Thanks,
>> Wes
>>
>