[jira] [Created] (ARROW-6790) [Release] Automatically disable integration test cases in release verification
Bryan Cutler created ARROW-6790: --- Summary: [Release] Automatically disable integration test cases in release verification Key: ARROW-6790 URL: https://issues.apache.org/jira/browse/ARROW-6790 Project: Apache Arrow Issue Type: Improvement Components: Developer Tools Reporter: Bryan Cutler Assignee: Bryan Cutler If dev/release/verify-release-candidate.sh is run with selective testing and includes integration tests, the selected implementations should be the only ones enabled when running the integration test portion. For example:

{code}
TEST_DEFAULT=0 \
TEST_CPP=1 \
TEST_JAVA=1 \
TEST_INTEGRATION=1 \
dev/release/verify-release-candidate.sh source 0.15.0 2
{code}

should run integration tests only for C++ and Java. -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: Docker organization for development images
https://hub.docker.com/u/ktou In "Docker organization for development images" on Thu, 3 Oct 2019 15:10:25 +0200, Krisztián Szűcs wrote: > Hi, > > We've created a docker hub organisation called "arrowdev" > to host the images defined in the docker-compose.yml, see > the following commit [1]. > So now it is possible to speed up the image builds by pulling > the layers first, I suggest to use the --pull flag for building > images: `docker-compose build --pull cpp` > > We need to manually grant write access for committers and > PMC members, so please send me your dockerhub username. > > Thanks, Krisztian > > P.S. Github has recently introduced its packaging feature, so > we'll be able to experiment with hosting docker images on > GitHub directly which would handle our permission settings > out of the box. IMO we should try it once it is enabled for > the apache/arrow repository. > > [1]: > https://github.com/apache/arrow/commit/1165cdb85b92cefcf59ac39d35f42d168cc64517
[jira] [Created] (ARROW-6789) [Python] Automatically box bytes/buffer-like values yielded from `FlightServerBase.do_action` in Result values
Wes McKinney created ARROW-6789: --- Summary: [Python] Automatically box bytes/buffer-like values yielded from `FlightServerBase.do_action` in Result values Key: ARROW-6789 URL: https://issues.apache.org/jira/browse/ARROW-6789 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Wes McKinney Fix For: 1.0.0 This will help with less boilerplate for server implementations -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: uncertain about JIRA issue granularity
Hi John, It depends on what the change encompasses. If it affects the format then it would be nice to have tracking bugs in all languages to implement the feature (i.e. adding data to the footer). If it is an implementation specific feature then only the target languages need to be implemented (i.e. file descriptors). Thanks, Micah On Thu, Oct 3, 2019 at 4:42 PM John Muehlhausen wrote: > I thought I should open all of the issues for tracking even if I don't > implement all of them right away? > > On Thu, Oct 3, 2019 at 5:46 PM Antoine Pitrou wrote: > > > > > Le 04/10/2019 à 00:18, John Muehlhausen a écrit : > > > I need to create two (or more) issues for > > > custom_metadata in Footer ... > > > > > > https://lists.apache.org/thread.html/c3b3d1456b7062a435f6795c0308ccb7c8fe55c818cfed2cf55f76c5@%3Cdev.arrow.apache.org%3E > > > > > > and > > > memory map based on fd ... > > > > > > https://lists.apache.org/thread.html/83373ab00f552ee8afd2bac2b2721468b3f28fe283490e379998453a@%3Cdev.arrow.apache.org%3E > > > > > > For the first one, is this five separate JIRA issues? > > > - [Format] add a custom_metadata:[KeyValue] field to the Footer table > in > > > File.fbs > > > - [C++] access File Footer custom_metadata > > > - [Python] access File Footer custom_metadata > > > - [JS] access File Footer custom_metadata > > > - [Java] access File Footer custom_metadata > > > > One JIRA per independent implementation, at least. Python and C++ can > > be in the same issue, since PyArrow is a set of wrappers around the C++ > > implementation. > > > > > For the second, is it four? (One per language?) > > > - [C++] retrieve fd of open memory mapped file and Open() memory mapped > > > file by fd > > > > Same. But why do you want to open this issue for every language? Do > > you really need this in all implementations? > > > > Regards > > > > Antoine. > > >
Re: uncertain about JIRA issue granularity
I thought I should open all of the issues for tracking even if I don't implement all of them right away? On Thu, Oct 3, 2019 at 5:46 PM Antoine Pitrou wrote: > > Le 04/10/2019 à 00:18, John Muehlhausen a écrit : > > I need to create two (or more) issues for > > custom_metadata in Footer ... > > > https://lists.apache.org/thread.html/c3b3d1456b7062a435f6795c0308ccb7c8fe55c818cfed2cf55f76c5@%3Cdev.arrow.apache.org%3E > > > > and > > memory map based on fd ... > > > https://lists.apache.org/thread.html/83373ab00f552ee8afd2bac2b2721468b3f28fe283490e379998453a@%3Cdev.arrow.apache.org%3E > > > > For the first one, is this five separate JIRA issues? > > - [Format] add a custom_metadata:[KeyValue] field to the Footer table in > > File.fbs > > - [C++] access File Footer custom_metadata > > - [Python] access File Footer custom_metadata > > - [JS] access File Footer custom_metadata > > - [Java] access File Footer custom_metadata > > One JIRA per independent implementation, at least. Python and C++ can > be in the same issue, since PyArrow is a set of wrappers around the C++ > implementation. > > > For the second, is it four? (One per language?) > > - [C++] retrieve fd of open memory mapped file and Open() memory mapped > > file by fd > > Same. But why do you want to open this issue for every language? Do > you really need this in all implementations? > > Regards > > Antoine. >
Re: arrow::io::MemoryMappedFile from fd rather than path
Le 04/10/2019 à 00:31, John Muehlhausen a écrit : > http://lackingrhoticity.blogspot.com/2015/05/passing-fds-handles-between-processes.html > > If I'm reading this correctly, it doesn't affect our Open(fd) API on > Windows, but only how descriptors are communicated between processes that > want to make use of it. Yeah, well, that part will be completely different :-) But it's not currently part of Arrow (Plasma has it, but it's POSIX-only, precisely). Regards Antoine.
Re: uncertain about JIRA issue granularity
Le 04/10/2019 à 00:18, John Muehlhausen a écrit : > I need to create two (or more) issues for > custom_metadata in Footer ... > https://lists.apache.org/thread.html/c3b3d1456b7062a435f6795c0308ccb7c8fe55c818cfed2cf55f76c5@%3Cdev.arrow.apache.org%3E > > and > memory map based on fd ... > https://lists.apache.org/thread.html/83373ab00f552ee8afd2bac2b2721468b3f28fe283490e379998453a@%3Cdev.arrow.apache.org%3E > > For the first one, is this five separate JIRA issues? > - [Format] add a custom_metadata:[KeyValue] field to the Footer table in > File.fbs > - [C++] access File Footer custom_metadata > - [Python] access File Footer custom_metadata > - [JS] access File Footer custom_metadata > - [Java] access File Footer custom_metadata One JIRA per independent implementation, at least. Python and C++ can be in the same issue, since PyArrow is a set of wrappers around the C++ implementation. > For the second, is it four? (One per language?) > - [C++] retrieve fd of open memory mapped file and Open() memory mapped > file by fd Same. But why do you want to open this issue for every language? Do you really need this in all implementations? Regards Antoine.
Re: arrow::io::MemoryMappedFile from fd rather than path
http://lackingrhoticity.blogspot.com/2015/05/passing-fds-handles-between-processes.html If I'm reading this correctly, it doesn't affect our Open(fd) API on Windows, but only how descriptors are communicated between processes that want to make use of it. On Thu, Oct 3, 2019 at 4:24 PM Antoine Pitrou wrote: > > Le 03/10/2019 à 23:21, John Muehlhausen a écrit : > > > > Would we just make a variant of Open() that takes a fd rather than a > path? > > That sounds like a good idea. Would you like to open a JIRA and a PR? > > > Would this API have any analogy on Windows? Do we have platform-specific > > functionality? > > File descriptors exist on Windows, so it should be fine there as well. > > Regards > > Antoine. >
uncertain about JIRA issue granularity
I need to create two (or more) issues for custom_metadata in Footer ... https://lists.apache.org/thread.html/c3b3d1456b7062a435f6795c0308ccb7c8fe55c818cfed2cf55f76c5@%3Cdev.arrow.apache.org%3E and memory map based on fd ... https://lists.apache.org/thread.html/83373ab00f552ee8afd2bac2b2721468b3f28fe283490e379998453a@%3Cdev.arrow.apache.org%3E For the first one, is this five separate JIRA issues? - [Format] add a custom_metadata:[KeyValue] field to the Footer table in File.fbs - [C++] access File Footer custom_metadata - [Python] access File Footer custom_metadata - [JS] access File Footer custom_metadata - [Java] access File Footer custom_metadata For the second, is it four? (One per language?) - [C++] retrieve fd of open memory mapped file and Open() memory mapped file by fd - [Python] retrieve fd of open memory mapped file and Open() memory mapped file by fd - [JS] retrieve fd of open memory mapped file and Open() memory mapped file by fd - [Java] retrieve fd of open memory mapped file and Open() memory mapped file by fd (I will also work on ARROW-5916 if I can carve out the time)
[jira] [Created] (ARROW-6788) [CI] Migrate Travis CI lint job to GitHub Actions
Wes McKinney created ARROW-6788: --- Summary: [CI] Migrate Travis CI lint job to GitHub Actions Key: ARROW-6788 URL: https://issues.apache.org/jira/browse/ARROW-6788 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration Reporter: Wes McKinney Fix For: 1.0.0 Depends on ARROW-5802. As far as I can tell GitHub Actions jobs run more or less immediately so this will give more prompt feedback to contributors -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: arrow::io::MemoryMappedFile from fd rather than path
Le 03/10/2019 à 23:21, John Muehlhausen a écrit : > > Would we just make a variant of Open() that takes a fd rather than a path? That sounds like a good idea. Would you like to open a JIRA and a PR? > Would this API have any analogy on Windows? Do we have platform-specific > functionality? File descriptors exist on Windows, so it should be fine there as well. Regards Antoine.
[jira] [Created] (ARROW-6787) [CI] Decommission "C++ with clang 7 and system packages" Travis CI job
Wes McKinney created ARROW-6787: --- Summary: [CI] Decommission "C++ with clang 7 and system packages" Travis CI job Key: ARROW-6787 URL: https://issues.apache.org/jira/browse/ARROW-6787 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 Now that this is running in GitHub Actions, we can probably skip it in Travis CI? Any other barriers to turning this off and saving the CI build time? -- This message was sent by Atlassian Jira (v8.3.4#803005)
arrow::io::MemoryMappedFile from fd rather than path
I have a situation where multiple processes need to access a memory mapped file. However, between the time the first process maps the file and the time a subsequent process in the group maps the file, the file may have been removed from the filesystem. (I.e. it has no "path".) Coordinating the cache pruner (which would remove the file) to not affect the overall "atomicity" of the process group would be a real chore. Therefore I need to communicate and use the file descriptor rather than the path name when subsequent processes map the file. (Using SCM_RIGHTS on a unix socket, /proc/.../fd ... as a couple of ways that come to mind; the fd cannot be inherited since the parent proc is often the late joiner.) Would we just make a variant of Open() that takes an fd rather than a path? Related to this, we need to be able to discover the fd of a mapped file, and we need these APIs in Python as well. Would this API have any analogy on Windows? Do we have platform-specific functionality? Thoughts? -John
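The key property this scenario relies on can be demonstrated with the Python stdlib (assuming POSIX semantics): a file that has been unlinked, and so has no path, can still be memory-mapped through a retained file descriptor. This is a sketch of the situation, not Arrow API code:

```python
import mmap
import os
import tempfile

def map_unlinked_file(data: bytes) -> bytes:
    """Write data to a temp file, unlink it, then map it via the fd.

    Demonstrates (on POSIX) that an open fd keeps the file alive even
    after it disappears from the filesystem -- the property an Open(fd)
    variant of arrow::io::MemoryMappedFile would depend on.
    """
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, data)
        os.unlink(path)               # the file now has no path...
        m = mmap.mmap(fd, len(data))  # ...but the fd still maps it
        try:
            return bytes(m)
        finally:
            m.close()
    finally:
        os.close(fd)
```

A cache pruner could therefore unlink files freely while late-joining processes map them from an fd received over a unix socket.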
[jira] [Created] (ARROW-6786) [C++] arrow-dataset-file-parquet-test is slow
Antoine Pitrou created ARROW-6786: - Summary: [C++] arrow-dataset-file-parquet-test is slow Key: ARROW-6786 URL: https://issues.apache.org/jira/browse/ARROW-6786 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Antoine Pitrou It takes 15 seconds in debug mode (probably more with ASAN / UBSAN / etc.) to run 2 tests that simply iterate through a generated in-memory dataset:

{code}
$ ./build-test/debug/arrow-dataset-file-parquet-test
Running main() from /home/conda/feedstock_root/build_artifacts/gtest_1551008230529/work/googletest/src/gtest_main.cc
[==] Running 2 tests from 1 test case.
[--] Global test environment set-up.
[--] 2 tests from TestParquetFileFormat
[ RUN ] TestParquetFileFormat.ScanRecordBatchReader
[ OK ] TestParquetFileFormat.ScanRecordBatchReader (7338 ms)
[ RUN ] TestParquetFileFormat.Inspect
[ OK ] TestParquetFileFormat.Inspect (6222 ms)
[--] 2 tests from TestParquetFileFormat (13560 ms total)
[--] Global test environment tear-down
[==] 2 tests from 1 test case ran. (13560 ms total)
[ PASSED ] 2 tests.
{code}

Unless it is stressing something in particular, the number of repetitions or the batch size can probably be reduced dramatically. -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: Collecting Arrow critique and our roadmap on that
A lot of good info here, I added a point that has come up often for me. On Thu, Oct 3, 2019 at 10:03 AM Wes McKinney wrote: > I read through and left some comments. > > Would be great to turn into an FAQ section in the docs and add a link > to the navigation on the front page of the website. > > On Mon, Sep 23, 2019 at 1:22 PM Uwe L. Korn wrote: > > > > Thanks for all the contributions that already came in. I made some more > additions and hope to turn this into a PR to the site soon. > > > > Uwe > > > > On Fri, Sep 20, 2019, at 10:46 AM, Micah Kornfield wrote: > > > I think this is a good idea, as well. I added comments and additions > on > > > the document. > > > > > > On Thu, Sep 19, 2019 at 11:47 AM Neal Richardson < > > > neal.p.richard...@gmail.com> wrote: > > > > > > > Uwe, I think this is an excellent idea. I've started > > > > > > > > > https://docs.google.com/document/d/1cgN7mYzH30URDTaioHsCP2d80wKKHDNs9f5s7vdb2mA/edit?usp=sharing > > > > to collect some ideas and notes. Once we have gathered our thoughts > > > > there, we can put them in the appropriate places. > > > > > > > > I think that some of the result will go into the FAQ, some into > > > > documentation (maybe more "how-to" and "getting started" guides in > the > > > > respective language docs, as well as some "how to share Arrow data > > > > from X to Y"), and other things that we haven't yet done should go > > > > into a sort of Roadmap document on the main website. We have some > very > > > > outdated content related to a roadmap on the confluence wiki that > > > > should be folded in as appropriate too. > > > > > > > > Neal > > > > > > > > On Thu, Sep 19, 2019 at 10:26 AM Uwe L. Korn > wrote: > > > > > > > > > > Hello, > > > > > > > > > > there have been a lot of public discussions lately with some > mentions of > > > > actually informed, valid critique of things in the Arrow project. 
> From my > > perspective, these things include "there is no STL-native C++ Arrow API", > > "the base build requires too many dependencies", "the pyarrow package is > > really huge and you cannot select single components". These are things we > > cannot tackle at the moment due to the lack of contributors to the project. > > But we can use this as a basis to point people who critique the project on > > these points to the fact that this is not intentional but a lack of resources, > > and it also provides another point of entry for new contributors looking for > > work. > > > > > Thus I would like to start a document (possibly on the website) where we > > list the major critiques on Arrow, mention our long-term solution to that > > and what JIRAs need to be done for that. > > > > > Would that be something others would also see as valuable? > > > > > There has also been a lot of uninformed criticism; I think that can > > best be combated by documentation, blog posts and public appearances at > > conferences and is not covered by this proposal. > > > > > Uwe > > > >
[jira] [Created] (ARROW-6785) [JS] Remove superfluous child assignment
Wes McKinney created ARROW-6785: --- Summary: [JS] Remove superfluous child assignment Key: ARROW-6785 URL: https://issues.apache.org/jira/browse/ARROW-6785 Project: Apache Arrow Issue Type: Bug Components: JavaScript Reporter: Wes McKinney Fix For: 1.0.0 Per PR -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6784) [C++][R] Move filter, take, select C++ code from Rcpp to C++ library
Neal Richardson created ARROW-6784: -- Summary: [C++][R] Move filter, take, select C++ code from Rcpp to C++ library Key: ARROW-6784 URL: https://issues.apache.org/jira/browse/ARROW-6784 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Neal Richardson Fix For: 1.0.0 Followup to ARROW-3808 and some other previous work. Of particular interest: * Filter and Take methods for ChunkedArray, in r/src/compute.cpp * Methods for that and some other things that apply Array and ChunkedArray methods across the columns of a RecordBatch or Table, respectively * RecordBatch__select and Table__select to take columns -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: Docker organization for development images
Sounds good, thanks Krisztian! On Thu, Oct 3, 2019 at 6:10 AM Krisztián Szűcs wrote: > Hi, > > We've created a docker hub organisation called "arrowdev" > to host the images defined in the docker-compose.yml, see > the following commit [1]. > So now it is possible to speed up the image builds by pulling > the layers first, I suggest to use the --pull flag for building > images: `docker-compose build --pull cpp` > > We need to manually grant write access for committers and > PMC members, so please send me your dockerhub username. > > Thanks, Krisztian > > P.S. Github has recently introduced its packaging feature, so > we'll be able to experiment with hosting docker images on > GitHub directly which would handle our permission settings > out of the box. IMO we should try it once it is enabled for > the apache/arrow repository. > > [1]: > > https://github.com/apache/arrow/commit/1165cdb85b92cefcf59ac39d35f42d168cc64517 >
Re: Collecting Arrow critique and our roadmap on that
I read through and left some comments. Would be great to turn into an FAQ section in the docs and add a link to the navigation on the front page of the website. On Mon, Sep 23, 2019 at 1:22 PM Uwe L. Korn wrote: > > Thanks for all the contributions that already came in. I made some more > additions and hope to turn this into a PR to the site soon. > > Uwe > > On Fri, Sep 20, 2019, at 10:46 AM, Micah Kornfield wrote: > > I think this is a good idea, as well. I added comments and additions on > > the document. > > > > On Thu, Sep 19, 2019 at 11:47 AM Neal Richardson < > > neal.p.richard...@gmail.com> wrote: > > > > > Uwe, I think this is an excellent idea. I've started > > > > > > https://docs.google.com/document/d/1cgN7mYzH30URDTaioHsCP2d80wKKHDNs9f5s7vdb2mA/edit?usp=sharing > > > to collect some ideas and notes. Once we have gathered our thoughts > > > there, we can put them in the appropriate places. > > > > > > I think that some of the result will go into the FAQ, some into > > > documentation (maybe more "how-to" and "getting started" guides in the > > > respective language docs, as well as some "how to share Arrow data > > > from X to Y"), and other things that we haven't yet done should go > > > into a sort of Roadmap document on the main website. We have some very > > > outdated content related to a roadmap on the confluence wiki that > > > should be folded in as appropriate too. > > > > > > Neal > > > > > > On Thu, Sep 19, 2019 at 10:26 AM Uwe L. Korn wrote: > > > > > > > > Hello, > > > > > > > > there have been a lot of public discussions lately with some mentions of > > > actually informed, valid critique of things in the Arrow project. From my > > > perspective, these things include "there is no STL-native C++ Arrow API", > > > "the base build requires too many dependencies", "the pyarrow package is > > > really huge and you cannot select single components". These are things we > > > cannot tackle at the moment due to the lack of contributors to the > > > project. > > > But we can use this as a basis to point people who critique the project > > > on > > > these points to the fact that this is not intentional but a lack of resources, > > > and it also provides another point of entry for new contributors looking for work. > > > > > > > > Thus I would like to start a document (possibly on the website) where we > > > list the major critiques on Arrow, mention our long-term solution to that > > > and what JIRAs need to be done for that. > > > > > > > > Would that be something others would also see as valuable? > > > > > > > > There has also been a lot of uninformed criticism; I think that can > > > best be combated by documentation, blog posts and public appearances at > > > conferences and is not covered by this proposal. > > > > > > > > Uwe > > > > >
[jira] [Created] (ARROW-6783) [C++] Provide API for reconstruction of RecordBatch from Flatbuffer containing process memory addresses instead of relative offsets into an IPC message
Wes McKinney created ARROW-6783: --- Summary: [C++] Provide API for reconstruction of RecordBatch from Flatbuffer containing process memory addresses instead of relative offsets into an IPC message Key: ARROW-6783 URL: https://issues.apache.org/jira/browse/ARROW-6783 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 A lot of our development has focused on _inter_process communication rather than _in_process. We should start by making sure we have disassembly and reassembly implemented where the Buffer Flatbuffers values contain process memory addresses rather than offsets. This may require a bit of refactoring so we can use the same reassembly code path for both use cases -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6782) [C++] Build minimal core Arrow libraries without any Boost headers
Wes McKinney created ARROW-6782: --- Summary: [C++] Build minimal core Arrow libraries without any Boost headers Key: ARROW-6782 URL: https://issues.apache.org/jira/browse/ARROW-6782 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 We have a couple of places where these are used. It would be good to be able to build without any Boost headers available -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: [DISCUSS] raw pointers and FFI (C-level in-process array protocol)
Related: Gandiva invented its own particular way of passing memory addresses through the JNI boundary rather than using Flatbuffers messages https://github.com/apache/arrow/blob/master/cpp/src/gandiva/jni/jni_common.cc#L505 I'm all for language-agnostic in-memory data passing, but there is a use case for a C API to pass pointers at call sites while avoiding flattening (disassembly) and unflattening (reassembly) steps. On Thu, Oct 3, 2019 at 4:34 AM Antoine Pitrou wrote: > > > Hi Jacques, > > Le 03/10/2019 à 02:46, Jacques Nadeau a écrit : > > > > I think it is reasonable to argue that keeping any ABI (or header/struct > > pattern) as narrow as possible would allow us to minimize overlap with the > > existing in-memory specification. In Arrow's case, this could be as simple > > as a single memory pointer for schema (backed by flatbuffers) and a single > > memory location for data (that references the record batch header, which in > > turn provides pointers into the actual arrow data). [...] > > > > [...] (For example, in a JVM > > view of the world, working with a plain struct in java rather than a set of > > memory pointers against our existing IPC formats would be quite painful and > > we'd definitely need to create some glue code for users. I worry the same > > pattern would occur in many other languages.) > > I'm trying to understand the point you're making. Here you say that it > was difficult for the JVM to deal with raw pointers. But above you seem > to argue for a flatbuffers-based serialization containing raw pointers. > > Here's another way to frame the question: how do you propose to do > zero-copy between different languages if not by passing raw pointers to > the Arrow data? And if passing raw pointers is acceptable, what is > wrong with the spec as proposed? > > > As for creating glue code: yes, of course, that would be needed in most > languages that want to provide this interface (including C++). You do > need a C FFI for that. 
I'm quite sure it would be possible to implement > this proposal in pure Python with ctypes / cffi, for example (as a toy > example, since PyArrow exists :-)). When writing the spec, I also took > a look at the Go and Rust FFIs, and they seem good enough to interact > with it. I tried to take a look at JNI, but of course I got lost in the > documentation :-) > > If you are worried that people start thinking that this proposal is part > of the Arrow specification, perhaps we can make it clear that exposing > this interface is optional for implementations. > > Regards > > Antoine.
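Antoine's ctypes remark can be made concrete. The struct below is a made-up toy layout for illustration only, NOT the actual proposal's layout: it just shows that handing a raw pointer plus a length across a C-compatible boundary shares memory with no flattening (disassembly) or unflattening (reassembly) step.

```python
import ctypes

class ToyArray(ctypes.Structure):
    # Hypothetical minimal layout for illustration: a raw data pointer
    # and a length -- the bare bones of a C-level array handoff.
    _fields_ = [("data", ctypes.POINTER(ctypes.c_ubyte)),
                ("length", ctypes.c_int64)]

def export_buffer(buf: bytearray) -> ToyArray:
    """'Producer' side: expose the buffer as pointer + length, no copy."""
    arr = (ctypes.c_ubyte * len(buf)).from_buffer(buf)
    ptr = ctypes.cast(arr, ctypes.POINTER(ctypes.c_ubyte))
    return ToyArray(ptr, len(buf))

def import_buffer(toy: ToyArray) -> bytes:
    """'Consumer' side: read the shared memory through the raw pointer."""
    return ctypes.string_at(toy.data, toy.length)
```

Because both sides see the same memory, a mutation through the producer's bytearray is visible to the consumer — exactly the zero-copy behavior (and the lifetime hazard) under discussion in this thread.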
[jira] [Created] (ARROW-6781) [C++] Improve and consolidate ARROW_CHECK, DCHECK macros
Ben Kietzman created ARROW-6781: --- Summary: [C++] Improve and consolidate ARROW_CHECK, DCHECK macros Key: ARROW-6781 URL: https://issues.apache.org/jira/browse/ARROW-6781 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Ben Kietzman Assignee: Ben Kietzman Currently we have multiple macros like {{DCHECK_EQ}} and {{DCHECK_LT}} which check various comparisons but don't report anything about their operands. Furthermore, the "stream to assertion" pattern for appending extra info has proven fragile. I propose a new unified macro which can capture operands of comparisons and report them:

{code:cpp}
int three = 3;
int five = 5;
DCHECK(three == five, "extra: ", 1, 2, five);
{code}

Results in check failure messages like:

{code}
F1003 11:12:46.174767 4166 logging_test.cc:141] Check failed: three == five
LHS: 3
RHS: 5
extra: 125
{code}

-- This message was sent by Atlassian Jira (v8.3.4#803005)
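For illustration, the reporting behavior described above can be mimicked in a few lines of Python. This is a sketch of the idea only, not the proposed C++ macro: the check captures both operands and appends any extra context instead of failing silently.

```python
def check_eq(lhs, rhs, *extra):
    """Raise with an operand report when lhs != rhs.

    A Python sketch (hypothetical helper) of the proposed unified check:
    unlike a bare DCHECK_EQ, the failure message includes both operands
    plus any extra values the caller passed.
    """
    if lhs == rhs:
        return
    msg = "Check failed: lhs == rhs\nLHS: {}\nRHS: {}".format(lhs, rhs)
    if extra:
        msg += "\nextra: " + " ".join(str(e) for e in extra)
    raise AssertionError(msg)
```

In C++ the same effect would come from capturing the comparison's operands at the macro expansion site; the Python version just shows what the resulting report contains.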
[NIGHTLY] Arrow Build Report for Job nightly-2019-10-03-0
Arrow Build Report for Job nightly-2019-10-03-0

All tasks: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0

Failed Tasks:
- wheel-manylinux1-cp37m:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-travis-wheel-manylinux1-cp37m
- wheel-manylinux1-cp36m:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-travis-wheel-manylinux1-cp36m
- wheel-win-cp36m:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-appveyor-wheel-win-cp36m
- wheel-manylinux1-cp27mu:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-travis-wheel-manylinux1-cp27mu
- conda-win-vs2015-py37:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-azure-conda-win-vs2015-py37
- wheel-win-cp37m:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-appveyor-wheel-win-cp37m
- conda-osx-clang-py36:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-azure-conda-osx-clang-py36
- wheel-manylinux2010-cp35m:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-travis-wheel-manylinux2010-cp35m
- wheel-osx-cp37m:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-travis-wheel-osx-cp37m
- conda-linux-gcc-py27:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-azure-conda-linux-gcc-py27
- conda-linux-gcc-py37:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-azure-conda-linux-gcc-py37
- wheel-osx-cp27m:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-travis-wheel-osx-cp27m
- wheel-win-cp35m:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-appveyor-wheel-win-cp35m
- conda-linux-gcc-py36:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-azure-conda-linux-gcc-py36
- docker-spark-integration:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-circle-docker-spark-integration
- wheel-manylinux2010-cp27mu:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-travis-wheel-manylinux2010-cp27mu
- conda-win-vs2015-py36:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-azure-conda-win-vs2015-py36
- conda-osx-clang-py37:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-azure-conda-osx-clang-py37
- conda-osx-clang-py27:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-azure-conda-osx-clang-py27

Succeeded Tasks:
- docker-js:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-circle-docker-js
- docker-lint:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-circle-docker-lint
- docker-c_glib:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-circle-docker-c_glib
- docker-python-3.6:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-circle-docker-python-3.6
- ubuntu-disco-arm64:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-azure-ubuntu-disco-arm64
- docker-iwyu:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-circle-docker-iwyu
- docker-rust:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-circle-docker-rust
- wheel-osx-cp36m:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-travis-wheel-osx-cp36m
- gandiva-jar-osx:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-travis-gandiva-jar-osx
- docker-cpp:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-circle-docker-cpp
- ubuntu-bionic-arm64:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-azure-ubuntu-bionic-arm64
- docker-r:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-circle-docker-r
- docker-pandas-master:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-circle-docker-pandas-master
- docker-dask-integration:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-circle-docker-dask-integration
- docker-cpp-cmake32:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-circle-docker-cpp-cmake32
- docker-r-conda:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-circle-docker-r-conda
- centos-6:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-03-0-azure-centos-6
- docker-clang-format:
  URL: https://github.com/ursa-labs/cros
[jira] [Created] (ARROW-6780) [C++][Parquet] Support DurationType in writing/reading parquet
Joris Van den Bossche created ARROW-6780: Summary: [C++][Parquet] Support DurationType in writing/reading parquet Key: ARROW-6780 URL: https://issues.apache.org/jira/browse/ARROW-6780 Project: Apache Arrow Issue Type: Improvement Reporter: Joris Van den Bossche Currently this is not supported:

{code}
In [37]: table = pa.table({'a': pa.array([1, 2], pa.duration('s'))})

In [39]: table
Out[39]:
pyarrow.Table
a: duration[s]

In [41]: pq.write_table(table, 'test_duration.parquet')
...
ArrowNotImplementedError: Unhandled type for Arrow to Parquet schema conversion: duration[s]
{code}

There is no direct mapping to Parquet logical types. There is an INTERVAL type, but this matches Arrow's interval type (YEAR_MONTH or DAY_TIME) more closely. However, those duration values could be stored as plain integers and, based on the serialized Arrow schema, restored when reading back in. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6779) [Python] Conversion from datetime.datetime to timestamp('ns') can overflow
Joris Van den Bossche created ARROW-6779:

Summary: [Python] Conversion from datetime.datetime to timestamp('ns') can overflow
Key: ARROW-6779
URL: https://issues.apache.org/jira/browse/ARROW-6779
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Reporter: Joris Van den Bossche

In the Python conversion of datetime scalars, there is no check for integer overflow:

{code}
In [32]: pa.array([datetime.datetime(3000, 1, 1)], pa.timestamp('ns'))
Out[32]:
[
  1830-11-23 00:50:52.580896768
]
{code}

So when the target type has nanosecond unit, this can give wrong results (I don't think the other resolutions can overflow, given the limited range of years of datetime.datetime). We should probably check for this case and raise an error.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
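The overflow the ticket describes is easy to reproduce and guard against in plain Python: the year 3000 is about 3.2e19 nanoseconds past the epoch, beyond the int64 range (which tops out in the year 2262). A minimal sketch of the requested check follows; the helper name and error message are illustrative, not pyarrow API.

```python
import datetime

INT64_MIN, INT64_MAX = -2**63, 2**63 - 1
_EPOCH = datetime.datetime(1970, 1, 1)

def datetime_to_ns(dt):
    """Convert a naive datetime to nanoseconds since the Unix epoch,
    raising instead of silently wrapping around like int64 would."""
    delta = dt - _EPOCH
    ns = (delta.days * 86400 + delta.seconds) * 10**9 + delta.microseconds * 1000
    if not INT64_MIN <= ns <= INT64_MAX:
        raise OverflowError(f"{dt!r} does not fit in a 64-bit nanosecond timestamp")
    return ns

print(datetime_to_ns(datetime.datetime(2000, 1, 1)))  # 946684800000000000
```

The analogous fix in pyarrow's C++ conversion path would detect the wraparound at multiplication time and raise, rather than produce the bogus 1830 timestamp shown above.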
[jira] [Created] (ARROW-6778) [C++] Support DurationType in Cast kernel
Joris Van den Bossche created ARROW-6778: Summary: [C++] Support DurationType in Cast kernel Key: ARROW-6778 URL: https://issues.apache.org/jira/browse/ARROW-6778 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Joris Van den Bossche -- This message was sent by Atlassian Jira (v8.3.4#803005)
Docker organization for development images
Hi,

We've created a Docker Hub organisation called "arrowdev" to host the images defined in docker-compose.yml, see the following commit [1]. It is now possible to speed up the image builds by pulling the layers first, so I suggest using the --pull flag when building images: `docker-compose build --pull cpp`

We need to manually grant write access for committers and PMC members, so please send me your Docker Hub username.

Thanks, Krisztian

P.S. GitHub has recently introduced its packaging feature, so we'll be able to experiment with hosting Docker images on GitHub directly, which would handle our permission settings out of the box. IMO we should try it once it is enabled for the apache/arrow repository.

[1]: https://github.com/apache/arrow/commit/1165cdb85b92cefcf59ac39d35f42d168cc64517
Re: Clarifying interpretation of Buffer "length" field in Arrow protocol
On Thu, Oct 3, 2019 at 7:33 AM Antoine Pitrou wrote:
>
> Le 03/10/2019 à 14:22, Wes McKinney a écrit :
> > On Thu, Oct 3, 2019 at 4:26 AM Antoine Pitrou wrote:
> >>
> >> Yeah, I think the spec should be strict. And for convenience, I'd say
> >> it should probably be the padded length (though I don't have a strong
> >> opinion).
> >
> > The reason I'm against this is that it makes it impossible for a
> > producer to preserve the exact state of its buffers for a consumer.
> >
> > For example, if you have a 1-byte validity bitmap, and you do not have
> > the flexibility to indicate in the metadata that the length is either
> > 1 (unpadded) or 8 (padded), then the producer will only ever see 8
> > bytes.
>
> I see. Then we should mandate the non-padded length, IMHO.

I think all that needs to be said is that an unpadded size is not invalid. If a consumer is passed a buffer that is larger than it needs to be, there is no harm done. Perhaps I can tweak the language so that there is less uncertainty.

> Regards
>
> Antoine.
Re: Clarifying interpretation of Buffer "length" field in Arrow protocol
Le 03/10/2019 à 14:22, Wes McKinney a écrit :
> On Thu, Oct 3, 2019 at 4:26 AM Antoine Pitrou wrote:
>>
>> Yeah, I think the spec should be strict. And for convenience, I'd say
>> it should probably be the padded length (though I don't have a strong
>> opinion).
>
> The reason I'm against this is that it makes it impossible for a
> producer to preserve the exact state of its buffers for a consumer.
>
> For example, if you have a 1-byte validity bitmap, and you do not have
> the flexibility to indicate in the metadata that the length is either
> 1 (unpadded) or 8 (padded), then the producer will only ever see 8
> bytes.

I see. Then we should mandate the non-padded length, IMHO.

Regards

Antoine.
Re: Clarifying interpretation of Buffer "length" field in Arrow protocol
On Thu, Oct 3, 2019 at 4:26 AM Antoine Pitrou wrote:
>
> Yeah, I think the spec should be strict. And for convenience, I'd say
> it should probably be the padded length (though I don't have a strong
> opinion).

The reason I'm against this is that it makes it impossible for a producer to preserve the exact state of its buffers for a consumer.

For example, if you have a 1-byte validity bitmap, and you do not have the flexibility to indicate in the metadata that the length is either 1 (unpadded) or 8 (padded), then the producer will only ever see 8 bytes.

Note that padding is only performed in the context of the encapsulated IPC format. If the metadata is used to communicate in-memory pointers, then it is not appropriate to pad lengths if they are not already padded.

> Regards
>
> Antoine.
>
> Le 03/10/2019 à 06:23, Micah Kornfield a écrit :
> > Hi Wes,
> > It seems fine to be flexible here. However:
> >
> >> This could have implications for hashing or
> >> comparisons, for example, so I think that having the flexibility to do
> >> either is a good idea.
> >
> > This statement of use-cases makes me a little nervous. It seems like it
> > could lead to bugs if a consumer is reading from two producers that use
> > different alternatives?
> >
> > Thanks,
> > Micah
> >
> > On Mon, Sep 30, 2019 at 5:24 PM Wes McKinney wrote:
> >
> >> I just updated my pull request from May adding language to clarify
> >> what protocol writers are expected to set when producing the Arrow
> >> binary protocol
> >>
> >> https://github.com/apache/arrow/pull/4370
> >>
> >> Implementations may allocate small buffers, or use memory which does
> >> not meet the 8-byte minimal padding requirements of the Arrow
> >> protocol. It becomes a question, then, whether to set the in-memory
> >> buffer size or the padded size when producing the protocol.
> >>
> >> This PR states that either is acceptable. As an example, a 1-byte
> >> validity buffer could have Buffer metadata stating that the size
> >> is either 1 byte or 8 bytes. Either way, 7 bytes of padding must be
> >> written to conform to the protocol. The metadata, therefore, reflects
> >> the "intent" of the protocol writer for the protocol reader. If the
> >> writer says the length is 1, then the protocol reader understands that
> >> the writer does not expect the reader to concern itself with the 7
> >> bytes of padding. This could have implications for hashing or
> >> comparisons, for example, so I think that having the flexibility to do
> >> either is a good idea.
> >>
> >> For an application that wants to guarantee that AVX512 instructions
> >> can be used on all buffers on the receiver side, it would be
> >> appropriate to include 512-bit padding in the accounting.
> >>
> >> Let me know if others think differently so we can have this properly
> >> documented for the 1.0.0 Format release.
> >>
> >> Thanks,
> >> Wes
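The padding arithmetic at the heart of this thread is simple to state in code. The following is a generic sketch (not part of any Arrow implementation) of the round-up rule behind the 1-byte-vs-8-byte example:

```python
def padded_length(nbytes, alignment=8):
    """Round a buffer length up to the next multiple of the alignment."""
    return (nbytes + alignment - 1) // alignment * alignment

# A 1-byte validity bitmap may be reported in the Buffer metadata with
# length 1 (unpadded) or 8 (padded); either way the writer must emit
# padded_length(1) bytes to the IPC stream.
print(padded_length(1), padded_length(8), padded_length(9))
```

An application targeting AVX512-friendly buffers would call this with `alignment=64` (512 bits) instead, which is what "include 512-bit padding in the accounting" amounts to.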
Re: [DISCUSS] Result vs Status
Le 03/10/2019 à 06:13, Micah Kornfield a écrit :
> It was my impression that we had workable solutions for using Result in at
> least Python and Glib/Ruby (I don't know about R).

In Python we do (though it needed a C++-side helper).

Regards

Antoine.
Re: [DISCUSS] raw pointers and FFI (C-level in-process array protocol)
Hi Jacques,

Le 03/10/2019 à 02:46, Jacques Nadeau a écrit :
> I think it is reasonable to argue that keeping any ABI (or header/struct
> pattern) as narrow as possible would allow us to minimize overlap with the
> existing in-memory specification. In Arrow's case, this could be as simple
> as a single memory pointer for schema (backed by flatbuffers) and a single
> memory location for data (that references the record batch header, which in
> turn provides pointers into the actual arrow data).
[...]
> [...] (For example, in a JVM
> view of the world, working with a plain struct in java rather than a set of
> memory pointers against our existing IPC formats would be quite painful and
> we'd definitely need to create some glue code for users. I worry the same
> pattern would occur in many other languages.)

I'm trying to understand the point you're making. Here you say that it was difficult for the JVM to deal with raw pointers. But above you seem to argue for a flatbuffers-based serialization containing raw pointers.

Here's another way to frame the question: how do you propose to do zero-copy between different languages if not by passing raw pointers to the Arrow data? And if passing raw pointers is acceptable, what is wrong with the spec as proposed?

As for creating glue code: yes, of course, that would be needed in most languages that want to provide this interface (including C++). You do need a C FFI for that. I'm quite sure it would be possible to implement this proposal in pure Python with ctypes / cffi, for example (as a toy example, since PyArrow exists :-)). When writing the spec, I also took a look at the Go and Rust FFIs, and they seem good enough to interact with it. I tried to take a look at JNI, but of course I got lost in the documentation :-)

If you are worried that people start thinking that this proposal is part of the Arrow specification, perhaps we can make it clear that exposing this interface is optional for implementations.

Regards

Antoine.
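Antoine's remark that the C-level protocol could be exercised from pure Python with ctypes is easy to illustrate. The struct below is hypothetical (it is not the layout from the actual proposal); it only demonstrates the underlying idea of passing a raw pointer plus a length across an FFI boundary with zero copies:

```python
import ctypes

# Hypothetical struct exchanged across the FFI boundary: a raw data
# pointer plus a length. NOT the proposed Arrow C struct; it only shows
# that ctypes can describe and consume such a layout without copying.
class CBuffer(ctypes.Structure):
    _fields_ = [
        ("data", ctypes.c_void_p),   # raw pointer to the buffer memory
        ("length", ctypes.c_int64),  # number of valid bytes
    ]

# Producer side: back the struct with some in-process memory.
raw = (ctypes.c_uint8 * 8)(*range(8))
buf = CBuffer(ctypes.cast(raw, ctypes.c_void_p), 8)

# Consumer side: reinterpret the raw pointer, no serialization involved.
view = ctypes.cast(buf.data, ctypes.POINTER(ctypes.c_uint8))
print([view[i] for i in range(buf.length)])  # [0, 1, 2, 3, 4, 5, 6, 7]
```

The real proposal layers schema and child-array pointers on top of the same mechanism; the glue code Antoine mentions is essentially struct definitions like this one, written once per language.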
Re: Clarifying interpretation of Buffer "length" field in Arrow protocol
Yeah, I think the spec should be strict. And for convenience, I'd say it should probably be the padded length (though I don't have a strong opinion).

Regards

Antoine.

Le 03/10/2019 à 06:23, Micah Kornfield a écrit :
> Hi Wes,
> It seems fine to be flexible here. However:
>
>> This could have implications for hashing or
>> comparisons, for example, so I think that having the flexibility to do
>> either is a good idea.
>
> This statement of use-cases makes me a little nervous. It seems like it
> could lead to bugs if a consumer is reading from two producers that use
> different alternatives?
>
> Thanks,
> Micah
>
> On Mon, Sep 30, 2019 at 5:24 PM Wes McKinney wrote:
>
>> I just updated my pull request from May adding language to clarify
>> what protocol writers are expected to set when producing the Arrow
>> binary protocol
>>
>> https://github.com/apache/arrow/pull/4370
>>
>> Implementations may allocate small buffers, or use memory which does
>> not meet the 8-byte minimal padding requirements of the Arrow
>> protocol. It becomes a question, then, whether to set the in-memory
>> buffer size or the padded size when producing the protocol.
>>
>> This PR states that either is acceptable. As an example, a 1-byte
>> validity buffer could have Buffer metadata stating that the size
>> is either 1 byte or 8 bytes. Either way, 7 bytes of padding must be
>> written to conform to the protocol. The metadata, therefore, reflects
>> the "intent" of the protocol writer for the protocol reader. If the
>> writer says the length is 1, then the protocol reader understands that
>> the writer does not expect the reader to concern itself with the 7
>> bytes of padding. This could have implications for hashing or
>> comparisons, for example, so I think that having the flexibility to do
>> either is a good idea.
>>
>> For an application that wants to guarantee that AVX512 instructions
>> can be used on all buffers on the receiver side, it would be
>> appropriate to include 512-bit padding in the accounting.
>>
>> Let me know if others think differently so we can have this properly
>> documented for the 1.0.0 Format release.
>>
>> Thanks,
>> Wes