[Java] Append multiple record batches together?
Hi, A colleague opened https://issues.apache.org/jira/browse/ARROW-7048 to add functionality similar to the Python APIs that allow creating one larger data structure from a series of record batches. I just wanted to surface it here to ask: 1. Does an efficient solution already exist? It seems like the TransferPair implementations could possibly be improved upon, or have they already been optimized? 2. What would the preferred API for doing this be? Some options I can think of: * VectorSchemaRoot.concat(Collection) * VectorSchemaRoot.from(Collection) * VectorLoader.load(Collection) Thanks, Micah
[jira] [Created] (ARROW-7083) [C++] Determine the feasibility and build a prototype to replace compute/kernels with gandiva kernels
Micah Kornfield created ARROW-7083: -- Summary: [C++] Determine the feasibility and build a prototype to replace compute/kernels with gandiva kernels Key: ARROW-7083 URL: https://issues.apache.org/jira/browse/ARROW-7083 Project: Apache Arrow Issue Type: Improvement Reporter: Micah Kornfield See discussion on [https://issues.apache.org/jira/browse/ARROW-7017] Requirements: 1. No hard runtime dependency on LLVM 2. Ability to run without JIT. Open questions: 1. What dependencies does this add to the build tool chain? -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: [DISCUSS] Result vs Status
This seems reasonable to me. Given the impact of the API changes I think it might be worth keeping around for ~3 releases, but I think we are generally slow to delete deprecated APIs anyway. Any other thoughts on this? I can try to open up some tracking JIRAs for the work involved. On Wed, Oct 30, 2019 at 1:25 PM Wes McKinney wrote: > Returning to this discussion. > > Here is my position on the matter since this was brought up on the > sync call today > > * For internal / non-public and pseudo-non-public APIs that have > return/out values > - Use Result or Status at discretion of the developer, but Result > is preferable > > * For new public APIs with return/out values > - Prefer Result unless a Status-based API seems definitely less > awkward in real world use. I have to say that I'm skeptical about the > relative usability of std::tuple outputs and don't think we should > force the use of Result for technical purity reasons > > * For existing Status APIs with return values > - Incrementally add Result APIs and deprecate Status-based APIs. > Maintain deprecated Status APIs for ~2 major releases > > On Thu, Oct 24, 2019 at 5:16 PM Omer F. Ozarslan > wrote: > > > > Hi Micah, > > > > You're right. Quite possible that clang-query counted same function > > separately for each include in each file. (I was iterating each file > > separately, but providing all of them at once didn't change the result > > either.) > > > > It's cool and wrong, so not very useful apparently. :-) > > > > Best, > > Omer > > > > On Thu, Oct 24, 2019 at 4:51 PM Micah Kornfield > wrote: > > > > > > Hi Omer, > > > I think this is really cool. It is quite possible it was > underestimated (I agree about line lengths), but I think the clang query is > double counting somehow. > > > > > > For instance: > > > > > > "grep -r Status *" only returns ~9000 results in total for me. > > > > > > Similarly using grep for "FinishTyped" returns 18 results for me. 
> Searching through the log that you linked seems to return 450 (for "Status > FinishTyped"). > > > > > > It is quite possible, I'm doing something naive with grep. > > > > > > Thanks, > > > Micah > > > > > > On Thu, Oct 24, 2019 at 2:41 PM Omer F. Ozarslan > wrote: > > >> > > >> Forgot to mention most of those lines are longer than line width while > > >> out is usually (always?) last parameter, so probably that's why grep > > >> possibly underestimates their number. > > >> > > >> On Thu, Oct 24, 2019 at 4:33 PM Omer F. Ozarslan > wrote: > > >> > > > >> > Hi, > > >> > > > >> > I don't have much experience on customized clang-tidy plugins, but > > >> > this might be a good use case for such a plugin from what I read > here > > >> > and there (frankly this was a good excuse for me to have a look at > > >> > clang tooling as well). I wanted to ensure it isn't obviously > overkill > > >> > before this suggestion: Running a clang query which lists functions > > >> > returning `arrow::Status` and taking a pointer parameter named `out` > > >> > showed that there are 13947 such functions in `cpp/src/**/*.h`. [1] > > >> > > > >> > I checked logs and it seemed legitimate to me, but please check it > in > > >> > case I missed something. If that's the case, it might be tedious to > do > > >> > this work manually. > > >> > > > >> > [1]: https://gist.github.com/ozars/ecbb1b8acd4a57ba4721c1965f83f342 > > >> > (Note that the log file is shown as truncated by github after ~30k > > >> > lines) > > >> > > > >> > Best, > > >> > Omer > > >> > > > >> > > > >> > > > >> > On Wed, Oct 23, 2019 at 9:23 PM Micah Kornfield < > emkornfi...@gmail.com> wrote: > > >> > > > > >> > > OK, it sounds like people want Result (at least in some > circumstances). > > >> > > Any thoughts on migrating old APIs and what to do for new APIs > going > > >> > > forward? 
> > >> > > > > >> > > A very rough approximation [1] yields the following counts by > module: > > >> > > > > >> > > 853 arrow > > >> > > > > >> > > 17 gandiva > > >> > > > > >> > > 25 parquet > > >> > > > > >> > > 50 plasma > > >> > > > > >> > > > > >> > > > > >> > > [1] grep -r Status cpp/src/* |grep ".h:" | grep "\\*" |grep -v > Accept |sed > > >> > > s/:.*// | cut -f3 -d/ |sort > > >> > > > > >> > > > > >> > > Thanks, > > >> > > > > >> > > Micah > > >> > > > > >> > > > > >> > > > > >> > > On Sat, Oct 19, 2019 at 7:50 PM Francois Saint-Jacques < > > >> > > fsaintjacq...@gmail.com> wrote: > > >> > > > > >> > > > As mentioned, Result is an improvement for function which > returns a > > >> > > > single value, e.g. Make/Factory-like. My vote goes Result > for such > > >> > > > case. For multiple return types, we have std::tuple like Antoine > > >> > > > proposed. > > >> > > > > > >> > > > François > > >> > > > > > >> > > > On Fri, Oct 18, 2019 at 9:19 PM Antoine Pitrou < > anto...@python.org> wrote: > > >> > > > > > > >> > > > > > > >> > > > > Le 18/10/2019 à 20:58, Wes McKinney a écrit : > > >> > > > > > I'm definitely uncomfortable with the idea of
Re: Saving Binary Arrow memory objects as blobs in Cassandra
I suggest you use the IPC protocol http://arrow.apache.org/docs/python/ipc.html This protocol will be considered stable starting with the 1.0.0 release but I would guess (without making any guarantees) that blobs written with 0.15.1 will be readable in 1.0.0 and beyond. On Wed, Nov 6, 2019 at 12:22 PM Lee, David wrote: > > Is there anyway to save Arrow memory as a blob? I tried using Feather and > Parquet, but neither one supports writing complex nested structures yet. > > I tried with the following test file. > > test.jsonl: > {"a": 1, "b": "abc", "c": [1, 2], "d": {"e": true, "f": "1991-02-03"}, "g": > [{"h": 1, "i": "a"}, {"h": 2, "i": "b"}]} > {"a": 2, "b": "xyz", "c": [3, 4], "d": {"e": false, "f": "2010-01-15"}, "g": > [{"h": 3, "i": "c"}, {"h": 2, "i": "d"}]} > > code: > import pyarrow.json as json > arrow_mem = json.read_json("test.jsonl") > > Trying something out.. > > Storing Arrow Data in Cassandra for fast retrieval with primary keys. > Solr indexing the Arrow Data blob for Cassandra retrieval by primary key. > > This message may contain information that is confidential or privileged. If > you are not the intended recipient, please advise the sender immediately and > delete this message. See > http://www.blackrock.com/corporate/compliance/email-disclaimers for further > information. Please refer to > http://www.blackrock.com/corporate/compliance/privacy-policy for more > information about BlackRock’s Privacy Policy. > For a list of BlackRock's office addresses worldwide, see > http://www.blackrock.com/corporate/about-us/contacts-locations. > > © 2019 BlackRock, Inc. All rights reserved.
[jira] [Created] (ARROW-7082) [Packaging][deb] Add apache-arrow-archive-keyring
Kouhei Sutou created ARROW-7082: --- Summary: [Packaging][deb] Add apache-arrow-archive-keyring Key: ARROW-7082 URL: https://issues.apache.org/jira/browse/ARROW-7082 Project: Apache Arrow Issue Type: Improvement Components: Packaging Reporter: Kouhei Sutou Assignee: Kouhei Sutou -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7081) [R] Add methods for introspecting parquet files
Ben Kietzman created ARROW-7081: --- Summary: [R] Add methods for introspecting parquet files Key: ARROW-7081 URL: https://issues.apache.org/jira/browse/ARROW-7081 Project: Apache Arrow Issue Type: Improvement Components: R Affects Versions: 0.15.1 Reporter: Ben Kietzman Assignee: Neal Richardson Fix For: 1.0.0 Parquet files are very opaque, and it'd be handy to have an easy way to introspect them. Functions exist for loading them as a table, but information about row group level metadata and data page compression is hidden. Ideally, every structure from https://github.com/apache/parquet-format/#file-format could be examined in this fashion. -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: Achieving parity with Java extension types in Python
Hi. I'm looking into this issue and I have some questions as someone new to the project. The comment from Joris earlier in the thread suggests that the solution here is to create an Array subclass for each extension type that wants to use one. This will give a nice symmetry w.r.t. the Java interface, but in the Python case, this seems to suggest having to travel some fairly byzantine code paths (rather quickly, we end up in C++ code, where I lose the thread of what's happening—specifically as regards `pyarrow_wrap_array`, as suggested in ARROW-6176). I came up with a quick-and-dirty method wherein the ExtensionType subclass simply provides a method to translate from the storage type to the output type, and ExtensionArray has a __getitem__ implementation that passes the element from storage through the translation function. This doesn't feel outside of the realm of what is often acceptable in the python world, but it isn't nearly as typeful as Arrow seems to be leaning. Plus, this feels very far from what was intended in the issue, and I believe that I'm not understanding the underlying design principles. Can I get a bit of advice on this? Thanks. -J On Tue, Oct 29, 2019 at 12:26 PM Justin Polchlopek wrote: > That sounds about right. We're doing some work here that might require > this feature sooner than later, and if we decide to go the route that needs > this improved support, I'd be happy to make this PR. Thanks for showing > that issue. I'll be sure to tag any contribution with that ticket number. > > On Tue, Oct 29, 2019 at 9:01 AM Joris Van den Bossche < > jorisvandenboss...@gmail.com> wrote: > >> >> On Mon, 28 Oct 2019 at 22:41, Wes McKinney wrote: >> >>> Adding dev@ >>> >>> I don't believe we have APIs yet for plugging in user-defined Array >>> subtypes. 
I assume you've read >>> >>> >>> http://arrow.apache.org/docs/python/extending_types.html#defining-extension-types-user-defined-types >>> >>> There may be some JIRA issues already about this (defining subclasses >>> of pa.Array with custom behavior) -- since Joris has been working on >>> this I'm interested in more comments >>> >> >> Yes, there is https://issues.apache.org/jira/browse/ARROW-6176 for >> exactly this issue. >> What I proposed there is to allow one to subclass pyarrow.ExtensionArray >> and to attach this to an attribute on the custom ExtensionType (eg >> __arrow_ext_array_class__ in line with the other __arrow_ext_.. >> methods). That should allow to achieve similar functionality as what is >> available in Java I think. >> >> If that seems a good way to do this, I think we certainly welcome a PR >> for that (I can also look into it otherwise before 1.0). >> >> Joris >> >> >>> >>> On Mon, Oct 28, 2019 at 3:56 PM Justin Polchlopek >>> wrote: >>> > >>> > Hi! >>> > >>> > I've been working through understanding extension types in Arrow. >>> It's a great feature, and I've had no problems getting things working in >>> Java/Scala; however, Python has been a bit of a different story. Not that >>> I am unable to create and register extension types in Python, but rather >>> that I can't seem to recreate the functionality provided by the Java API's >>> ExtensionTypeVector class. >>> > >>> > In Java, ExtensionType::getNewVector() provides a clear pathway from >>> the registered type to output a vector in something other than the >>> underlying vector type, and I am at a loss for how to get this same >>> functionality in Python. Am I missing something? >>> > >>> > Thanks for any hints. >>> > -Justin >>> >>
[jira] [Created] (ARROW-7080) [Python][Parquet] Expose parquet field_id in Schema objects
Ted Gooch created ARROW-7080: Summary: [Python][Parquet] Expose parquet field_id in Schema objects Key: ARROW-7080 URL: https://issues.apache.org/jira/browse/ARROW-7080 Project: Apache Arrow Issue Type: New Feature Components: Python Reporter: Ted Gooch I'm in the process of adding parquet read support to Iceberg([https://iceberg.apache.org/]), and we use the parquet field_ids as a consistent id when reading a parquet file to create a map between the current schema and the schema of the file being read. Unless I've missed something, it appears that field_id is not exposed in the python APIs in pyarrow._parquet.ParquetSchema nor is it available in pyarrow.lib.Schema. Would it be possible to add this to either of those two objects? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7079) [C++][Dataset] Implement ScalarAsStatisctics for non-primitive types
Francois Saint-Jacques created ARROW-7079: - Summary: [C++][Dataset] Implement ScalarAsStatisctics for non-primitive types Key: ARROW-7079 URL: https://issues.apache.org/jira/browse/ARROW-7079 Project: Apache Arrow Issue Type: Bug Components: C++ - Dataset Reporter: Francois Saint-Jacques Statistics are not extracted for the following (parquet) types - BYTE_ARRAY - FLBA - Any logical timestamps/dates -- This message was sent by Atlassian Jira (v8.3.4#803005)
Saving Binary Arrow memory objects as blobs in Cassandra
Is there any way to save Arrow memory as a blob? I tried using Feather and Parquet, but neither one supports writing complex nested structures yet. I tried with the following test file. test.jsonl: {"a": 1, "b": "abc", "c": [1, 2], "d": {"e": true, "f": "1991-02-03"}, "g": [{"h": 1, "i": "a"}, {"h": 2, "i": "b"}]} {"a": 2, "b": "xyz", "c": [3, 4], "d": {"e": false, "f": "2010-01-15"}, "g": [{"h": 3, "i": "c"}, {"h": 2, "i": "d"}]} code: import pyarrow.json as json arrow_mem = json.read_json("test.jsonl") Trying something out.. Storing Arrow Data in Cassandra for fast retrieval with primary keys. Solr indexing the Arrow Data blob for Cassandra retrieval by primary key. This message may contain information that is confidential or privileged. If you are not the intended recipient, please advise the sender immediately and delete this message. See http://www.blackrock.com/corporate/compliance/email-disclaimers for further information. Please refer to http://www.blackrock.com/corporate/compliance/privacy-policy for more information about BlackRock’s Privacy Policy. For a list of BlackRock's office addresses worldwide, see http://www.blackrock.com/corporate/about-us/contacts-locations. © 2019 BlackRock, Inc. All rights reserved.
Re: [DISCUSS] Dictionary Encoding Clarifications/Future Proofing
Just bumping this thread for more comments On Wed, Oct 30, 2019 at 3:11 PM Wes McKinney wrote: > > Returning to this discussion as there seems to be a lack of consensus in the vote > thread > > Copying Micah's proposals in the VOTE thread here, I wanted to state > my opinions so we can discuss further and see where there is potential > disagreement > > 1. It is not required that all dictionary batches occur at the beginning > of the IPC stream format (if the first record batch has an all-null > dictionary encoded column, the null column's dictionary might not be sent > until later in the stream). > > This seems preferable to requiring a placeholder empty dictionary > batch. This does mean more to test but the integration tests will > force the issue > > 2. A second dictionary batch for the same ID that is not a "delta batch" > in an IPC stream indicates the dictionary should be replaced. > > Agree. > > 3. Clarifies that the file format can only contain 1 "NON-delta" > dictionary batch and multiple "delta" dictionary batches. > > Agree -- it is also worth stating explicitly that dictionary > replacements are not allowed in the file format. > > In the file format, all the dictionaries must be "loaded" up front. > The code path for loading the dictionaries ideally should use nearly > the same code as the stream-reader code that sees follow-up dictionary > batches interspersed in the stream. The only downside is that it will > not be possible to exactly preserve the dictionary "state" as of each > record batch being written. > > So if we had a file containing > > DICTIONARY ID=0 > RECORD BATCH > RECORD BATCH > DICTIONARY DELTA ID=0 > RECORD BATCH > RECORD BATCH > > Then after processing/loading the dictionaries, the first two record > batches will have a dictionary that is "larger" (on account of the > delta) than when they were written. Since dictionaries are > fundamentally about data representation, they still represent the same > data so I think this is acceptable. 
> > 4. Add an enum to dictionary metadata for possible future changes in what > format dictionary batches can be sent. (the most likely would be an array > Map). An enum is needed as a placeholder to allow for forward > compatibility past the 1.0.0 release. > > I'm least sure about this but I do not think it is harmful to have a > forward-compatible "escape hatch" for future evolutions in dictionary > encoding. > > On Wed, Oct 16, 2019 at 2:57 AM Micah Kornfield wrote: > > > > I'll plan on starting a vote in the next day or two if there are no further > > objections/comments. > > > > On Sun, Oct 13, 2019 at 11:06 AM Micah Kornfield > > wrote: > > > > > I think the only point asked on the PR that I think is worth discussing is > > > assumptions about dictionaries at the beginning of streams. > > > > > > There are two options: > > > 1. Based on the current wording, it does not seem that all dictionaries > > > need to be at the beginning of the stream if they aren't made use of in > > > the > > > first record batch (i.e. a dictionary encoded column is all null in the > > > first record batch). > > > 2. We require a dictionary batch for each dictionary at the beginning of > > > the stream (and require implementations to send an empty batch if they > > > don't have the dictionary available). > > > > > > The current proposal in the PR is option #1. > > > > > > Thanks, > > > Micah > > > > > > On Sat, Oct 5, 2019 at 4:01 PM Micah Kornfield > > > wrote: > > > > > >> I've opened a pull request [1] to clarify some recent conversations about > > >> semantics/edge cases for dictionary encoding [2][3] around interleaved > > >> batches and when isDelta=False. > > >> > > >> Specifically, it proposes isDelta=False indicates dictionary > > >> replacement. For the file format, only one isDelta=False batch is > > >> allowed > > >> per file and isDelta=true batches are applied in the order supplied in the file > > >> footer. 
> > >> > > >> In addition, I've added a new enum to DictionaryEncoding to preserve > > >> future compatibility in case we want to expand dictionary encoding to be > > >> an > > >> explicit mapping from "ID" to "VALUE" as discussed in [4]. > > >> > > >> Once people have had a chance to review and come to a consensus, I will > > >> call a formal vote to approve and commit the change. > > >> > > >> Thanks, > > >> Micah > > >> > > >> [1] https://github.com/apache/arrow/pull/5585 > > >> [2] > > >> https://lists.apache.org/thread.html/9734b71bc12aca16eb997388e95105bff412fdaefa4e19422f477389@%3Cdev.arrow.apache.org%3E > > >> [3] > > >> https://lists.apache.org/thread.html/5c3c9346101df8d758e24664638e8ada0211d310ab756a89cde3786a@%3Cdev.arrow.apache.org%3E > > >> [4] > > >> https://lists.apache.org/thread.html/15a4810589b2eb772bce5b2372970d9d93badbd28999a1bbe2af418a@%3Cdev.arrow.apache.org%3E > > >> > > >>
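As a plain-Python toy model (no Arrow involved) of the semantics being proposed, where isDelta=False replaces the dictionary for an id and isDelta=True appends to it:

```python
def apply_dictionary_batch(dictionaries, dict_id, values, is_delta):
    """Update per-id dictionary state the way a stream reader would."""
    if is_delta:
        # Delta batch: append new entries to the existing dictionary.
        dictionaries.setdefault(dict_id, []).extend(values)
    else:
        # Non-delta batch: replace the dictionary outright.
        dictionaries[dict_id] = list(values)

state = {}
apply_dictionary_batch(state, 0, ["a", "b"], is_delta=False)
apply_dictionary_batch(state, 0, ["c"], is_delta=True)
assert state[0] == ["a", "b", "c"]   # delta extended the dictionary
apply_dictionary_batch(state, 0, ["x"], is_delta=False)
assert state[0] == ["x"]             # replacement (disallowed in the file format)
```

In the file format, per proposal #3, only the first (non-delta) batch plus deltas would be legal for a given id, so the replacement step above could only occur in a stream.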
[jira] [Created] (ARROW-7078) [Developer] Add Windows utility script to use Dependencies.exe to dump DLL dependencies for diagnostic purposes
Wes McKinney created ARROW-7078: --- Summary: [Developer] Add Windows utility script to use Dependencies.exe to dump DLL dependencies for diagnostic purposes Key: ARROW-7078 URL: https://issues.apache.org/jira/browse/ARROW-7078 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration, Developer Tools Reporter: Wes McKinney See https://lucasg.github.io/2018/04/29/Dependencies-command-line/ This would help us diagnose DLL load issues -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7077) [C++] Unsupported Dict->T cast crashes instead of returning error
Antoine Pitrou created ARROW-7077: - Summary: [C++] Unsupported Dict->T cast crashes instead of returning error Key: ARROW-7077 URL: https://issues.apache.org/jira/browse/ARROW-7077 Project: Apache Arrow Issue Type: Bug Components: C++, C++ - Compute Affects Versions: 0.15.1 Reporter: Antoine Pitrou -- This message was sent by Atlassian Jira (v8.3.4#803005)
[NIGHTLY] Arrow Build Report for Job nightly-2019-11-06-0
Arrow Build Report for Job nightly-2019-11-06-0 All tasks: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0 Failed Tasks: - gandiva-jar-osx: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-travis-gandiva-jar-osx - gandiva-jar-trusty: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-travis-gandiva-jar-trusty Succeeded Tasks: - centos-6: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-azure-centos-6 - centos-7: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-azure-centos-7 - centos-8: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-azure-centos-8 - conda-linux-gcc-py27: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-azure-conda-linux-gcc-py27 - conda-linux-gcc-py36: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-azure-conda-linux-gcc-py36 - conda-linux-gcc-py37: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-azure-conda-linux-gcc-py37 - conda-osx-clang-py27: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-azure-conda-osx-clang-py27 - conda-osx-clang-py36: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-azure-conda-osx-clang-py36 - conda-osx-clang-py37: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-azure-conda-osx-clang-py37 - conda-win-vs2015-py36: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-azure-conda-win-vs2015-py36 - conda-win-vs2015-py37: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-azure-conda-win-vs2015-py37 - debian-buster: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-azure-debian-buster - debian-stretch: URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-azure-debian-stretch - docker-c_glib: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-circle-docker-c_glib - docker-cpp-cmake32: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-circle-docker-cpp-cmake32 - docker-cpp-release: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-circle-docker-cpp-release - docker-cpp-static-only: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-circle-docker-cpp-static-only - docker-cpp: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-circle-docker-cpp - docker-dask-integration: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-circle-docker-dask-integration - docker-docs: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-circle-docker-docs - docker-go: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-circle-docker-go - docker-hdfs-integration: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-circle-docker-hdfs-integration - docker-iwyu: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-circle-docker-iwyu - docker-java: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-circle-docker-java - docker-js: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-circle-docker-js - docker-lint: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-circle-docker-lint - docker-pandas-master: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-circle-docker-pandas-master - docker-python-2.7-nopandas: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-circle-docker-python-2.7-nopandas - 
docker-python-2.7: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-circle-docker-python-2.7 - docker-python-3.6-nopandas: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-circle-docker-python-3.6-nopandas - docker-python-3.6: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-circle-docker-python-3.6 - docker-python-3.7: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-circle-docker-python-3.7 - docker-r-conda: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-circle-docker-r-conda - docker-r: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-circle-docker-r - docker-rust: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-circle-docker-rust - docker-spark-integration:
[jira] [Created] (ARROW-7076) `pip install pyarrow` with python 3.8 fail with message : Could not build wheels for pyarrow which use PEP 517 and cannot be installed directly
Fabien created ARROW-7076: - Summary: `pip install pyarrow` with python 3.8 fail with message : Could not build wheels for pyarrow which use PEP 517 and cannot be installed directly Key: ARROW-7076 URL: https://issues.apache.org/jira/browse/ARROW-7076 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.15.1 Environment: Ubuntu 19.10 / Python 3.8.0 Reporter: Fabien When I install pyarrow in python 3.7.5 with `pip install pyarrow` it works. However with python 3.8.0 it fails with the following error : {noformat} 14:06 $ pip install pyarrow Collecting pyarrow Using cached https://files.pythonhosted.org/packages/e0/e6/d14b4a2b54ef065b1a2c576537abe805c1af0c94caef70d365e2d78fc528/pyarrow-0.15.1.tar.gz Installing build dependencies ... done Getting requirements to build wheel ... done Preparing wheel metadata ... done Collecting numpy>=1.14 Using cached https://files.pythonhosted.org/packages/3a/8f/f9ee25c0ae608f86180c26a1e35fe7ea9d71b473ea7f54db20759ba2745e/numpy-1.17.3-cp38-cp38-manylinux1_x86_64.whl Collecting six>=1.0.0 Using cached https://files.pythonhosted.org/packages/65/26/32b8464df2a97e6dd1b656ed26b2c194606c16fe163c695a992b36c11cdf/six-1.13.0-py2.py3-none-any.whl Building wheels for collected packages: pyarrow Building wheel for pyarrow (PEP 517) ... 
error ERROR: Command errored out with exit status 1: command: /home/fabien/.local/share/virtualenvs/pipenv-_eZlsrLD/bin/python3.8 /home/fabien/.local/share/virtualenvs/pipenv-_eZlsrLD/lib/python3.8/site-packages/pip/_vendor/pep517/_in_process.py build_wheel /tmp/tmp4gpyu82j cwd: /tmp/pip-install-cj5ucedq/pyarrow Complete output (490 lines): running bdist_wheel running build running build_py creating build creating build/lib.linux-x86_64-3.8 creating build/lib.linux-x86_64-3.8/pyarrow copying pyarrow/flight.py -> build/lib.linux-x86_64-3.8/pyarrow copying pyarrow/orc.py -> build/lib.linux-x86_64-3.8/pyarrow copying pyarrow/jvm.py -> build/lib.linux-x86_64-3.8/pyarrow copying pyarrow/util.py -> build/lib.linux-x86_64-3.8/pyarrow copying pyarrow/pandas_compat.py -> build/lib.linux-x86_64-3.8/pyarrow copying pyarrow/cuda.py -> build/lib.linux-x86_64-3.8/pyarrow copying pyarrow/filesystem.py -> build/lib.linux-x86_64-3.8/pyarrow copying pyarrow/json.py -> build/lib.linux-x86_64-3.8/pyarrow copying pyarrow/feather.py -> build/lib.linux-x86_64-3.8/pyarrow copying pyarrow/serialization.py -> build/lib.linux-x86_64-3.8/pyarrow copying pyarrow/ipc.py -> build/lib.linux-x86_64-3.8/pyarrow copying pyarrow/parquet.py -> build/lib.linux-x86_64-3.8/pyarrow copying pyarrow/_generated_version.py -> build/lib.linux-x86_64-3.8/pyarrow copying pyarrow/benchmark.py -> build/lib.linux-x86_64-3.8/pyarrow copying pyarrow/types.py -> build/lib.linux-x86_64-3.8/pyarrow copying pyarrow/hdfs.py -> build/lib.linux-x86_64-3.8/pyarrow copying pyarrow/fs.py -> build/lib.linux-x86_64-3.8/pyarrow copying pyarrow/plasma.py -> build/lib.linux-x86_64-3.8/pyarrow copying pyarrow/csv.py -> build/lib.linux-x86_64-3.8/pyarrow copying pyarrow/compat.py -> build/lib.linux-x86_64-3.8/pyarrow copying pyarrow/__init__.py -> build/lib.linux-x86_64-3.8/pyarrow creating build/lib.linux-x86_64-3.8/pyarrow/tests copying pyarrow/tests/test_strategies.py -> build/lib.linux-x86_64-3.8/pyarrow/tests copying 
pyarrow/tests/test_array.py -> build/lib.linux-x86_64-3.8/pyarrow/tests copying pyarrow/tests/test_tensor.py -> build/lib.linux-x86_64-3.8/pyarrow/tests copying pyarrow/tests/test_json.py -> build/lib.linux-x86_64-3.8/pyarrow/tests copying pyarrow/tests/test_cython.py -> build/lib.linux-x86_64-3.8/pyarrow/tests copying pyarrow/tests/test_deprecations.py -> build/lib.linux-x86_64-3.8/pyarrow/tests copying pyarrow/tests/conftest.py -> build/lib.linux-x86_64-3.8/pyarrow/tests copying pyarrow/tests/test_memory.py -> build/lib.linux-x86_64-3.8/pyarrow/tests copying pyarrow/tests/test_io.py -> build/lib.linux-x86_64-3.8/pyarrow/tests copying pyarrow/tests/pandas_examples.py -> build/lib.linux-x86_64-3.8/pyarrow/tests copying pyarrow/tests/test_compute.py -> build/lib.linux-x86_64-3.8/pyarrow/tests copying pyarrow/tests/util.py -> build/lib.linux-x86_64-3.8/pyarrow/tests copying pyarrow/tests/test_cuda_numba_interop.py -> build/lib.linux-x86_64-3.8/pyarrow/tests copying pyarrow/tests/test_pandas.py -> build/lib.linux-x86_64-3.8/pyarrow/tests copying pyarrow/tests/test_sparse_tensor.py -> build/lib.linux-x86_64-3.8/pyarrow/tests copying pyarrow/tests/test_fs.py -> build/lib.linux-x86_64-3.8/pyarrow/tests copying pyarrow/tests/test_schema.py -> build/lib.linux-x86_64-3.8/pyarrow/tests copying pyarrow/tests/test_extension_type.py -> build/lib.linux-x86_64-3.8/pyarrow/tests copying pyarrow/tests/test_hdfs.py -> build/lib.linux-x86_64-3.8/pyarrow/tests copying
[jira] [Created] (ARROW-7074) [C++] ASSERT_OK_AND_ASSIGN crashes when failing
Antoine Pitrou created ARROW-7074: - Summary: [C++] ASSERT_OK_AND_ASSIGN crashes when failing Key: ARROW-7074 URL: https://issues.apache.org/jira/browse/ARROW-7074 Project: Apache Arrow Issue Type: Bug Components: C++, Developer Tools Affects Versions: 0.15.1 Reporter: Antoine Pitrou Instead of simply failing the test, the {{ASSERT_OK_AND_ASSIGN}} macro crashes when the operation failed, e.g.: {code} Value of: _st.ok() Actual: false Expected: true WARNING: Logging before InitGoogleLogging() is written to STDERR F1106 12:53:32.882110 4698 result.cc:28] ValueOrDie called on an error: XXX {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7073) [Java] Support concating vectors values in batch
Liya Fan created ARROW-7073: --- Summary: [Java] Support concating vectors values in batch Key: ARROW-7073 URL: https://issues.apache.org/jira/browse/ARROW-7073 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Liya Fan Assignee: Liya Fan We need a way to copy vector values in batch. Currently, we have copyFrom and copyFromSafe APIs. However, they are not enough, as copying values individually is not performant. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7072) [Java] Support concating validity bits efficiently
Liya Fan created ARROW-7072: --- Summary: [Java] Support concating validity bits efficiently Key: ARROW-7072 URL: https://issues.apache.org/jira/browse/ARROW-7072 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Liya Fan Assignee: Liya Fan For scenarios where we need to concatenate vectors (like the scenario in ARROW-7048, and delta dictionaries), we need a way to concatenate validity bits. Currently, we have a bit-level API to read/write individual validity bits. However, it is not efficient, and we need a way to copy more bits at a time. -- This message was sent by Atlassian Jira (v8.3.4#803005)
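To illustrate the kind of operation being requested, here is a pure-Python sketch (not the proposed Java API) of concatenating two little-endian validity bitmaps; a real implementation would do the same shifting on 64-bit words rather than one bit at a time:

```python
def concat_validity(bits_a, len_a, bits_b, len_b):
    """Concatenate two little-endian validity bitmaps (as bytes),
    given their logical lengths in bits."""
    # Python ints let us shift/mask many bits in one operation.
    a = int.from_bytes(bits_a, "little") & ((1 << len_a) - 1)
    b = int.from_bytes(bits_b, "little") & ((1 << len_b) - 1)
    combined = a | (b << len_a)
    return combined.to_bytes((len_a + len_b + 7) // 8, "little")

# 0b101 (3 bits) followed by 0b11 (2 bits) -> 0b11101
assert concat_validity(b"\x05", 3, b"\x03", 2) == b"\x1d"
```

The interesting case is exactly the one handled above: when `len_a` is not a multiple of 8, the second bitmap must be shifted across byte boundaries, which is why per-bit copying via the existing API is slow.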