[Java] Append multiple record batches together?

2019-11-06 Thread Micah Kornfield
Hi,
A colleague opened up https://issues.apache.org/jira/browse/ARROW-7048 to ask
for functionality similar to the Python APIs that allow creating one larger
data structure from a series of record batches (see the sketch below).  I just
wanted to surface it here to ask:
1.  Does an efficient solution already exist? It seems like the TransferPair
implementations could possibly be improved upon, or have they already been
optimized?
2.  What would the preferred API for doing this be?  Some options I can
think of:

* VectorSchemaRoot.concat(Collection)
* VectorSchemaRoot.from(Collection)
* VectorLoader.load(Collection)
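
For reference, the Python-side behavior I mean looks roughly like this (a
minimal pyarrow sketch using Table.from_batches / combine_chunks):

import pyarrow as pa

batch = pa.RecordBatch.from_arrays([pa.array([1, 2, 3])], names=["x"])

# Stitch several batches into one logical table (columns stay chunked)...
table = pa.Table.from_batches([batch, batch])

# ...then optionally collapse the chunks into one contiguous batch per column.
combined = table.combine_chunks()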

Thanks,
Micah


[jira] [Created] (ARROW-7083) [C++] Determine the feasibility and build a prototype to replace compute/kernels with gandiva kernels

2019-11-06 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-7083:
--

 Summary: [C++] Determine the feasibility and build a prototype to 
replace compute/kernels with gandiva kernels
 Key: ARROW-7083
 URL: https://issues.apache.org/jira/browse/ARROW-7083
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Micah Kornfield


See discussion on [https://issues.apache.org/jira/browse/ARROW-7017]

 

Requirements:

1.  No hard runtime dependency on LLVM

2.  Ability to run without JIT.

 

Open questions:

1.  What dependencies does this add to the build tool chain?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [DISCUSS] Result vs Status

2019-11-06 Thread Micah Kornfield
This seems reasonable to me.  Given the impact of the API changes, I think it
might be worth keeping the deprecated APIs around for ~3 releases, but we are
generally slow to delete deprecated APIs anyway.

Any other thoughts on this?  I can try to open up some tracking JIRAs for
the work involved.

On Wed, Oct 30, 2019 at 1:25 PM Wes McKinney  wrote:

> Returning to this discussion.
>
> Here is my position on the matter since this was brought up on the
> sync call today
>
> * For internal / non-public and pseudo-non-public APIs that have
> return/out values
>   - Use Result or Status at discretion of the developer, but Result
> is preferable
>
> * For new public APIs with return/out values
>   - Prefer Result unless a Status-based API seems definitely less
> awkward in real world use. I have to say that I'm skeptical about the
> relative usability of std::tuple outputs and don't think we should
> force the use of Result for technical purity reasons
>
> * For existing Status APIs with return values
>   - Incrementally add Result APIs and deprecate Status-based APIs.
> Maintain deprecated Status APIs for ~2 major releases
>
> On Thu, Oct 24, 2019 at 5:16 PM Omer F. Ozarslan 
> wrote:
> >
> > Hi Micah,
> >
> > You're right. It's quite possible that clang-query counted the same function
> > separately for each include in each file. (I was iterating over each file
> > separately, but providing all of them at once didn't change the result
> > either.)
> >
> > It's cool and wrong, so not very useful apparently. :-)
> >
> > Best,
> > Omer
> >
> > On Thu, Oct 24, 2019 at 4:51 PM Micah Kornfield 
> wrote:
> > >
> > > Hi Omer,
> > > I think this is really cool.  It is quite possible it was
> underestimated (I agree about line lengths), but I think the clang query is
> double counting somehow.
> > >
> > > For instance:
> > >
> > > "grep -r Status *" only returns ~9000 results in total for me.
> > >
> > > Similarly using grep for "FinishTyped" returns 18 results for me.
> Searching through the log that you linked seems to return 450 (for "Status
> FinishTyped").
> > >
> > > It is quite possible I'm doing something naive with grep.
> > >
> > > Thanks,
> > > Micah
> > >
> > > On Thu, Oct 24, 2019 at 2:41 PM Omer F. Ozarslan 
> wrote:
> > >>
> > >> Forgot to mention that most of those lines are longer than the line width,
> > >> while out is usually (always?) the last parameter, so that's probably why
> > >> grep underestimates their number.
> > >>
> > >> On Thu, Oct 24, 2019 at 4:33 PM Omer F. Ozarslan 
> wrote:
> > >> >
> > >> > Hi,
> > >> >
> > >> > I don't have much experience on customized clang-tidy plugins, but
> > >> > this might be a good use case for such a plugin from what I read
> here
> > >> > and there (frankly this was a good excuse for me to have a look at
> > >> > clang tooling as well). I wanted to ensure it isn't obviously
> overkill
> > >> > before this suggestion: Running a clang query which lists functions
> > >> > returning `arrow::Status` and taking a pointer parameter named `out`
> > >> > showed that there are 13947 such functions in `cpp/src/**/*.h`. [1]
> > >> >
> > >> > I checked logs and it seemed legitimate to me, but please check it
> in
> > >> > case I missed something. If that's the case, it might be tedious to
> do
> > >> > this work manually.
> > >> >
> > >> > [1]: https://gist.github.com/ozars/ecbb1b8acd4a57ba4721c1965f83f342
> > >> > (Note that the log file is shown as truncated by github after ~30k
> > >> > lines)
> > >> >
> > >> > Best,
> > >> > Omer
> > >> >
> > >> >
> > >> >
> > >> > On Wed, Oct 23, 2019 at 9:23 PM Micah Kornfield <
> emkornfi...@gmail.com> wrote:
> > >> > >
> > >> > > OK, it sounds like people want Result (at least in some
> circumstances).
> > >> > > Any thoughts on migrating old APIs and what to do for new APIs
> going
> > >> > > forward?
> > >> > >
> > >> > > A very rough approximation [1] yields the following counts by
> module:
> > >> > >
> > >> > >  853 arrow
> > >> > >
> > >> > >   17 gandiva
> > >> > >
> > >> > >   25 parquet
> > >> > >
> > >> > >   50 plasma
> > >> > >
> > >> > >
> > >> > >
> > >> > > [1] grep -r Status cpp/src/* |grep ".h:" | grep "\\*" |grep -v
> Accept |sed
> > >> > > s/:.*// | cut -f3 -d/ |sort
> > >> > >
> > >> > >
> > >> > > Thanks,
> > >> > >
> > >> > > Micah
> > >> > >
> > >> > >
> > >> > >
> > >> > > On Sat, Oct 19, 2019 at 7:50 PM Francois Saint-Jacques <
> > >> > > fsaintjacq...@gmail.com> wrote:
> > >> > >
> > >> > > > As mentioned, Result is an improvement for functions which return a
> > >> > > > single value, e.g. Make/Factory-like ones. My vote goes to Result for
> > >> > > > such cases. For multiple return values, we have std::tuple as Antoine
> > >> > > > proposed.
> > >> > > >
> > >> > > > François
> > >> > > >
> > >> > > > On Fri, Oct 18, 2019 at 9:19 PM Antoine Pitrou <
> anto...@python.org> wrote:
> > >> > > > >
> > >> > > > >
> > >> > > > > Le 18/10/2019 à 20:58, Wes McKinney a écrit :
> > >> > > > > > I'm definitely uncomfortable with the idea of 

Re: Saving Binary Arrow memory objects as blobs in Cassandra

2019-11-06 Thread Wes McKinney
I suggest you use the IPC protocol

http://arrow.apache.org/docs/python/ipc.html

This protocol will be considered stable starting with the 1.0.0
release but I would guess (without making any guarantees) that blobs
written with 0.15.1 will be readable in 1.0.0 and beyond.
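
For example, a minimal sketch of round-tripping a table through the IPC
stream format as a single in-memory blob (pyarrow, roughly 0.15-era APIs):

import pyarrow as pa

table = pa.Table.from_pydict({"a": [1, 2], "b": ["abc", "xyz"]})

# Write the table to an Arrow IPC stream held in an in-memory buffer.
sink = pa.BufferOutputStream()
writer = pa.RecordBatchStreamWriter(sink, table.schema)
writer.write_table(table)
writer.close()
blob = sink.getvalue().to_pybytes()   # bytes suitable for a Cassandra blob column

# Later: read the table back from the blob.
restored = pa.ipc.open_stream(blob).read_all()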

On Wed, Nov 6, 2019 at 12:22 PM Lee, David  wrote:
>
> Is there any way to save Arrow memory as a blob? I tried using Feather and 
> Parquet, but neither one supports writing complex nested structures yet.
>
> I tried with the following test file.
>
> test.jsonl:
> {"a": 1, "b": "abc", "c": [1, 2], "d": {"e": true, "f": "1991-02-03"}, "g": 
> [{"h": 1, "i": "a"}, {"h": 2, "i": "b"}]}
> {"a": 2, "b": "xyz", "c": [3, 4], "d": {"e": false, "f": "2010-01-15"}, "g": 
> [{"h": 3, "i": "c"}, {"h": 2, "i": "d"}]}
>
> code:
> import pyarrow.json as json
> arrow_mem = json.read_json("test.jsonl")
>
> Trying something out..
>
> Storing Arrow Data in Cassandra for fast retrieval with primary keys.
> Solr indexing the Arrow Data blob for Cassandra retrieval by primary key.
>


[jira] [Created] (ARROW-7082) [Packaging][deb] Add apache-arrow-archive-keyring

2019-11-06 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-7082:
---

 Summary: [Packaging][deb] Add apache-arrow-archive-keyring
 Key: ARROW-7082
 URL: https://issues.apache.org/jira/browse/ARROW-7082
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7081) [R] Add methods for introspecting parquet files

2019-11-06 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-7081:
---

 Summary: [R] Add methods for introspecting parquet files
 Key: ARROW-7081
 URL: https://issues.apache.org/jira/browse/ARROW-7081
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Affects Versions: 0.15.1
Reporter: Ben Kietzman
Assignee: Neal Richardson
 Fix For: 1.0.0


Parquet files are very opaque, and it'd be handy to have an easy way to 
introspect them. Functions exist for loading them as a table, but information 
about row group level metadata and data page compression is hidden. Ideally, 
every structure from https://github.com/apache/parquet-format/#file-format 
could be examined in this fashion.
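
For comparison, pyarrow already exposes some of this row group and column chunk
metadata; a rough sketch of the kind of introspection requested here (Python,
for illustration only):

{code}
import pyarrow.parquet as pq

pf = pq.ParquetFile("example.parquet")   # hypothetical file
md = pf.metadata                         # file-level metadata
rg = md.row_group(0)                     # row group metadata
col = rg.column(0)                       # column chunk: compression, statistics, ...
print(col.compression, col.statistics)
{code}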



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Achieving parity with Java extension types in Python

2019-11-06 Thread Justin Polchlopek
Hi.  I'm looking into this issue and I have some questions as someone new
to the project.  The comment from Joris earlier in the thread suggests that
the solution here is to create an Array subclass for each extension type
that wants to use one.  This will give a nice symmetry w.r.t. the Java
interface, but in the Python case, this seems to suggest having to travel
some fairly byzantine code paths (rather quickly, we end up in C++ code,
where I lose the thread of what's happening—specifically as regards
`pyarrow_wrap_array`, as suggested in ARROW-6176).
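
For concreteness, my reading of the ARROW-6176 proposal is something like the
following (a sketch only; __arrow_ext_array_class__ does not exist in pyarrow
yet, the name comes from the JIRA discussion, and UuidType/UuidArray are
hypothetical):

import pyarrow as pa

class UuidArray(pa.ExtensionArray):
    # Custom, user-facing behavior lives on the Array subclass,
    # mirroring Java's ExtensionTypeVector.
    def as_hex(self):
        return [v.as_py().hex() for v in self.storage]

class UuidType(pa.ExtensionType):
    def __init__(self):
        super().__init__(pa.binary(16), "example.uuid")

    def __arrow_ext_serialize__(self):
        return b""

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return cls()

    # Proposed hook: tell pyarrow which Array subclass should wrap this type's data.
    __arrow_ext_array_class__ = UuidArray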

I came up with a quick-and-dirty method wherein the ExtensionType subclass
simply provides a method to translate from the storage type to the output
type, and ExtensionArray has a __getitem__ implementation that passes the
element from storage through the translation function.  This doesn't feel
outside the realm of what is often acceptable in the Python world, but it
isn't nearly as typeful as the direction Arrow seems to be leaning.  Plus, this feels
very far from what was intended in the issue, and I believe that I'm not
understanding the underlying design principles.
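
Concretely, the quick-and-dirty version looks roughly like this (a sketch
only; PeriodType and from_storage_value are hypothetical names):

import pyarrow as pa

class PeriodType(pa.ExtensionType):
    def __init__(self):
        super().__init__(pa.int64(), "example.period")

    def __arrow_ext_serialize__(self):
        return b""

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return cls()

    def from_storage_value(self, value):
        # Translate a raw storage value into the user-facing representation.
        return "P{}M".format(value)

class TranslatingExtensionArray(pa.ExtensionArray):
    def __getitem__(self, i):
        # Pull the element from storage and pass it through the type's translation.
        return self.type.from_storage_value(self.storage[i].as_py())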

Can I get a bit of advice on this?

Thanks.
-J

On Tue, Oct 29, 2019 at 12:26 PM Justin Polchlopek 
wrote:

> That sounds about right.  We're doing some work here that might require
> this feature sooner than later, and if we decide to go the route that needs
> this improved support, I'd be happy to make this PR.  Thanks for showing
> that issue.  I'll be sure to tag any contribution with that ticket number.
>
> On Tue, Oct 29, 2019 at 9:01 AM Joris Van den Bossche <
> jorisvandenboss...@gmail.com> wrote:
>
>>
>> On Mon, 28 Oct 2019 at 22:41, Wes McKinney  wrote:
>>
>>> Adding dev@
>>>
>>> I don't believe we have APIs yet for plugging in user-defined Array
>>> subtypes. I assume you've read
>>>
>>>
>>> http://arrow.apache.org/docs/python/extending_types.html#defining-extension-types-user-defined-types
>>>
>>> There may be some JIRA issues already about this (defining subclasses
>>> of pa.Array with custom behavior) -- since Joris has been working on
>>> this I'm interested in more comments
>>>
>>
>> Yes, there is https://issues.apache.org/jira/browse/ARROW-6176 for
>> exactly this issue.
>> What I proposed there is to allow one to subclass pyarrow.ExtensionArray
>> and to attach this to an attribute on the custom ExtensionType (eg
>> __arrow_ext_array_class__ in line with the other __arrow_ext_..
>> methods). That should allow achieving functionality similar to what is
>> available in Java, I think.
>>
>> If that seems like a good way to do this, I think we would certainly welcome
>> a PR for that (otherwise I can also look into it before 1.0).
>>
>> Joris
>>
>>
>>>
>>> On Mon, Oct 28, 2019 at 3:56 PM Justin Polchlopek
>>>  wrote:
>>> >
>>> > Hi!
>>> >
>>> > I've been working through understanding extension types in Arrow.
>>> It's a great feature, and I've had no problems getting things working in
>>> Java/Scala; however, Python has been a bit of a different story.  Not that
>>> I am unable to create and register extension types in Python, but rather
>>> that I can't seem to recreate the functionality provided by the Java API's
>>> ExtensionTypeVector class.
>>> >
>>> > In Java, ExtensionType::getNewVector() provides a clear pathway from
>>> the registered type to output a vector in something other than the
>>> underlying vector type, and I am at a loss for how to get this same
>>> functionality in Python.  Am I missing something?
>>> >
>>> > Thanks for any hints.
>>> > -Justin
>>>
>>


[jira] [Created] (ARROW-7080) [Python][Parquet] Expose parquet field_id in Schema objects

2019-11-06 Thread Ted Gooch (Jira)
Ted Gooch created ARROW-7080:


 Summary: [Python][Parquet] Expose parquet field_id in Schema 
objects
 Key: ARROW-7080
 URL: https://issues.apache.org/jira/browse/ARROW-7080
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Python
Reporter: Ted Gooch


I'm in the process of adding parquet read support to 
Iceberg ([https://iceberg.apache.org/]), and we use the parquet field_ids as a 
consistent id when reading a parquet file to create a mapping between the current 
schema and the schema of the file being read.  Unless I've missed something, it 
appears that field_id is not exposed in the Python APIs in 
pyarrow._parquet.ParquetSchema, nor is it available in pyarrow.lib.Schema.

Would it be possible to add this to either of those two objects?
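
For context, a rough sketch of where one would look today (neither object
currently surfaces field_id; the file name is hypothetical):

{code}
import pyarrow.parquet as pq

pf = pq.ParquetFile("data.parquet")           # hypothetical file
parquet_schema = pf.schema                    # pyarrow._parquet.ParquetSchema
arrow_schema = pf.schema.to_arrow_schema()    # pyarrow.lib.Schema
# Neither object currently exposes the underlying Parquet field_id.
{code}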



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7079) [C++][Dataset] Implement ScalarAsStatisctics for non-primitive types

2019-11-06 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-7079:
-

 Summary: [C++][Dataset] Implement ScalarAsStatisctics for 
non-primitive types
 Key: ARROW-7079
 URL: https://issues.apache.org/jira/browse/ARROW-7079
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++ - Dataset
Reporter: Francois Saint-Jacques


Statistics are not extracted for the following (parquet) types

- BYTE_ARRAY
- FLBA
- Any logical timestamps/dates



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Saving Binary Arrow memory objects as blobs in Cassandra

2019-11-06 Thread Lee, David
Is there any way to save Arrow memory as a blob? I tried using Feather and 
Parquet, but neither one supports writing complex nested structures yet.

I tried with the following test file.

test.jsonl:
{"a": 1, "b": "abc", "c": [1, 2], "d": {"e": true, "f": "1991-02-03"}, "g": 
[{"h": 1, "i": "a"}, {"h": 2, "i": "b"}]}
{"a": 2, "b": "xyz", "c": [3, 4], "d": {"e": false, "f": "2010-01-15"}, "g": 
[{"h": 3, "i": "c"}, {"h": 2, "i": "d"}]}

code:
import pyarrow.json as json
arrow_mem = json.read_json("test.jsonl")

Trying something out..

Storing Arrow Data in Cassandra for fast retrieval with primary keys.
Solr indexing the Arrow Data blob for Cassandra retrieval by primary key.



Re: [DISCUSS] Dictionary Encoding Clarifications/Future Proofing

2019-11-06 Thread Wes McKinney
Just bumping this thread for more comments

On Wed, Oct 30, 2019 at 3:11 PM Wes McKinney  wrote:
>
> Returning to this discussion as there seems to lack consensus in the vote 
> thread
>
> Copying Micah's proposals in the VOTE thread here, I wanted to state
> my opinions so we can discuss further and see where there is potential
> disagreement
>
> 1.  It is not required that all dictionary batches occur at the beginning
> of the IPC stream format (if the first record batch has an all-null
> dictionary-encoded column, that column's dictionary might not be sent
> until later in the stream).
>
> This seems preferable to requiring a placeholder empty dictionary
> batch. This does mean more to test but the integration tests will
> force the issue
>
> 2.  A second dictionary batch for the same ID that is not a "delta batch"
> in an IPC stream indicates the dictionary should be replaced.
>
> Agree.
>
> 3.  Clarifies that the file format can only contain 1 "NON-delta"
> dictionary batch and multiple "delta" dictionary batches.
>
> Agree -- it is also worth stating explicitly that dictionary
> replacements are not allowed in the file format.
>
> In the file format, all the dictionaries must be "loaded" up front.
> The code path for loading the dictionaries ideally should use nearly
> the same code as the stream-reader code that sees follow-up dictionary
> batches interspersed in the stream. The only downside is that it will
> not be possible to exactly preserve the dictionary "state" as of each
> record batch being written.
>
> So if we had a file containing
>
> DICTIONARY ID=0
> RECORD BATCH
> RECORD BATCH
> DICTIONARY DELTA ID=0
> RECORD BATCH
> RECORD BATCH
>
> Then after processing/loading the dictionaries, the first two record
> batches will have a dictionary that is "larger" (on account of the
> delta) than when they were written. Since dictionaries are
> fundamentally about data representation, they still represent the same
> data so I think this is acceptable.
>
> 4.  Add an enum to dictionary metadata for possible future changes in what
> format dictionary batches can be sent in (the most likely would be an array
> Map).  An enum is needed as a placeholder to allow for forward
> compatibility past the 1.0.0 release.
>
> I'm least sure about this but I do not think it is harmful to have a
> forward-compatible "escape hatch" for future evolutions in dictionary
> encoding.
>
> On Wed, Oct 16, 2019 at 2:57 AM Micah Kornfield  wrote:
> >
> > I'll plan on starting a vote in the next day or two if there are no further
> > objections/comments.
> >
> > On Sun, Oct 13, 2019 at 11:06 AM Micah Kornfield 
> > wrote:
> >
> > > I think the only point asked on the PR that I think is worth discussing is
> > > assumptions about dictionaries at the beginning of streams.
> > >
> > > There are two options:
> > > 1.  Based on the current wording, it does not seem that all dictionaries
> > > need to be at the beginning of the stream if they aren't made use of in 
> > > the
> > > first record batch (i.e. a dictionary encoded column is all null in the
> > > first record batch).
> > > 2.  We require a dictionary batch for each dictionary at the beginning of
> > > the stream (and require implementations to send an empty batch if they
> > > don't have the dictionary available).
> > >
> > > The current proposal in the PR is option #1.
> > >
> > > Thanks,
> > > Micah
> > >
> > > On Sat, Oct 5, 2019 at 4:01 PM Micah Kornfield 
> > > wrote:
> > >
> > >> I've opened a pull request [1] to clarify some recent conversations about
> > >> semantics/edge cases for dictionary encoding [2][3] around interleaved
> > >> batches and when isDelta=False.
> > >>
> > >> Specifically, it proposes that isDelta=False indicates dictionary
> > >> replacement.  For the file format, only one isDelta=False batch is allowed
> > >> per file, and isDelta=true batches are applied in the order supplied in the
> > >> file footer.
> > >>
> > >> In addition, I've added a new enum to DictionaryEncoding to preserve
> > >> future compatibility in case we want to expand dictionary encoding to be 
> > >> an
> > >> explicit mapping from "ID" to "VALUE" as discussed in [4].
> > >>
> > >> Once people have had a chance to review and come to a consensus, I will
> > >> call a formal vote to approve and commit the change.
> > >>
> > >> Thanks,
> > >> Micah
> > >>
> > >> [1] https://github.com/apache/arrow/pull/5585
> > >> [2]
> > >> https://lists.apache.org/thread.html/9734b71bc12aca16eb997388e95105bff412fdaefa4e19422f477389@%3Cdev.arrow.apache.org%3E
> > >> [3]
> > >> https://lists.apache.org/thread.html/5c3c9346101df8d758e24664638e8ada0211d310ab756a89cde3786a@%3Cdev.arrow.apache.org%3E
> > >> [4]
> > >> https://lists.apache.org/thread.html/15a4810589b2eb772bce5b2372970d9d93badbd28999a1bbe2af418a@%3Cdev.arrow.apache.org%3E
> > >>
> > >>


[jira] [Created] (ARROW-7078) [Developer] Add Windows utility script to use Dependencies.exe to dump DLL dependencies for diagnostic purposes

2019-11-06 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-7078:
---

 Summary: [Developer] Add Windows utility script to use 
Dependencies.exe to dump DLL dependencies for diagnostic purposes
 Key: ARROW-7078
 URL: https://issues.apache.org/jira/browse/ARROW-7078
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration, Developer Tools
Reporter: Wes McKinney


See

https://lucasg.github.io/2018/04/29/Dependencies-command-line/

This would help us diagnose DLL load issues



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7077) [C++] Unsupported Dict->T cast crashes instead of returning error

2019-11-06 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-7077:
-

 Summary: [C++] Unsupported Dict->T cast crashes instead of 
returning error
 Key: ARROW-7077
 URL: https://issues.apache.org/jira/browse/ARROW-7077
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, C++ - Compute
Affects Versions: 0.15.1
Reporter: Antoine Pitrou






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[NIGHTLY] Arrow Build Report for Job nightly-2019-11-06-0

2019-11-06 Thread Crossbow


Arrow Build Report for Job nightly-2019-11-06-0

All tasks: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0

Failed Tasks:
- gandiva-jar-osx:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-travis-gandiva-jar-osx
- gandiva-jar-trusty:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-travis-gandiva-jar-trusty

Succeeded Tasks:
- centos-6:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-azure-centos-6
- centos-7:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-azure-centos-7
- centos-8:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-azure-centos-8
- conda-linux-gcc-py27:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-azure-conda-linux-gcc-py27
- conda-linux-gcc-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-azure-conda-linux-gcc-py36
- conda-linux-gcc-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-azure-conda-linux-gcc-py37
- conda-osx-clang-py27:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-azure-conda-osx-clang-py27
- conda-osx-clang-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-azure-conda-osx-clang-py36
- conda-osx-clang-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-azure-conda-osx-clang-py37
- conda-win-vs2015-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-azure-conda-win-vs2015-py36
- conda-win-vs2015-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-azure-conda-win-vs2015-py37
- debian-buster:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-azure-debian-buster
- debian-stretch:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-azure-debian-stretch
- docker-c_glib:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-circle-docker-c_glib
- docker-cpp-cmake32:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-circle-docker-cpp-cmake32
- docker-cpp-release:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-circle-docker-cpp-release
- docker-cpp-static-only:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-circle-docker-cpp-static-only
- docker-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-circle-docker-cpp
- docker-dask-integration:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-circle-docker-dask-integration
- docker-docs:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-circle-docker-docs
- docker-go:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-circle-docker-go
- docker-hdfs-integration:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-circle-docker-hdfs-integration
- docker-iwyu:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-circle-docker-iwyu
- docker-java:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-circle-docker-java
- docker-js:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-circle-docker-js
- docker-lint:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-circle-docker-lint
- docker-pandas-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-circle-docker-pandas-master
- docker-python-2.7-nopandas:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-circle-docker-python-2.7-nopandas
- docker-python-2.7:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-circle-docker-python-2.7
- docker-python-3.6-nopandas:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-circle-docker-python-3.6-nopandas
- docker-python-3.6:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-circle-docker-python-3.6
- docker-python-3.7:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-circle-docker-python-3.7
- docker-r-conda:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-circle-docker-r-conda
- docker-r:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-circle-docker-r
- docker-rust:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-06-0-circle-docker-rust
- docker-spark-integration:
  

[jira] [Created] (ARROW-7076) `pip install pyarrow` with python 3.8 fail with message : Could not build wheels for pyarrow which use PEP 517 and cannot be installed directly

2019-11-06 Thread Fabien (Jira)
Fabien created ARROW-7076:
-

 Summary: `pip install pyarrow` with python 3.8 fail with message : 
Could not build wheels for pyarrow which use PEP 517 and cannot be installed 
directly
 Key: ARROW-7076
 URL: https://issues.apache.org/jira/browse/ARROW-7076
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.15.1
 Environment: Ubuntu 19.10 / Python 3.8.0
Reporter: Fabien


When I install pyarrow in Python 3.7.5 with `pip install pyarrow`, it works.

However, with Python 3.8.0 it fails with the following error:
{noformat}
14:06 $ pip install pyarrow
Collecting pyarrow
 Using cached 
https://files.pythonhosted.org/packages/e0/e6/d14b4a2b54ef065b1a2c576537abe805c1af0c94caef70d365e2d78fc528/pyarrow-0.15.1.tar.gz
 Installing build dependencies ... done
 Getting requirements to build wheel ... done
 Preparing wheel metadata ... done
Collecting numpy>=1.14
 Using cached 
https://files.pythonhosted.org/packages/3a/8f/f9ee25c0ae608f86180c26a1e35fe7ea9d71b473ea7f54db20759ba2745e/numpy-1.17.3-cp38-cp38-manylinux1_x86_64.whl
Collecting six>=1.0.0
 Using cached 
https://files.pythonhosted.org/packages/65/26/32b8464df2a97e6dd1b656ed26b2c194606c16fe163c695a992b36c11cdf/six-1.13.0-py2.py3-none-any.whl
Building wheels for collected packages: pyarrow
 Building wheel for pyarrow (PEP 517) ... error
 ERROR: Command errored out with exit status 1:
 command: /home/fabien/.local/share/virtualenvs/pipenv-_eZlsrLD/bin/python3.8 
/home/fabien/.local/share/virtualenvs/pipenv-_eZlsrLD/lib/python3.8/site-packages/pip/_vendor/pep517/_in_process.py
 build_wheel /tmp/tmp4gpyu82j
 cwd: /tmp/pip-install-cj5ucedq/pyarrow
 Complete output (490 lines):
 running bdist_wheel
 running build
 running build_py
 creating build
 creating build/lib.linux-x86_64-3.8
 creating build/lib.linux-x86_64-3.8/pyarrow
 copying pyarrow/flight.py -> build/lib.linux-x86_64-3.8/pyarrow
 copying pyarrow/orc.py -> build/lib.linux-x86_64-3.8/pyarrow
 copying pyarrow/jvm.py -> build/lib.linux-x86_64-3.8/pyarrow
 copying pyarrow/util.py -> build/lib.linux-x86_64-3.8/pyarrow
 copying pyarrow/pandas_compat.py -> build/lib.linux-x86_64-3.8/pyarrow
 copying pyarrow/cuda.py -> build/lib.linux-x86_64-3.8/pyarrow
 copying pyarrow/filesystem.py -> build/lib.linux-x86_64-3.8/pyarrow
 copying pyarrow/json.py -> build/lib.linux-x86_64-3.8/pyarrow
 copying pyarrow/feather.py -> build/lib.linux-x86_64-3.8/pyarrow
 copying pyarrow/serialization.py -> build/lib.linux-x86_64-3.8/pyarrow
 copying pyarrow/ipc.py -> build/lib.linux-x86_64-3.8/pyarrow
 copying pyarrow/parquet.py -> build/lib.linux-x86_64-3.8/pyarrow
 copying pyarrow/_generated_version.py -> build/lib.linux-x86_64-3.8/pyarrow
 copying pyarrow/benchmark.py -> build/lib.linux-x86_64-3.8/pyarrow
 copying pyarrow/types.py -> build/lib.linux-x86_64-3.8/pyarrow
 copying pyarrow/hdfs.py -> build/lib.linux-x86_64-3.8/pyarrow
 copying pyarrow/fs.py -> build/lib.linux-x86_64-3.8/pyarrow
 copying pyarrow/plasma.py -> build/lib.linux-x86_64-3.8/pyarrow
 copying pyarrow/csv.py -> build/lib.linux-x86_64-3.8/pyarrow
 copying pyarrow/compat.py -> build/lib.linux-x86_64-3.8/pyarrow
 copying pyarrow/__init__.py -> build/lib.linux-x86_64-3.8/pyarrow
 creating build/lib.linux-x86_64-3.8/pyarrow/tests
 copying pyarrow/tests/test_strategies.py -> 
build/lib.linux-x86_64-3.8/pyarrow/tests
 copying pyarrow/tests/test_array.py -> build/lib.linux-x86_64-3.8/pyarrow/tests
 copying pyarrow/tests/test_tensor.py -> 
build/lib.linux-x86_64-3.8/pyarrow/tests
 copying pyarrow/tests/test_json.py -> build/lib.linux-x86_64-3.8/pyarrow/tests
 copying pyarrow/tests/test_cython.py -> 
build/lib.linux-x86_64-3.8/pyarrow/tests
 copying pyarrow/tests/test_deprecations.py -> 
build/lib.linux-x86_64-3.8/pyarrow/tests
 copying pyarrow/tests/conftest.py -> build/lib.linux-x86_64-3.8/pyarrow/tests
 copying pyarrow/tests/test_memory.py -> 
build/lib.linux-x86_64-3.8/pyarrow/tests
 copying pyarrow/tests/test_io.py -> build/lib.linux-x86_64-3.8/pyarrow/tests
 copying pyarrow/tests/pandas_examples.py -> 
build/lib.linux-x86_64-3.8/pyarrow/tests
 copying pyarrow/tests/test_compute.py -> 
build/lib.linux-x86_64-3.8/pyarrow/tests
 copying pyarrow/tests/util.py -> build/lib.linux-x86_64-3.8/pyarrow/tests
 copying pyarrow/tests/test_cuda_numba_interop.py -> 
build/lib.linux-x86_64-3.8/pyarrow/tests
 copying pyarrow/tests/test_pandas.py -> 
build/lib.linux-x86_64-3.8/pyarrow/tests
 copying pyarrow/tests/test_sparse_tensor.py -> 
build/lib.linux-x86_64-3.8/pyarrow/tests
 copying pyarrow/tests/test_fs.py -> build/lib.linux-x86_64-3.8/pyarrow/tests
 copying pyarrow/tests/test_schema.py -> 
build/lib.linux-x86_64-3.8/pyarrow/tests
 copying pyarrow/tests/test_extension_type.py -> 
build/lib.linux-x86_64-3.8/pyarrow/tests
 copying pyarrow/tests/test_hdfs.py -> build/lib.linux-x86_64-3.8/pyarrow/tests
 copying 

[jira] [Created] (ARROW-7074) [C++] ASSERT_OK_AND_ASSIGN crashes when failing

2019-11-06 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-7074:
-

 Summary: [C++] ASSERT_OK_AND_ASSIGN crashes when failing
 Key: ARROW-7074
 URL: https://issues.apache.org/jira/browse/ARROW-7074
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Developer Tools
Affects Versions: 0.15.1
Reporter: Antoine Pitrou


Instead of simply failing the test, the {{ASSERT_OK_AND_ASSIGN}} macro crashes 
when the operation fails, e.g.:
{code}
Value of: _st.ok()
  Actual: false
Expected: true
WARNING: Logging before InitGoogleLogging() is written to STDERR
F1106 12:53:32.882110  4698 result.cc:28] ValueOrDie called on an error:  XXX
{code}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7073) [Java] Support concating vectors values in batch

2019-11-06 Thread Liya Fan (Jira)
Liya Fan created ARROW-7073:
---

 Summary: [Java] Support concating vectors values in batch
 Key: ARROW-7073
 URL: https://issues.apache.org/jira/browse/ARROW-7073
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


We need a way to copy vector values in batch. Currently, we have copyFrom and 
copyFromSafe APIs. However, they are not enough, as copying values individually 
is not performant. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7072) [Java] Support concating validity bits efficiently

2019-11-06 Thread Liya Fan (Jira)
Liya Fan created ARROW-7072:
---

 Summary: [Java] Support concating validity bits efficiently
 Key: ARROW-7072
 URL: https://issues.apache.org/jira/browse/ARROW-7072
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


For scenarios where we need to concatenate vectors (like the scenario in ARROW-7048, 
and delta dictionaries), we need a way to concatenate validity bits. 

Currently, we only have bit-level APIs to read/write individual validity bits. 
However, that is not efficient, and we need a way to copy more bits at a time. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)