Java: DefaultVectorComparators - invalid implementation

2020-04-09 Thread Martin Janda

I took a first look at the Apache Arrow Java sources.

I found an incorrect implementation of DefaultVectorComparators.LongComparator,
and I suspect that other comparators can be wrong too.


Simple test:

long l1 = Long.MIN_VALUE + 1L;
long l2 = Long.MAX_VALUE;

System.out.println("Arrow: " + Long.signum(l1 - l2));
System.out.println("Java : " + Long.compare(l1, l2));

Result:
Arrow: 1
Java : -1

I think tests should be added for these corner cases, and I suggest replacing
the Arrow implementation with Long.compare, Integer.compare, etc., which give
correct results; a sketch of the fix follows.
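
For illustration, a minimal sketch of an overflow-safe comparator (the class
and method names only approximate Arrow's comparator API, and plain arrays
stand in here for the attached vectors):

// Sketch: delegate to Long.compare instead of returning the sign of a
// subtraction, which wraps around for extreme operands.
final class SafeLongComparator {
    private final long[] vector1;
    private final long[] vector2;

    SafeLongComparator(long[] vector1, long[] vector2) {
        this.vector1 = vector1;
        this.vector2 = vector2;
    }

    int compareNotNull(int index1, int index2) {
        // Long.compare never overflows, unlike Long.signum(a - b).
        return Long.compare(vector1[index1], vector2[index2]);
    }
}
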
Thank you
Martin

PS I have no access to JIRA to report this issue.



[jira] [Created] (ARROW-8388) [C++] GCC 4.8 fails to move on return

2020-04-09 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-8388:
---

 Summary: [C++] GCC 4.8 fails to move on return
 Key: ARROW-8388
 URL: https://issues.apache.org/jira/browse/ARROW-8388
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 0.16.0
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 0.17.0


See https://github.com/apache/arrow/pull/6883#issuecomment-611661733

This is a recurring problem which usually shows up as a broken nightly build
(the Gandiva nightly jobs, specifically), along with similar issues caused by
GCC 4.8's incomplete handling of C++11. As long as someone depends on these
compilers, we should probably have an every-commit CI job which checks that we
haven't introduced such a breakage.





[jira] [Created] (ARROW-8387) [rust] Make schema_to_fb public because it is very useful!

2020-04-09 Thread Max Burke (Jira)
Max Burke created ARROW-8387:


 Summary: [rust] Make schema_to_fb public because it is very useful!
 Key: ARROW-8387
 URL: https://issues.apache.org/jira/browse/ARROW-8387
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Max Burke


Make schema_to_fb public because it is very useful!





Re: [Python] black vs. autopep8

2020-04-09 Thread Joris Van den Bossche
> > So autopep8 doesn't fix everything? Sounds inferior to me. That said, I'm
> > in favor of any resolution that increases our automation of this and
> > decreases the energy we expend debating it.
>
> It does fix everything, where "everything" is compliance with PEP8,
> which I think is the thing we are most interested in.
>

Note that autopep8 does not fix everything, even for PEP8 compliance.
At least, it often cannot automatically fix "line too long" errors for
code lines (that was actually the trigger for me to start the thread about
adopting black).


Re: mutual TLS auth support with arrow flight

2020-04-09 Thread David Li
Hey David,

This isn't exposed right now. You'd have to expose the gRPC option on
the client and server sides; right now while Flight does set up SSL
credentials when TLS is enabled, it's only to allow you to set the
root certificate on the client [1] and the server certificate [2].
There is already support for custom authentication methods, though, if
you're amenable to something other than mTLS (e.g. username/password
or auth token).

If you're interested in contributing this, I think it should be fairly
straightforward - you'd just need to add options that get passed
through to gRPC - though you'd have to also expose the option to
Python.

For Java servers, you can use the flight-grpc artifact [3] to obtain a
"plain" gRPC service from your Flight Producer implementation, which
you can then attach to a gRPC server that you've configured with mTLS.
Unfortunately this convenience isn't (reasonably) possible to
implement in Python with the way that gRPC-C++ and gRPC-Python are
designed.
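
As a rough sketch of that Java route (FlightGrpcUtils and the grpc-java /
Netty TLS APIs are assumed here as of Arrow 0.16 and may differ; the
certificate paths, the port, and the no-op producer/auth handler are
placeholders):

import java.io.File;
import java.util.concurrent.Executors;

import org.apache.arrow.flight.FlightGrpcUtils;
import org.apache.arrow.flight.NoOpFlightProducer;
import org.apache.arrow.flight.auth.ServerAuthHandler;
import org.apache.arrow.memory.RootAllocator;

import io.grpc.BindableService;
import io.grpc.Server;
import io.grpc.netty.GrpcSslContexts;
import io.grpc.netty.NettyServerBuilder;
import io.netty.handler.ssl.ClientAuth;
import io.netty.handler.ssl.SslContextBuilder;

public class MtlsFlightServer {
  public static void main(String[] args) throws Exception {
    RootAllocator allocator = new RootAllocator();
    // "Plain" gRPC service wrapping a Flight producer (no-op for the sketch;
    // ServerAuthHandler.NO_OP is assumed -- substitute your own handler).
    BindableService flightService = FlightGrpcUtils.createFlightService(
        allocator, new NoOpFlightProducer(), ServerAuthHandler.NO_OP,
        Executors.newCachedThreadPool());

    // Standard grpc-java mTLS: server cert/key, a CA to verify clients
    // against, and client certificates required.
    SslContextBuilder ssl = SslContextBuilder
        .forServer(new File("server.crt"), new File("server.key"))
        .trustManager(new File("ca.crt"))
        .clientAuth(ClientAuth.REQUIRE);

    Server server = NettyServerBuilder.forPort(8815)
        .addService(flightService)
        .sslContext(GrpcSslContexts.configure(ssl).build())
        .build()
        .start();
    server.awaitTermination();
  }
}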

Best,
David

[1]: 
https://github.com/apache/arrow/blob/2914899326d50d3e2853f5ecbd165028d862378f/cpp/src/arrow/flight/client.cc#L538-L542
[2]: 
https://github.com/apache/arrow/blob/2914899326d50d3e2853f5ecbd165028d862378f/cpp/src/arrow/flight/server.cc#L674-L678
[3]: https://search.maven.org/search?q=g:org.apache.arrow%20AND%20a:flight-grpc

On 4/9/20, David Seapy  wrote:
> grpc supports connections using mutual TLS with client and server
> certificates. Is there an example of how to do this with arrow flight
> libraries, or does one need to step down to the grpc-level when making
> requests?
>
> Specifically we are working on having data-scientists establish a
> connection with our scala flight server from their python client.  If it
> is not currently supported but is a feature that the community would
> benefit from, then maybe we can take a stab at adding support for this.
>
> Any advice or pointers would be appreciated.
>
> Thanks!
>
> - David Seapy
>
>


[jira] [Created] (ARROW-8391) [C++] Implement row range read API for IPC file (and Feather)

2020-04-09 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8391:
---

 Summary: [C++] Implement row range read API for IPC file (and 
Feather)
 Key: ARROW-8391
 URL: https://issues.apache.org/jira/browse/ARROW-8391
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


The objective would be to be able to read a range of rows from the middle of a
file. It's not as easy as it might sound, since all the record batch metadata
must be examined to determine the start and end points of the row range.
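
For a sense of the problem, a sketch with the Java IPC reader (the file name
and the start/count values are hypothetical): without such an API, every batch
up to the range has to be loaded just to learn its row count, whereas the
proposed API would examine only the record batch metadata.

{code:java}
import java.io.FileInputStream;

import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowFileReader;

public class RowRangeScan {
  public static void main(String[] args) throws Exception {
    long start = 1_000_000;  // first row wanted
    long count = 10_000;     // number of rows wanted
    try (RootAllocator allocator = new RootAllocator();
         FileInputStream in = new FileInputStream("data.arrow");
         ArrowFileReader reader = new ArrowFileReader(in.getChannel(), allocator)) {
      VectorSchemaRoot root = reader.getVectorSchemaRoot();
      long seen = 0;
      while (count > 0 && reader.loadNextBatch()) {
        int rows = root.getRowCount();
        if (seen + rows > start) {
          int offset = (int) Math.max(0, start - seen);
          int take = (int) Math.min(rows - offset, count);
          // Slicing each vector at [offset, offset + take) yields the range.
          count -= take;
        }
        seen += rows;
      }
    }
  }
}
{code}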





[jira] [Created] (ARROW-8389) [Integration] Run tests in parallel

2020-04-09 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-8389:
-

 Summary: [Integration] Run tests in parallel
 Key: ARROW-8389
 URL: https://issues.apache.org/jira/browse/ARROW-8389
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration, Integration
Reporter: Antoine Pitrou


This follows ARROW-8176.





Re: [Rust] Dictionary encoding and Flight

2020-04-09 Thread Paul Dix
I'd be happy to pitch in on getting the integration tests developed. It
would certainly beat my current method of building and running my test
project and switching over to a Jupyter notebook to manually check it.

Is there any prior work in the Rust project that I could basically copy
from? Or perhaps the C++ implementation?

On Thu, Apr 9, 2020 at 6:31 PM Wes McKinney  wrote:

> hi Paul,
>
> Dictionary-encoded is not a nested type, so there shouldn't be any
> children -- the IPC layout of a dictionary-encoded field is the same
> as the type of the indices (probably want to change the terminology in
> the Rust library from "keys" to "indices" which is what's used in the
> specification). And the dictionaries are sent separately in record
> batches.
>
> > The Rust implementation of DictionaryArray keeps the values as a child
> to the keys array.
>
> What we do in C++ is that the dictionary is a separate member of the
> ArrayData data structure. I don't know enough about the Rust library
> to know whether that's the right design choice or not.
>
> Implementing Rust support for integration tests would make it much
> easier to validate that things are implemented correctly. Does anyone
> know how far away the Rust library might be from having them? It's not
> the most fun thing to implement, but it's very important.
>
> - Wes
>
> On Thu, Apr 9, 2020 at 5:08 PM Paul Dix  wrote:
> >
> > I managed to get something up and running. I ended up creating a
> > dictionary_batch.rs and adding that to convert.rs to translate
> dictionary
> > fields in a schema over to the correct fb thing. I also added a method to
> > writer.rs to convert that to bytes so it can be sent via ipc. However,
> when
> > writing out the subsequent record batches that contain a dictionary field
> > that references the dictionary sent in the dictionary batch, it runs
> into a
> > problem. The Rust implementation of DictionaryArray keeps the values as a
> > child to the keys array.
> >
> > You can see the chain of calls when calling finish on the
> > StringDictionaryBuilder
> >
> https://github.com/apache/arrow/blob/master/rust/arrow/src/array/builder.rs#L1312-L1316
> > which calls out to finish the dictionary
> >
> https://github.com/apache/arrow/blob/master/rust/arrow/src/array/builder.rs#L464-L482
> > which then calls from to create the DictionaryArray
> >
> https://github.com/apache/arrow/blob/master/rust/arrow/src/array/array.rs#L1900-L1927
> > And that last part validates that the child is there.
> >
> > And here in the writer is where it blindly writes out any children:
> >
> https://github.com/apache/arrow/blob/master/rust/arrow/src/ipc/writer.rs#L366-L378
> >
> > I found that when I sent this out, the Python reader on the other side
> > would choke on it when I tried to read_pandas. However, if I skip
> > serializing the children for DictionaryArray only, then everything works
> > exactly as expected.
> >
> > So I'm not sure what the right thing here is because I don't know if
> > DictionaryArrays should not include the children (i.e. values) when being
> > written out or if there's some other thing that should be happening.
> Should
> > the raw arrow data have the values in that dictionary? Or does it get
> > reconstituted manually on the other side from the DictionaryBatch?
> >
> > On Wed, Apr 8, 2020 at 12:05 AM Wes McKinney 
> wrote:
> >
> > > As another item for consideration -- in C++ at least, the dictionary
> > > id is dealt with as an internal detail of the IPC message production
> > > process. When serializing the Schema, id's are assigned to each
> > > dictionary-encoded field in the DictionaryMemo object, see
> > >
> > >
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/dictionary.h
> > >
> > > When record batches are reconstructed, the dictionary corresponding to
> > > an id at the time of reconstruction is set in the Array's internal
> > > data -- that's the "dictionary" member of the ArrayData object
> > > (
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/array.h#L231).
> > >
> > > On Tue, Apr 7, 2020 at 1:22 PM Wes McKinney 
> wrote:
> > > >
> > > > hey Paul,
> > > >
> > > > Take a look at how dictionaries work in the IPC protocol
> > > >
> > > >
> > >
> https://github.com/apache/arrow/blob/master/docs/source/format/Columnar.rst#serialization-and-interprocess-communication-ipc
> > > >
> > > > Dictionaries are sent as separate messages. When a field is tagged as
> > > > dictionary encoded in the schema, the IPC reader must keep track of
> > > > the dictionaries it's seen come across the protocol and then set them
> > > > in the reconstructed record batches when a record batch comes
> through.
> > > >
> > > > Note that the protocol now supports dictionary deltas (dictionaries
> > > > can be appended to by subsequent messages for the same dictionary id)
> > > > and replacements (new dictionary for an id).
> > > >
> > > > I don't know what the status of handling dictionaries in the Rust
> IPC,
> > 

Re: [Rust] Dictionary encoding and Flight

2020-04-09 Thread Wes McKinney
Well, luckily we have some newly spruced up documentation about how
integration testing works (thanks Neal!)

https://github.com/apache/arrow/blob/master/docs/source/format/Integration.rst

The main task is writing a parser for the JSON format used for
integration testing. The JSON is used to communicate the "truth" about
the data between two implementations. There are implementations of
this in C++, Go, JS, and Java, so any of those implementations could
be used as your guide.

One small reminder for others reading: the current JSON integration
testing format doesn't yet capture dictionary deltas/replacements. See
ARROW-5338.

On Thu, Apr 9, 2020 at 5:55 PM Paul Dix  wrote:
>
> I'd be happy to pitch in on getting the integration tests developed. It
> would certainly beat my current method of building and running my test
> project and switching over to a Jupyter notebook to manually check it.
>
> Is there any prior work in the Rust project that I could basically copy
> from? Or perhaps the C++ implementation?
>
> On Thu, Apr 9, 2020 at 6:31 PM Wes McKinney  wrote:
>
> > hi Paul,
> >
> > Dictionary-encoded is not a nested type, so there shouldn't be any
> > children -- the IPC layout of a dictionary-encoded field is the same
> > as the type of the indices (probably want to change the terminology in
> > the Rust library from "keys" to "indices" which is what's used in the
> > specification). And the dictionaries are sent separately in record
> > batches.
> >
> > > The Rust implementation of DictionaryArray keeps the values as a child
> > to the keys array.
> >
> > What we do in C++ is that the dictionary is a separate member of the
> > ArrayData data structure. I don't know enough about the Rust library
> > to know whether that's the right design choice or not.
> >
> > Implementing Rust support for integration tests would make it much
> > easier to validate that things are implemented correctly. Does anyone
> > know how far away the Rust library might be from having them? It's not
> > the most fun thing to implement, but it's very important.
> >
> > - Wes
> >
> > On Thu, Apr 9, 2020 at 5:08 PM Paul Dix  wrote:
> > >
> > > I managed to get something up and running. I ended up creating a
> > > dictionary_batch.rs and adding that to convert.rs to translate
> > dictionary
> > > fields in a schema over to the correct fb thing. I also added a method to
> > > writer.rs to convert that to bytes so it can be sent via ipc. However,
> > when
> > > writing out the subsequent record batches that contain a dictionary field
> > > that references the dictionary sent in the dictionary batch, it runs
> > into a
> > > problem. The Rust implementation of DictionaryArray keeps the values as a
> > > child to the keys array.
> > >
> > > You can see the chain of calls when calling finish on the
> > > StringDictionaryBuilder
> > >
> > https://github.com/apache/arrow/blob/master/rust/arrow/src/array/builder.rs#L1312-L1316
> > > which calls out to finish the dictionary
> > >
> > https://github.com/apache/arrow/blob/master/rust/arrow/src/array/builder.rs#L464-L482
> > > which then calls from to create the DictionaryArray
> > >
> > https://github.com/apache/arrow/blob/master/rust/arrow/src/array/array.rs#L1900-L1927
> > > And that last part validates that the child is there.
> > >
> > > And here in the writer is where it blindly writes out any children:
> > >
> > https://github.com/apache/arrow/blob/master/rust/arrow/src/ipc/writer.rs#L366-L378
> > >
> > > I found that when I sent this out, the Python reader on the other side
> > > would choke on it when I tried to read_pandas. However, if I skip
> > > serializing the children for DictionaryArray only, then everything works
> > > exactly as expected.
> > >
> > > So I'm not sure what the right thing here is because I don't know if
> > > DictionaryArrays should not include the children (i.e. values) when being
> > > written out or if there's some other thing that should be happening.
> > Should
> > > the raw arrow data have the values in that dictionary? Or does it get
> > > reconstituted manually on the other side from the DictionaryBatch?
> > >
> > > On Wed, Apr 8, 2020 at 12:05 AM Wes McKinney 
> > wrote:
> > >
> > > > As another item for consideration -- in C++ at least, the dictionary
> > > > id is dealt with as an internal detail of the IPC message production
> > > > process. When serializing the Schema, id's are assigned to each
> > > > dictionary-encoded field in the DictionaryMemo object, see
> > > >
> > > >
> > https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/dictionary.h
> > > >
> > > > When record batches are reconstructed, the dictionary corresponding to
> > > > an id at the time of reconstruction is set in the Array's internal
> > > > data -- that's the "dictionary" member of the ArrayData object
> > > > (
> > https://github.com/apache/arrow/blob/master/cpp/src/arrow/array.h#L231).
> > > >
> > > > On Tue, Apr 7, 2020 at 1:22 PM Wes McKinney 
> > wrote:
> > > > >

Re: [Rust] Dictionary encoding and Flight

2020-04-09 Thread Wes McKinney
hi Paul,

Dictionary-encoded is not a nested type, so there shouldn't be any
children -- the IPC layout of a dictionary-encoded field is the same
as the type of the indices (probably want to change the terminology in
the Rust library from "keys" to "indices" which is what's used in the
specification). And the dictionaries are sent separately in record
batches.
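
To make that concrete, a sketch with the Java IPC writer (assuming the Java
Dictionary/DictionaryProvider/ArrowStreamWriter APIs; details unverified): the
schema message carries only the index type plus a dictionary id, start() then
ships the dictionary values as their own DictionaryBatch message, and the
record batches carry just the indices.

import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.FieldVector;
import org.apache.arrow.vector.VarCharVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.dictionary.Dictionary;
import org.apache.arrow.vector.dictionary.DictionaryEncoder;
import org.apache.arrow.vector.dictionary.DictionaryProvider;
import org.apache.arrow.vector.ipc.ArrowStreamWriter;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.DictionaryEncoding;

public class DictionaryIpcSketch {
  public static void main(String[] args) throws Exception {
    try (RootAllocator allocator = new RootAllocator();
         VarCharVector values = new VarCharVector("values", allocator);
         VarCharVector raw = new VarCharVector("col", allocator)) {
      // Dictionary values: stored once, sent as a separate DictionaryBatch.
      values.allocateNew();
      values.setSafe(0, "foo".getBytes(StandardCharsets.UTF_8));
      values.setSafe(1, "bar".getBytes(StandardCharsets.UTF_8));
      values.setValueCount(2);
      Dictionary dictionary = new Dictionary(values,
          new DictionaryEncoding(/*id=*/1L, /*ordered=*/false,
              new ArrowType.Int(8, true)));

      // The column itself, then its dictionary-encoded (indices-only) form.
      raw.allocateNew();
      raw.setSafe(0, "foo".getBytes(StandardCharsets.UTF_8));
      raw.setSafe(1, "bar".getBytes(StandardCharsets.UTF_8));
      raw.setSafe(2, "foo".getBytes(StandardCharsets.UTF_8));
      raw.setValueCount(3);
      try (FieldVector indices =
               (FieldVector) DictionaryEncoder.encode(raw, dictionary);
           VectorSchemaRoot root = VectorSchemaRoot.of(indices);
           ByteArrayOutputStream out = new ByteArrayOutputStream();
           ArrowStreamWriter writer = new ArrowStreamWriter(root,
               new DictionaryProvider.MapDictionaryProvider(dictionary), out)) {
        writer.start();      // schema message, then the dictionary batch
        writer.writeBatch(); // record batch: indices only, tagged with id 1
        writer.end();
      }
    }
  }
}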

> The Rust implementation of DictionaryArray keeps the values as a child to the 
> keys array.

What we do in C++ is that the dictionary is a separate member of the
ArrayData data structure. I don't know enough about the Rust library
to know whether that's the right design choice or not.

Implementing Rust support for integration tests would make it much
easier to validate that things are implemented correctly. Does anyone
know how far away the Rust library might be from having them? It's not
the most fun thing to implement, but it's very important.

- Wes

On Thu, Apr 9, 2020 at 5:08 PM Paul Dix  wrote:
>
> I managed to get something up and running. I ended up creating a
> dictionary_batch.rs and adding that to convert.rs to translate dictionary
> fields in a schema over to the correct fb thing. I also added a method to
> writer.rs to convert that to bytes so it can be sent via ipc. However, when
> writing out the subsequent record batches that contain a dictionary field
> that references the dictionary sent in the dictionary batch, it runs into a
> problem. The Rust implementation of DictionaryArray keeps the values as a
> child to the keys array.
>
> You can see the chain of calls when calling finish on the
> StringDictionaryBuilder
> https://github.com/apache/arrow/blob/master/rust/arrow/src/array/builder.rs#L1312-L1316
> which calls out to finish the dictionary
> https://github.com/apache/arrow/blob/master/rust/arrow/src/array/builder.rs#L464-L482
> which then calls from to create the DictionaryArray
> https://github.com/apache/arrow/blob/master/rust/arrow/src/array/array.rs#L1900-L1927
> And that last part validates that the child is there.
>
> And here in the writer is where it blindly writes out any children:
> https://github.com/apache/arrow/blob/master/rust/arrow/src/ipc/writer.rs#L366-L378
>
> I found that when I sent this out, the Python reader on the other side
> would choke on it when I tried to read_pandas. However, if I skip
> serializing the children for DictionaryArray only, then everything works
> exactly as expected.
>
> So I'm not sure what the right thing here is because I don't know if
> DictionaryArrays should not include the children (i.e. values) when being
> written out or if there's some other thing that should be happening. Should
> the raw arrow data have the values in that dictionary? Or does it get
> reconstituted manually on the other side from the DictionaryBatch?
>
> On Wed, Apr 8, 2020 at 12:05 AM Wes McKinney  wrote:
>
> > As another item for consideration -- in C++ at least, the dictionary
> > id is dealt with as an internal detail of the IPC message production
> > process. When serializing the Schema, id's are assigned to each
> > dictionary-encoded field in the DictionaryMemo object, see
> >
> > https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/dictionary.h
> >
> > When record batches are reconstructed, the dictionary corresponding to
> > an id at the time of reconstruction is set in the Array's internal
> > data -- that's the "dictionary" member of the ArrayData object
> > (https://github.com/apache/arrow/blob/master/cpp/src/arrow/array.h#L231).
> >
> > On Tue, Apr 7, 2020 at 1:22 PM Wes McKinney  wrote:
> > >
> > > hey Paul,
> > >
> > > Take a look at how dictionaries work in the IPC protocol
> > >
> > >
> > https://github.com/apache/arrow/blob/master/docs/source/format/Columnar.rst#serialization-and-interprocess-communication-ipc
> > >
> > > Dictionaries are sent as separate messages. When a field is tagged as
> > > dictionary encoded in the schema, the IPC reader must keep track of
> > > the dictionaries it's seen come across the protocol and then set them
> > > in the reconstructed record batches when a record batch comes through.
> > >
> > > Note that the protocol now supports dictionary deltas (dictionaries
> > > can be appended to by subsequent messages for the same dictionary id)
> > > and replacements (new dictionary for an id).
> > >
> > > I don't know what the status of handling dictionaries in the Rust IPC,
> > > but it would be a good idea to take time to take into account the
> > > above details.
> > >
> > > Finally, note that Rust is not participating in either the regular IPC
> > > nor Flight integration tests. This is an important milestone to being
> > > able to depend on the Rust library in production.
> > >
> > > Thanks
> > > Wes
> > >
> > > On Tue, Apr 7, 2020 at 10:36 AM Paul Dix  wrote:
> > > >
> > > > Hello,
> > > > I'm trying to build a Rust based Flight server and I'd like to use
> > > > Dictionary encoding for a number of string columns in my data. I've
> > seen
> > > > 

mutual TLS auth support with arrow flight

2020-04-09 Thread David Seapy
grpc supports connections using mutual TLS with client and server 
certificates. Is there an example of how to do this with arrow flight 
libraries, or does one need to step down to the grpc-level when making 
requests?


Specifically we are working on having data-scientists establish a 
connection with our scala flight server from their python client.  If it 
is not currently supported but is a feature that the community would 
benefit from, then maybe we can take a stab at adding support for this.


Any advice or pointers would be appreciated.

Thanks!

- David Seapy



[jira] [Created] (ARROW-8390) [R] Expose schema unification features

2020-04-09 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-8390:
--

 Summary: [R] Expose schema unification features
 Key: ARROW-8390
 URL: https://issues.apache.org/jira/browse/ARROW-8390
 Project: Apache Arrow
  Issue Type: New Feature
  Components: R
Reporter: Neal Richardson
Assignee: Neal Richardson
 Fix For: 0.17.0








Re: [Rust] Dictionary encoding and Flight

2020-04-09 Thread Paul Dix
I managed to get something up and running. I ended up creating a
dictionary_batch.rs and adding that to convert.rs to translate dictionary
fields in a schema over to the correct fb thing. I also added a method to
writer.rs to convert that to bytes so it can be sent via ipc. However, when
writing out the subsequent record batches that contain a dictionary field
that references the dictionary sent in the dictionary batch, it runs into a
problem. The Rust implementation of DictionaryArray keeps the values as a
child to the keys array.

You can see the chain of calls when calling finish on the
StringDictionaryBuilder
https://github.com/apache/arrow/blob/master/rust/arrow/src/array/builder.rs#L1312-L1316
which calls out to finish the dictionary
https://github.com/apache/arrow/blob/master/rust/arrow/src/array/builder.rs#L464-L482
which then calls from to create the DictionaryArray
https://github.com/apache/arrow/blob/master/rust/arrow/src/array/array.rs#L1900-L1927
And that last part validates that the child is there.

And here in the writer is where it blindly writes out any children:
https://github.com/apache/arrow/blob/master/rust/arrow/src/ipc/writer.rs#L366-L378

I found that when I sent this out, the Python reader on the other side
would choke on it when I tried to read_pandas. However, if I skip
serializing the children for DictionaryArray only, then everything works
exactly as expected.

So I'm not sure what the right thing here is because I don't know if
DictionaryArrays should not include the children (i.e. values) when being
written out or if there's some other thing that should be happening. Should
the raw arrow data have the values in that dictionary? Or does it get
reconstituted manually on the other side from the DictionaryBatch?

On Wed, Apr 8, 2020 at 12:05 AM Wes McKinney  wrote:

> As another item for consideration -- in C++ at least, the dictionary
> id is dealt with as an internal detail of the IPC message production
> process. When serializing the Schema, id's are assigned to each
> dictionary-encoded field in the DictionaryMemo object, see
>
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/dictionary.h
>
> When record batches are reconstructed, the dictionary corresponding to
> an id at the time of reconstruction is set in the Array's internal
> data -- that's the "dictionary" member of the ArrayData object
> (https://github.com/apache/arrow/blob/master/cpp/src/arrow/array.h#L231).
>
> On Tue, Apr 7, 2020 at 1:22 PM Wes McKinney  wrote:
> >
> > hey Paul,
> >
> > Take a look at how dictionaries work in the IPC protocol
> >
> >
> https://github.com/apache/arrow/blob/master/docs/source/format/Columnar.rst#serialization-and-interprocess-communication-ipc
> >
> > Dictionaries are sent as separate messages. When a field is tagged as
> > dictionary encoded in the schema, the IPC reader must keep track of
> > the dictionaries it's seen come across the protocol and then set them
> > in the reconstructed record batches when a record batch comes through.
> >
> > Note that the protocol now supports dictionary deltas (dictionaries
> > can be appended to by subsequent messages for the same dictionary id)
> > and replacements (new dictionary for an id).
> >
> > I don't know what the status of handling dictionaries in the Rust IPC,
> > but it would be a good idea to take time to take into account the
> > above details.
> >
> > Finally, note that Rust is not participating in either the regular IPC
> > nor Flight integration tests. This is an important milestone to being
> > able to depend on the Rust library in production.
> >
> > Thanks
> > Wes
> >
> > On Tue, Apr 7, 2020 at 10:36 AM Paul Dix  wrote:
> > >
> > > Hello,
> > > I'm trying to build a Rust based Flight server and I'd like to use
> > > Dictionary encoding for a number of string columns in my data. I've
> seen
> > > that StringDictionary was recently added to Rust here:
> > >
> https://github.com/apache/arrow/commit/c7a7d2dcc46ed06593b994cb54c5eaf9ccd1d21d#diff-72812e30873455dcee2ce2d1ee26e4ab
> .
> > >
> > > However, that doesn't seem to reach down into Flight. When I attempt to
> > > send a schema through flight that has a Dictionary it
> throws
> > > an error when attempting to convert from the Rust type to the
> Flatbuffer
> > > field type. I figured I'd take a swing at adding that to convert.rs
> here:
> > >
> https://github.com/apache/arrow/blob/master/rust/arrow/src/ipc/convert.rs#L319
> > >
> > > However, when I look at the definitions in Schema.fbs and the related
> > > generated Rust file, Dictionary isn't a type there. Should I be sending
> > > this down as some other composed type? And if so, how does this look
> at the
> > > client side of things? In my test I'm connecting to the Flight server
> via
> > > PyArrow and working with it in Pandas so I'm hoping that it will be
> able to
> > > consume Dictionary fields.
> > >
> > > Separately, the Rust field type doesn't have a spot for the dictionary
> ID,
> > > which I 

Re: Java: DefaultVectorComparators - invalid implementation

2020-04-09 Thread Fan Liya
Hi Martin,

Thank you so much for reporting this problem.

In the current implementation, we do not consider corner cases related to
integer overflow, and this problem should be fixed.

I have opened an issue to track this problem [1].
Do you want to provide a patch for it?

Best,
Liya Fan

[1] https://issues.apache.org/jira/browse/ARROW-8392

On Fri, Apr 10, 2020 at 12:55 AM Martin Janda  wrote:

> I took a first look at the Apache Arrow Java sources.
>
> I found an incorrect implementation of DefaultVectorComparators.LongComparator,
> and I suspect that other comparators can be wrong too.
>
> Simple test:
>
> long l1 = Long.MIN_VALUE + 1L;
> long l2 = Long.MAX_VALUE;
>
> System.out.println("Arrow: " + Long.signum(l1 - l2));
> System.out.println("Java : " + Long.compare(l1, l2));
>
> Result:
> Arrow: 1
> Java : -1
>
> I think tests should be added for these corner cases, and I suggest
> replacing the Arrow implementation with Long.compare, Integer.compare, etc.,
> which give correct results.
> Thank you
> Martin
>
> PS I have no access to JIRA to report this issue.
>
>


[jira] [Created] (ARROW-8392) [Java] Fix overflow related corner cases for vector value comparison

2020-04-09 Thread Liya Fan (Jira)
Liya Fan created ARROW-8392:
---

 Summary: [Java] Fix overflow related corner cases for vector value 
comparison
 Key: ARROW-8392
 URL: https://issues.apache.org/jira/browse/ARROW-8392
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Liya Fan


1. Fix corner cases related to overflow.
2. Provide test cases for the corner cases. 
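
A sketch of the kind of corner-case test meant here (plain JUnit over the
comparison logic itself, not tied to the actual vector classes):

{code:java}
import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class ComparatorOverflowTest {

  @Test
  public void subtractionOverflowsNearTheExtremes() {
    long l1 = Long.MIN_VALUE + 1L;
    long l2 = Long.MAX_VALUE;
    // The subtraction-based comparison wraps around and gets the sign wrong...
    assertEquals(1, Long.signum(l1 - l2));
    // ...while an overflow-safe comparison reports the correct ordering.
    assertEquals(-1, Integer.signum(Long.compare(l1, l2)));
    // Boundary pairs a fixed comparator must also get right:
    assertEquals(-1, Integer.signum(Long.compare(Long.MIN_VALUE, Long.MAX_VALUE)));
    assertEquals(0, Long.compare(Long.MAX_VALUE, Long.MAX_VALUE));
    assertEquals(1, Integer.signum(Long.compare(Long.MAX_VALUE, Long.MIN_VALUE)));
  }
}
{code}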





[jira] [Created] (ARROW-8380) [RUST] StringDictionaryBuilder not publicly exported from arrow::array

2020-04-09 Thread Jira
Jörn Horstmann created ARROW-8380:
-

 Summary: [RUST] StringDictionaryBuilder not publicly exported from 
arrow::array
 Key: ARROW-8380
 URL: https://issues.apache.org/jira/browse/ARROW-8380
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Reporter: Jörn Horstmann








Re: [Python] black vs. autopep8

2020-04-09 Thread Uwe L. Korn
The non-configurability of black is one of the strongest arguments I see for 
black. The codestyle will always be subjective. From previous discussions I 
know that my personal preference of readability conflicts with that of Antoine 
and Wes, so will probably others. We have the same issue with using the Google 
C++ style guide in C++. Not everyone agrees with it but at least it is a guide 
that is adopted widely (and thus seen in other projects quite often) and has 
well outlined reasonings for most of its choices.

On Wed, Apr 8, 2020, at 10:25 PM, Xinbin Huang wrote:
> Another option that we can look into is yapf (
> https://github.com/google/yapf). It is similar to black but more tweakable.
> Also, it is recently adopted by the Apache Beam project. PR is here
> https://github.com/apache/beam/pull/10684/files
> 
> Bin
> 
> On Wed, Apr 8, 2020 at 1:18 PM Wes McKinney  wrote:
> 
> > I don't think it's possible unfortunately. From the README: "Black
> > reformats entire files in place. It is not configurable."
> >
> > The main concern about Black is the impact that it has on readability.
> > I share this concern as the subjective style choices it makes are
> > quite different from the way I've been writing Python code the last 12
> > years. That doesn't mean I'm right and it's wrong, of course. In this
> > project, it doesn't seem like we're losing much energy to people
> > reformatting arguments lists or adding line breaks here and there,
> > etc. FWIW, from reading the Black docs, it doesn't strike me that the
> > tool is designed to maximize readability, but rather to make
> > formatting deterministic and reduce code diffs caused by reformatting.
> >
> > My opinion is that we should simply avoid debating code style in code
> > reviews as long as the code passes the PEP8 checks. Employing autopep8
> > is an improvement over the status quo (which is that developers must
> > fix flake8 / PEP8 warnings by hand). We can always revisit the Black
> > discussion in the future if we find that subjective code formatting
> > (beyond PEP8 compliance) is taking up our energy inappropriately.
> >
> > On Wed, Apr 8, 2020 at 2:13 PM Rok Mihevc  wrote:
> > >
> > > Could we 'tone down' black to get the desired behavior?
> > > I'm ok with either tool.
> > >
> > > Rok
> > >
> > > On Wed, Apr 8, 2020 at 8:00 PM Wes McKinney  wrote:
> > >
> > > > On Wed, Apr 8, 2020 at 12:47 PM Neal Richardson
> > > >  wrote:
> > > > >
> > > > > So autopep8 doesn't fix everything? Sounds inferior to me. That
> > said, I'm
> > > > > in favor of any resolution that increases our automation of this and
> > > > > decreases the energy we expend debating it.
> > > >
> > > > It does fix everything, where "everything" is compliance with PEP8,
> > > > which I think is the thing we are most interested in.
> > > >
> > > > Black makes a bunch of other arbitrary (albeit consistent)
> > > > reformattings that don't affect PEP8 compliance.
> > > >
> > > > > Neal
> > > > >
> > > > >
> > > > > On Wed, Apr 8, 2020 at 10:34 AM Wes McKinney 
> > > > wrote:
> > > > >
> > > > > > Circling back on this, it seems there isn't consensus about
> > switching
> > > > > > to Black, and using autopep8 at least will give us an easy way to
> > > > > > maintain PEP8 compliance and help contributors fix linting failures
> > > > > > detected by flake8 (but not all, e.g. unused imports would need to
> > be
> > > > > > manually removed). Would everyone be on board with using autopep8?
> > > > > >
> > > > > > On Thu, Apr 2, 2020 at 9:07 AM Wes McKinney 
> > > > wrote:
> > > > > > >
> > > > > > > I'm personally fine with the Black changes. After the one-time
> > cost
> > > > of
> > > > > > > reformatting the codebase, it will take any personal preferences
> > out
> > > > > > > of code formatting (I admit that I have several myself, but I
> > don't
> > > > > > > mind the normalization provided by Black). I hope that Cython
> > support
> > > > > > > comes soon since a great deal of our code is Cython
> > > > > > >
> > > > > > > On Thu, Apr 2, 2020 at 9:00 AM Jacek Pliszka <
> > > > jacek.plis...@gmail.com>
> > > > > > wrote:
> > > > > > > >
> > > > > > > > Hi!
> > > > > > > >
> > > > > > > > I believe amount of changes is not that important.
> > > > > > > >
> > > > > > > > In my opinion, what matters is which format will allow
> > reviewers
> > > > to be
> > > > > > > > more efficient.
> > > > > > > >
> > > > > > > > The committer can always reformat as they like. It is harder
> > for
> > > > the
> > > > > > reviewer.
> > > > > > > >
> > > > > > > > BR,
> > > > > > > >
> > > > > > > > Jacek
> > > > > > > >
> > > > > > > > czw., 2 kwi 2020 o 15:32 Antoine Pitrou 
> > > > > > napisał(a):
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > PS: in both cases, Cython files are not processed.  autopep8
> > is
> > > > > > actually
> > > > > > > > > able to process them, but the comparison wouldn't be
> > > > > > apples-to-apples.
> > > > > > > > >
> > > > > > > > > (that said, autopep8 

Re: [C++] Compute: Datum and "ChunkedArray&" inputs

2020-04-09 Thread Antoine Pitrou


It seems there are two different concerns here:
- the kernels' public API
- the ease with which kernels can be implemented.

If we need a different public API, then IMO we should change it sooner
rather than later.

As for the implementation, perhaps we should start by drawing
the main categories of kernels.  For example:
- item-wise kernels (and/or vector-wise)
- aggregate kernels

Some kernels will not fall in these categories (e.g. filter, join...),
but many of them do.
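
For illustration, the two shapes could look like this (not Arrow's actual
API; Java used only as pseudocode here):

interface ItemWiseKernel<I, O> {
  O apply(I value);            // one value in, one out, no state carried
}

interface AggregateKernel<I, S, O> {
  S init();                    // fresh accumulation state
  S update(S state, I value);  // fold one value into the state
  O finish(S state);           // final result, e.g. a sum or a mean
}

Item-wise kernels map values one-to-one; aggregate kernels fold a whole
column down to a single result.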

Regards

Antoine.


On Wed, 8 Apr 2020 16:04:30 -0500
Wes McKinney  wrote:
> On Wed, Apr 8, 2020 at 4:01 PM Wes McKinney  wrote:
> >
> > Another idea would be to have a variant with const-references instead
> > of shared_ptr. One potential issue with our Datum is that it plays the
> > dual role of transporting both input and output arguments. With
> > outputs it's necessary to be able to convey ownership while with
> > inputs this is less important.
> >
> > In general, I would say that some additional thought is needed about
> > our kernel APIs. Some scattered thoughts about this:
> >
> > * Kernel "binding" (selecting a kernel given input types) is a bit ad
> > hoc. Many kernels do not have the ability to tell you up front what
> > type of data will be returned (this is very important in the context
> > of a query engine runtime)  
> 
> Another observation is that many of our kernels are unable to write
> into preallocated memory. This is also a design issue that must be
> addressed (kernels having the same output type invoked in succession
> should be able to elide temporary allocations by writing into the same
> output memory)
> 
> > * It's unclear that having a large collection of
> > $FUNC(FunctionContext*, Datum $arg0, ..., Datum* out) is sustainable.
> >
> > Rather, it might be better to have something like:
> >
> > // We know the "schema" of each argument but don't have any data yet
> > std::string kernel_name = "$name";
> > auto bound_kernel = GetKernel(kernel_name, {arg0_shape, arg1_shape}, 
> > options);
> > auto out = bound_kernel->Call({arg0, arg1});
> >
> > So here it would be
> >
> > ARROW_ASSIGN_OR_RAISE(auto take_kernel,
> >   GetKernel("take", {shapes::chunked_array(int8()),
> >  shapes::chunked_array(int8())}));
> >
> > So now take_kernel->shape() should tell you that the expected output
> > is shapes::chunked_array(int8())
> >
> > And Call should preferably accept reference arguments for its input 
> > parameters.
> >
> > As far as kernel-specific options, we could create a variant that
> > includes the different kernel-specific options structures, so that
> > it's not necessary to have different dispatch functions for different
> > kernels.
> >
> > Anyway, just some out loud thoughts, but while we are in this
> > intermediate experimental state I'm supportive of you making the
> > decision that is most convenient for the immediate problem you want to
> > solve with the caveat that things may be refactored significantly in
> > the proximate future.
> >
> > - Wes
> >
> > On Tue, Apr 7, 2020 at 10:42 AM Uwe L. Korn  wrote:  
> > >
> > > I did a bit more research on JIRA and we seem to have this open topic 
> > > there also in https://issues.apache.org/jira/browse/ARROW-6959 which is 
> > > the similar topic as my mail is about and in 
> > > https://issues.apache.org/jira/browse/ARROW-7009 we wanted to remove some 
> > > of the interfaces with reference-types.
> > >
> > > On Tue, Apr 7, 2020, at 1:00 PM, Uwe L. Korn wrote:  
> > > > Hello all,
> > > >
> > > > I'm in the progress of changing the implementation of the Take kernel
> > > > to work on ChunkedArrays without concatenating them into a single Array
> > > > first. While working on the implementation, I realised that we switch
> > > > often between Datum and the specific-typed parameters. This works quite
> > > > well for the combination of Array& and Datum(shared_ptr<Array>) as
> > > > here the reference object with type Array& always carries a shared
> > > > reference with it, so switching between Array& and its Datum is quite
> > > > easy.
> > > >
> > > > In contrast, we cannot do this with ChunkedArrays as here the Datum
> > > > requires a shared_ptr<ChunkedArray> which cannot be constructed from
> > > > the reference type. Thus to allow interfaces like `Status
> > > > Take(FunctionContext* ctx, const ChunkedArray& values, const Array&
> > > > indices,` to pass successfully their arguments to the Kernel
> > > > implementation, we have to do:
> > > >
> > > > a) Remove the references from the interface of the Take() function and
> > > > use `shared_ptr` instances everywhere.
> > > > b) Add interfaces to kernels like the TakeKernel that allow calling
> > > > with specific references instead of Datum instances
> > > >
> > > > Personally I would prefer b) as this allow us to make more use of the
> > > > C++ type system and would also avoid the shared_ptr overhead where not
> > > > necessary.
> > > >
> > > > Cheers,
> > > > 

[jira] [Created] (ARROW-8381) [C++][Dataset] Dataset writing should require a writer schema

2020-04-09 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-8381:
-

 Summary: [C++][Dataset] Dataset writing should require a writer 
schema
 Key: ARROW-8381
 URL: https://issues.apache.org/jira/browse/ARROW-8381
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++ - Dataset
Reporter: Francois Saint-Jacques


# Dataset writing should always take an explicit writer schema instead of the 
first fragment's schema.
# The MakeWritePlanImpl should not try removing columns that are found in the 
partition; this is left to the caller by passing an explicit schema.







[jira] [Created] (ARROW-8386) [Python] pyarrow.jvm raises error for empty Arrays

2020-04-09 Thread Bryan Cutler (Jira)
Bryan Cutler created ARROW-8386:
---

 Summary: [Python] pyarrow.jvm raises error for empty Arrays
 Key: ARROW-8386
 URL: https://issues.apache.org/jira/browse/ARROW-8386
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.16.0
Reporter: Bryan Cutler
Assignee: Bryan Cutler


In the pyarrow.jvm module, when there is an empty array in Java, trying to 
create it in Python raises a ValueError. This is because for an empty array, 
Java returns an empty list of buffers, and pyarrow.jvm then attempts to create 
the array by calling pa.Array.from_buffers with an empty list.





Re: [Python] black vs. autopep8

2020-04-09 Thread Rok Mihevc
+1 for autopep8

On Thu, Apr 9, 2020 at 4:45 PM Wes McKinney  wrote:

> So to summarize, it seems that what we are agreeing on in this thread is
> not to debate the readability of otherwise PEP8-compliant Python code in
> code reviews, is that right? Absent a consensus about a change, the
> outcome is basically "no change". The addition of autopep8 as a tool
> for automated PEP8 fixes (as opposed to manual ones, which is what's
> required right now) seems reasonable in that it does not make invasive
> changes to the current codebase.
>
> We can also abandon this discussion and not use any tool. Either way
> it would be good to reach a conclusion and move on.
>
> - Wes
>
> On Thu, Apr 9, 2020 at 3:05 AM Uwe L. Korn  wrote:
> >
> > The non-configurability of black is one of the strongest arguments I see
> for black. The codestyle will always be subjective. From previous
> discussions I know that my personal preference of readability conflicts
> with that of Antoine and Wes, so will probably others. We have the same
> issue with using the Google C++ style guide in C++. Not everyone agrees
> with it but at least it is a guide that is adopted widely (and thus seen in
> other projects quite often) and has well outlined reasonings for most of
> its choices.
> >
> > On Wed, Apr 8, 2020, at 10:25 PM, Xinbin Huang wrote:
> > > Another option that we can look into is yapf (
> > > https://github.com/google/yapf). It is similar to black but more
> tweakable.
> > > Also, it is recently adopted by the Apache Beam project. PR is here
> > > https://github.com/apache/beam/pull/10684/files
> > >
> > > Bin
> > >
> > > On Wed, Apr 8, 2020 at 1:18 PM Wes McKinney 
> wrote:
> > >
> > > > I don't think it's possible unfortunately. From the README: "Black
> > > > reformats entire files in place. It is not configurable."
> > > >
> > > > The main concern about Black is the impact that it has on
> readability.
> > > > I share this concern as the subjective style choices it makes are
> > > > quite different from the way I've been writing Python code the last
> 12
> > > > years. That doesn't mean I'm right and it's wrong, of course. In this
> > > > project, it doesn't seem like we're losing much energy to people
> > > > reformatting arguments lists or adding line breaks here and there,
> > > > etc. FWIW, from reading the Black docs, it doesn't strike me that the
> > > > tool is designed to maximize readability, but rather to make
> > > > formatting deterministic and reduce code diffs caused by
> reformatting.
> > > >
> > > > My opinion is that we should simply avoid debating code style in code
> > > > reviews as long as the code passes the PEP8 checks. Employing
> autopep8
> > > > is an improvement over the status quo (which is that developers must
> > > > fix flake8 / PEP8 warnings by hand). We can always revisit the Black
> > > > discussion in the future if we find that subjective code formatting
> > > > (beyond PEP8 compliance) is taking up our energy inappropriately.
> > > >
> > > > On Wed, Apr 8, 2020 at 2:13 PM Rok Mihevc 
> wrote:
> > > > >
> > > > > Could we 'tone down' black to get the desired behavior?
> > > > > I'm ok with either tool.
> > > > >
> > > > > Rok
> > > > >
> > > > > On Wed, Apr 8, 2020 at 8:00 PM Wes McKinney 
> wrote:
> > > > >
> > > > > > On Wed, Apr 8, 2020 at 12:47 PM Neal Richardson
> > > > > >  wrote:
> > > > > > >
> > > > > > > So autopep8 doesn't fix everything? Sounds inferior to me. That
> > > > said, I'm
> > > > > > > in favor of any resolution that increases our automation of
> this and
> > > > > > > decreases the energy we expend debating it.
> > > > > >
> > > > > > It does fix everything, where "everything" is compliance with
> PEP8,
> > > > > > which I think is the thing we are most interested in.
> > > > > >
> > > > > > Black makes a bunch of other arbitrary (albeit consistent)
> > > > > > reformattings that don't affect PEP8 compliance.
> > > > > >
> > > > > > > Neal
> > > > > > >
> > > > > > >
> > > > > > > On Wed, Apr 8, 2020 at 10:34 AM Wes McKinney <
> wesmck...@gmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > > Circling back on this, it seems there isn't consensus about
> > > > switching
> > > > > > > > to Black, and using autopep8 at least will give us an easy
> way to
> > > > > > > > maintain PEP8 compliance and help contributors fix linting
> failures
> > > > > > > > detected by flake8 (but not all, e.g. unused imports would
> need to
> > > > be
> > > > > > > > manually removed). Would everyone be on board with using
> autopep8?
> > > > > > > >
> > > > > > > > On Thu, Apr 2, 2020 at 9:07 AM Wes McKinney <
> wesmck...@gmail.com>
> > > > > > wrote:
> > > > > > > > >
> > > > > > > > > I'm personally fine with the Black changes. After the
> one-time
> > > > cost
> > > > > > of
> > > > > > > > > reformatting the codebase, it will take any personal
> preferences
> > > > out
> > > > > > > > > of code formatting (I admit that I have several myself,
> but I
> > > > don't
> > > > > > > > > 

Re: [C++] Compute: Datum and "ChunkedArray&" inputs

2020-04-09 Thread Wes McKinney
On Thu, Apr 9, 2020, 5:25 AM Antoine Pitrou  wrote:

>
> It seems there are two different concerns here:
> - the kernels' public API
> - the ease with which kernels can be implemented.
>
> If we need a different public API, then IMO we should change it sooner
> rather than later.
>

Yes, based on the problems I've listed I think some significant time needs
to be invested in this after 0.17.0 is released. I can allocate some of my
own time to it.


> As for the implementation, perhaps we should start by drawing
> the main categories of kernels.  For example:
> - item-wise kernels (and/or vector-wise)
> - aggregate kernels
>
> Some kernels will not fall in these categories (e.g. filter, join...),
> but many of them do.
>

Yes, it's certainly important to distinguish between "operators" (which are
data flow processing nodes like joins, aggregations, filters, projections)
and functions / kernels (which perform granular units of work on concrete
batches of data). Some operators can be used in "one-shot" mode (e.g.
obtaining a result from a static unit of data), though.


> Regards
>
> Antoine.
>
>
> On Wed, 8 Apr 2020 16:04:30 -0500
> Wes McKinney  wrote:
> > On Wed, Apr 8, 2020 at 4:01 PM Wes McKinney  wrote:
> > >
> > > Another idea would be to have a variant with const-references instead
> > > of shared_ptr. One potential issue with our Datum is that it plays the
> > > dual role of transporting both input and output arguments. With
> > > outputs it's necessary to be able to convey ownership while with
> > > inputs this is less important.
> > >
> > > In general, I would say that some additional thought is needed about
> > > our kernel APIs. Some scattered thoughts about this:
> > >
> > > * Kernel "binding" (selecting a kernel given input types) is a bit ad
> > > hoc. Many kernels do not have the ability to tell you up front what
> > > type of data will be returned (this is very important in the context
> > > of a query engine runtime)
> >
> > Another observation is that many of our kernels are unable to write
> > into preallocated memory. This is also a design issue that must be
> > addressed (kernels having the same output type invoked in succession
> > should be able to elide temporary allocations by writing into the same
> > output memory)
> >
> > > * It's unclear that having a large collection of
> > > $FUNC(FunctionContext*, Datum $arg0, ..., Datum* out) is sustainable.
> > >
> > > Rather, it might be better to have something like:
> > >
> > > // We know the "schema" of each argument but don't have any data yet
> > > std::string kernel_name = "$name";
> > > auto bound_kernel = GetKernel(kernel_name, {arg0_shape, arg1_shape},
> options);
> > > auto out = bound_kernel->Call({arg0, arg1});
> > >
> > > So here it would be
> > >
> > > ARROW_ASSIGN_OR_RAISE(auto take_kernel,
> > >   GetKernel("take", {shapes::chunked_array(int8()),
> > >
> shapes::chunked_array(int8())}));
> > >
> > > So now take_kernel->shape() should tell you that the expected output
> > > is shapes::chunked_array(int8())
> > >
> > > And Call should preferably accept reference arguments for its input
> parameters.
> > >
> > > As far as kernel-specific options, we could create a variant that
> > > includes the different kernel-specific options structures, so that
> > > it's not necessary to have different dispatch functions for different
> > > kernels.
> > >
> > > Anyway, just some out loud thoughts, but while we are in this
> > > intermediate experimental state I'm supportive of you making the
> > > decision that is most convenient for the immediate problem you want to
> > > solve with the caveat that things may be refactored significantly in
> > > the proximate future.
> > >
> > > - Wes
> > >
> > > On Tue, Apr 7, 2020 at 10:42 AM Uwe L. Korn 
> wrote:
> > > >
> > > > I did a bit more research on JIRA and we seem to have this open
> topic there also in https://issues.apache.org/jira/browse/ARROW-6959
> which is the similar topic as my mail is about and in
> https://issues.apache.org/jira/browse/ARROW-7009 we wanted to remove some
> of the interfaces with reference-types.
> > > >
> > > > On Tue, Apr 7, 2020, at 1:00 PM, Uwe L. Korn wrote:
> > > > > Hello all,
> > > > >
> > > > > I'm in the progress of changing the implementation of the Take
> kernel
> > > > > to work on ChunkedArrays without concatenating them into a single
> Array
> > > > > first. While working on the implementation, I realised that we
> switch
> > > > > often between Datum and the specific-typed parameters. This works
> quite
> > > > > well for the combination of Array& and
> Datum(shared_ptr<Array>) as
> > > > > here the reference object with type Array& always carries a shared
> > > > > reference with it, so switching between Array& and its Datum is
> quite
> > > > > easy.
> > > > >
> > > > > In contrast, we cannot do this with ChunkedArrays as here the Datum
> > > > > requires a shared_ptr<ChunkedArray> which cannot be constructed
> from
> > > > > the reference type. Thus to allow 

[jira] [Created] (ARROW-8382) [C++][Dataset] Refactor WritePlan to decouple from Fragment/Scan/Partition classes

2020-04-09 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-8382:
-

 Summary: [C++][Dataset] Refactor WritePlan to decouple from 
Fragment/Scan/Partition classes 
 Key: ARROW-8382
 URL: https://issues.apache.org/jira/browse/ARROW-8382
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Francois Saint-Jacques


WritePlan should look like the following. 

{code:c++}
class ARROW_DS_EXPORT WritePlan {
 public:
  /// Execute the WritePlan and return a FileSystemDataset as a result.
  Result<std::shared_ptr<FileSystemDataset>> Execute();

 protected:
  /// The schema of the Dataset which will be written
  std::shared_ptr<Schema> schema;

  /// The format into which fragments will be written
  std::shared_ptr<FileFormat> format;

  using SourceAndReader = std::pair<FileSource, std::shared_ptr<RecordBatchReader>>;
  /// The (destination, reader) pairs to write
  std::vector<SourceAndReader> outputs;
};
{code}

* Refactor FileFormat::Write(FileSource destination, RecordBatchReader); it is 
not settled whether it should take the output schema or whether the 
RecordBatchReader should already be of the right schema.
* Add a class/function that constructs SourceAndReader from Fragments, 
Partitioning and base path, and remove any Write/Fragment logic from 
partition.cc.
* Move Write() out of FileSystemDataset into WritePlan. It could take a 
FileSystemDatasetFactory to recreate the FileSystemDataset. This is a bonus, 
not a requirement.
* Simplify the writing routine to avoid the PathTree directory structure; it 
shouldn't be more complex than `for task in write_tasks: task()`. No path 
construction should happen there.

The effects are:
* Simplified WritePlan execution, abstracted away from path construction, and 
able to write to multiple FileSystems and/or Buffers since it doesn't construct 
the FileSource.
* By virtue of using RecordBatchReader instead of Fragment, it isn't tied to 
writing from Fragments; it can take any construct that yields a 
RecordBatchReader. It also means that WritePlan doesn't have to know about any 
Scan-related classes.
* Writing can be done with or without partitioning; this logic is given to 
whomever generates the SourceAndReader list.
* Should be simpler to test.





[jira] [Created] (ARROW-8385) Crash on parquet.read_table on Windows with Python 3.8.2

2020-04-09 Thread Geoff Quested-Joens (Jira)
Geoff Quested-Joens created ARROW-8385:
--

 Summary: Crash on parquet.read_table on Windows with Python 3.8.2
 Key: ARROW-8385
 URL: https://issues.apache.org/jira/browse/ARROW-8385
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.16.0
 Environment: Window 10 
python 3.8.2 pip 20.0.2
pip freeze ->
numpy==1.18.2
pandas==1.0.3
pyarrow==0.16.0
python-dateutil==2.8.1
pytz==2019.3
six==1.14.0
Reporter: Geoff Quested-Joens
 Attachments: crash.parquet

On reading a parquet file using pyarrow, the program spontaneously exits with 
no thrown exception, on Windows only. Testing the same setup on Linux (Debian 
10 in Docker), reading the same parquet file completes without issue.

The following can reproduce the crash in a Python 3.8.2 environment (env listed 
above; it is essentially pip install pandas and pyarrow).
{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def test_pandas_write_read():
df_out = pd.DataFrame.from_dict([{"A":i} for i in range(3)])
df_out.to_parquet("crash.parquet")
df_in = pd.read_parquet("crash.parquet")
print(df_in)

def test_arrow_write_read():
df = pd.DataFrame.from_dict([{"A":i} for i in range(3)])
table_out = pa.Table.from_pandas(df)
pq.write_table(table_out, 'crash.parquet')
table_in = pq.read_table('crash.parquet')
print(table_in)

if __name__ == "__main__":
test_pandas_write_read()
test_arrow_write_read()
{code}
The interpreter never reaches the print statements, crashing somewhere in the 
call on line 252 of {{parquet.py}}; no error is thrown, just spontaneous 
program exit.
{code:python}
self.reader.read_all(...
{code}
In contrast, running the same code and Python environment on Debian 10, there 
is no error reading the parquet files generated by the same Windows code. The 
sha256 sums compare equal for the crash.parquet generated on Debian and on 
Windows, so something appears to be wrong with the read. Attached is the 
crash.parquet file generated on my machine.

Oddly, changing the {{range(3)}} to {{range(2)}} gets rid of the crash on 
Windows.





[jira] [Created] (ARROW-8383) [RUST] Easier random access to DictionaryArray keys and values

2020-04-09 Thread Jira
Jörn Horstmann created ARROW-8383:
-

 Summary: [RUST] Easier random access to DictionaryArray keys and 
values
 Key: ARROW-8383
 URL: https://issues.apache.org/jira/browse/ARROW-8383
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Jörn Horstmann


Currently it's not that clear how to access DictionaryArray keys and values 
using random indices.

The `DictionaryArray::keys` method exposes an Iterator with an `nth` method, 
but this requires a mut reference and feels a little bit out of place compared 
to other methods of accessing arrow data.

Another alternative seems to be to use the `From<ArrayDataRef> for 
PrimitiveArray<T>` conversion, like so: `let keys: Int16Array = 
dictionary_array.data().into()`. This seems to work fine but is not easily 
discoverable and also needs to be done outside of any loops for performance 
reasons.

I'd like methods on `DictionaryArray` to directly get the key at some index:

```
pub fn key(&self, i: usize) -> K::Native
```

Ideally I'd also like an easier way to directly access values at some index, at 
least when those are primitive or string types:

```
pub fn value(&self, i: usize) -> T
```

I'm not sure how or if that would be possible to implement with Rust generics.

 





Re: [Python] black vs. autopep8

2020-04-09 Thread Wes McKinney
So to summarize, it seems that what we are agreeing on in this thread is
not to debate the readability of otherwise PEP8-compliant Python code in
code reviews, is that right? Absent a consensus about a change, the
outcome is basically "no change". The addition of autopep8 as a tool
for automated PEP8 fixes (as opposed to manual ones, which is what's
required right now) seems reasonable in that it does not make invasive
changes to the current codebase.

We can also abandon this discussion and not use any tool. Either way
it would be good to reach a conclusion and move on.

- Wes

On Thu, Apr 9, 2020 at 3:05 AM Uwe L. Korn  wrote:
>
> The non-configurability of black is one of the strongest arguments I see for 
> black. The codestyle will always be subjective. From previous discussions I 
> know that my personal preference of readability conflicts with that of 
> Antoine and Wes, so will probably others. We have the same issue with using 
> the Google C++ style guide in C++. Not everyone agrees with it but at least 
> it is a guide that is adopted widely (and thus seen in other projects quite 
> often) and has well outlined reasonings for most of its choices.
>
> On Wed, Apr 8, 2020, at 10:25 PM, Xinbin Huang wrote:
> > Another option that we can look into is yapf (
> > https://github.com/google/yapf). It is similar to black but more tweakable.
> > Also, it is recently adopted by the Apache Beam project. PR is here
> > https://github.com/apache/beam/pull/10684/files
> >
> > Bin
> >
> > On Wed, Apr 8, 2020 at 1:18 PM Wes McKinney  wrote:
> >
> > > I don't think it's possible unfortunately. From the README: "Black
> > > reformats entire files in place. It is not configurable."
> > >
> > > The main concern about Black is the impact that it has on readability.
> > > I share this concern as the subjective style choices it makes are
> > > quite different from the way I've been writing Python code the last 12
> > > years. That doesn't mean I'm right and it's wrong, of course. In this
> > > project, it doesn't seem like we're losing much energy to people
> > > reformatting arguments lists or adding line breaks here and there,
> > > etc. FWIW, from reading the Black docs, it doesn't strike me that the
> > > tool is designed to maximize readability, but rather to make
> > > formatting deterministic and reduce code diffs caused by reformatting.
> > >
> > > My opinion is that we should simply avoid debating code style in code
> > > reviews as long as the code passes the PEP8 checks. Employing autopep8
> > > is an improvement over the status quo (which is that developers must
> > > fix flake8 / PEP8 warnings by hand). We can always revisit the Black
> > > discussion in the future if we find that subjective code formatting
> > > (beyond PEP8 compliance) is taking up our energy inappropriately.
> > >
> > > On Wed, Apr 8, 2020 at 2:13 PM Rok Mihevc  wrote:
> > > >
> > > > Could we 'tone down' black to get the desired behavior?
> > > > I'm ok with either tool.
> > > >
> > > > Rok
> > > >
> > > > On Wed, Apr 8, 2020 at 8:00 PM Wes McKinney  wrote:
> > > >
> > > > > On Wed, Apr 8, 2020 at 12:47 PM Neal Richardson
> > > > >  wrote:
> > > > > >
> > > > > > So autopep8 doesn't fix everything? Sounds inferior to me. That
> > > said, I'm
> > > > > > in favor of any resolution that increases our automation of this and
> > > > > > decreases the energy we expend debating it.
> > > > >
> > > > > It does fix everything, where "everything" is compliance with PEP8,
> > > > > which I think is the thing we are most interested in.
> > > > >
> > > > > Black makes a bunch of other arbitrary (albeit consistent)
> > > > > reformattings that don't affect PEP8 compliance.
> > > > >
> > > > > > Neal
> > > > > >
> > > > > >
> > > > > > On Wed, Apr 8, 2020 at 10:34 AM Wes McKinney 
> > > > > wrote:
> > > > > >
> > > > > > > Circling back on this, it seems there isn't consensus about
> > > switching
> > > > > > > to Black, and using autopep8 at least will give us an easy way to
> > > > > > > maintain PEP8 compliance and help contributors fix linting 
> > > > > > > failures
> > > > > > > detected by flake8 (but not all, e.g. unused imports would need to
> > > be
> > > > > > > manually removed). Would everyone be on board with using autopep8?
> > > > > > >
> > > > > > > On Thu, Apr 2, 2020 at 9:07 AM Wes McKinney 
> > > > > wrote:
> > > > > > > >
> > > > > > > > I'm personally fine with the Black changes. After the one-time
> > > cost
> > > > > of
> > > > > > > > reformatting the codebase, it will take any personal preferences
> > > out
> > > > > > > > of code formatting (I admit that I have several myself, but I
> > > don't
> > > > > > > > mind the normalization provided by Black). I hope that Cython
> > > support
> > > > > > > > comes soon since a great deal of our code is Cython
> > > > > > > >
> > > > > > > > On Thu, Apr 2, 2020 at 9:00 AM Jacek Pliszka <
> > > > > jacek.plis...@gmail.com>
> > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > Hi!
> 

[jira] [Created] (ARROW-8384) [C++][Python] arrow/filesystem/hdfs.h and Python wrapper does not have an option for setting a path to a Kerberos ticket

2020-04-09 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8384:
---

 Summary: [C++][Python] arrow/filesystem/hdfs.h and Python wrapper 
does not have an option for setting a path to a Kerberos ticket
 Key: ARROW-8384
 URL: https://issues.apache.org/jira/browse/ARROW-8384
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney


This feature seems to have been dropped.

Is there a plan for migrating users to the new filesystem API? We have two 
different code paths now.


