Re: Subject: [VOTE] Release Apache Arrow 0.15.0 - RC1

2019-09-30 Thread Sutou Kouhei
If we don't care about the rustfmt check for releases, how about
removing the check from
dev/release/verify-release-candidate.sh?

In 
  "Re: Subject: [VOTE] Release Apache Arrow 0.15.0 - RC1" on Sun, 29 Sep 2019 
17:40:20 -0700,
  Andy Grove  wrote:

> Actually, I think the RC was cut just before 1.40.0 nightly was released,
> which would explain why the rustfmt check fails now. As I think you already
> said, it doesn't really matter anyway since it is just a formatting
> difference.
> 
> On Sun, Sep 29, 2019 at 2:44 PM Sutou Kouhei  wrote:
> 
>> I used dev/release/verify-release-candidate.sh. It installs
>> Rust automatically.
>> Should we update
>>
>> https://github.com/apache/arrow/blob/master/dev/release/verify-release-candidate.sh#L452
>> ?
>>
>> In 
>>   "Re: Subject: [VOTE] Release Apache Arrow 0.15.0 - RC1" on Sun, 29 Sep
>> 2019 07:21:27 -0600,
>>   Andy Grove  wrote:
>>
>> > Just fyi on the rustfmt issue, the formatting was recently updated for
>> rust
>> > 1.40 nightly and if you are using an older version the formatting check
>> > will fail.
>> >
>> > On Sun, Sep 29, 2019, 5:56 AM Wes McKinney  wrote:
>> >
>> >> It's up to Micah as RM, but I think it would be good to fix the
>> sig-related
>> >> issues or we may be dealing with "bug" reports until the next release.
>> I'll
>> >> work on source verification later today in the meantime to see if any
>> other
>> >> issues turn up
>> >>
>> >> On Sun, Sep 29, 2019, 1:19 AM Sutou Kouhei  wrote:
>> >>
>> >> > -0 (binding)
>> >> >
>> >> > I ran the followings on Debian GNU/Linux sid:
>> >> >
>> >> >   * TEST_CSHARP=0 \
>> >> >   TEST_GLIB=0 \
>> >> >   TEST_RUBY=0 \
>> >> >   TEST_RUST=0 \
>> >> >   JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 \
>> >> >   CUDA_TOOLKIT_ROOT=/usr \
>> >> > dev/release/verify-release-candidate.sh source 0.15.0 1
>> >> >   * dev/release/verify-release-candidate.sh binaries 0.14.1 0
>> >> >
>> >> > with:
>> >> >
>> >> >   * gcc (Debian 9.2.1-7) 9.2.1
>> >> >   * openjdk version "1.8.0_212"
>> >> >   * Node.JS v12.1.0
>> >> >   * go version go1.12.9 linux/amd64
>> >> >   * nvidia-cuda-dev 10.1.105-3
>> >> >
>> >> >
>> >> > I got the following failures:
>> >> >
>> >> >   * Not ignorable:
>> >> > * Binary: Bad signature
>> >> >   * centos-rc/6/Source/repodata/repomd.xml is failed
>> >> >   * We can't ignore this if removing the file from
>> >> > https://bintray.com/apache/arrow/centos-rc/0.15.0-rc1 and
>> >> > re-uploading it doesn't solve this problem.
>> >> >
>> >> >   * Ignorable:
>> >> > * C GLib and Ruby: Buildable but can't run test with GLib 2.62.0.
>> >> >   * It's caused by gobject-introspection gem.
>> >> >   * This is a known problem and not a C GLib problem.
>> >> >   * We can ignore this. (I'm fixing gobject-introspection gem.)
>> >> > * Rust: "cargo +stable fmt --all -- --check" is failed (*)
>> >> >   * If I commented the command line out, Rust verification is
>> passed.
>> >> >   * We can ignore this. Because this is just a lint error.
>> >> > * C#: "sourcelink test" is failed
>> >> >   * We can ignore this. This is happened when we release 0.14.1
>> too.
>> >> > * APT and Yum: arm64 and aarch64 are broken
>> >> >   * We can ignore this.
>> >> >
>> >> > (*)
>> >> > 
>> >> > + cargo +stable fmt --all -- --check
>> >> > Diff in
>> /tmp/arrow-0.15.0.tGMnP/apache-arrow-0.15.0/rust/arrow/src/array/
>> >> > builder.rs at line 1458:
>> >> >  let mut builder = StructBuilder::new(fields, field_builders);
>> >> >  assert!(builder.field_builder::(0).is_none());
>> >> >  }
>> >> > -
>> >> >  }
>> >> >
>> >> > Diff in /tmp/arrow-0.15.0.tGMnP/apache-arrow-0.15.0/rust/arrow/src/
>> >> > bitmap.rs at line 126:
>> >> >  assert_eq!(true, bitmap.is_set(6));
>> >> >  assert_eq!(false, bitmap.is_set(7));
>> >> >  }
>> >> > -
>> >> >  }
>> >> >
>> >> > Diff in
>> >> >
>> >>
>> /tmp/arrow-0.15.0.tGMnP/apache-arrow-0.15.0/rust/datafusion/src/execution/
>> >> > aggregate.rs at line 1471:
>> >> >  ds,
>> >> >  )
>> >> >  }
>> >> > -
>> >> >  }
>> >> >
>> >> > Diff in
>> >> >
>> >>
>> /tmp/arrow-0.15.0.tGMnP/apache-arrow-0.15.0/rust/datafusion/src/execution/
>> >> > context.rs at line 682:
>> >> >
>> >> >  Ok(ctx)
>> >> >  }
>> >> > -
>> >> >  }
>> >> >
>> >> > Diff in
>> >> >
>> >>
>> /tmp/arrow-0.15.0.tGMnP/apache-arrow-0.15.0/rust/datafusion/src/execution/physical_plan/
>> >> > hash_aggregate.rs at line 720:
>> >> >
>> >> >  Ok(())
>> >> >  }
>> >> > -
>> >> >  }
>> >> >
>> >> > Diff in
>> >> >
>> >>
>> /tmp/arrow-0.15.0.tGMnP/apache-arrow-0.15.0/rust/datafusion/src/execution/physical_plan/
>> >> > merge.rs at line 134:
>> >> >
>> >> >  Ok(())
>> >> >  }
>> >> > -
>> >> >  }
>> >> >
>> >> > Diff in
>> >> >
>> >>
>> /tmp/arrow-0.15.0.tGMnP/apache-arrow-0.15.0/rust/datafusion/src/execution/physical_plan/
>> >> > projection.rs at line 171:

Clarifying interpretation of Buffer "length" field in Arrow protocol

2019-09-30 Thread Wes McKinney
I just updated my pull request from May adding language to clarify
what protocol writers are expected to set when producing the Arrow
binary protocol

https://github.com/apache/arrow/pull/4370

Implementations may allocate small buffers, or use memory which does
not meet the 8-byte minimal padding requirements of the Arrow
protocol. It becomes a question, then, whether to set the in-memory
buffer size or the padded size when producing the protocol.

This PR states that either is acceptable. As an example, a 1-byte
validity buffer could have Buffer metadata stating that the size
is either 1 byte or 8 bytes. Either way, 7 bytes of padding must be
written to conform to the protocol. The metadata, therefore, reflects
the "intent" of the protocol writer for the protocol reader. If the
writer says the length is 1, then the protocol reader understands that
the writer does not expect the reader to concern itself with the 7
bytes of padding. This could have implications for hashing or
comparisons, for example, so I think that having the flexibility to do
either is a good idea.

For an application that wants to guarantee that AVX512 instructions
can be used on all buffers on the receiver side, it would be
appropriate to include 512-bit padding in the accounting.
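
To make the arithmetic concrete, here is a small illustrative sketch (not
part of the PR, just plain Python showing the padding rule described
above):

    def padded_size(nbytes, alignment=8):
        # Round a buffer size up to the next multiple of `alignment` bytes.
        return ((nbytes + alignment - 1) // alignment) * alignment

    print(padded_size(1))      # 8  -> size a writer may report instead of 1
    print(padded_size(1, 64))  # 64 -> 512-bit (64-byte) padding for AVX512-friendly consumers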

Let me know if others think differently so we can have this properly
documented for the 1.0.0 Format release.

Thanks,
Wes


Re: [DISCUSS] C-level in-process array protocol

2019-09-30 Thread Wes McKinney
A couple things:

* I think it would be better for a C protocol / FFI for Arrow
arrays/vectors to have the same "shape" as an assembled array. Note that the C
structs here have very nearly the same "shape" as the data structure
representing a C++ Array object [1]. The disassembly and reassembly
here is substantially simpler than the IPC protocol. A recursive
structure in Flatbuffers would make RecordBatch messages much larger,
so the flattened / disassembled representation we use for serialized
record batches is the correct one

* The "formal" C protocol having the "assembled" shape means that many
minimal Arrow users won't have to implement any separate data
structures. They can just use the C struct directly or a slightly
wrapped version thereof with some convenience functions.

* I think that requiring users to build a Flatbuffer for minimal use cases
(e.g. communicating simple record batches with primitive types) passes
implementation burden on to minimal users.

I think the mantra of the C protocol should be the following:

* Users of the protocol have to write little to no code to use it. For
example, populating an INT32 array should require only a few lines of
code
* The data structure in the protocol is suitable as an in-memory data
structure for recursive assembly of nested structures

I think that having a string miniformat or a pre-parsed type struct
with enum values (along the lines of what Antoine is describing above)
places less burden on downstream users.

[1]: https://github.com/apache/arrow/blob/master/cpp/src/arrow/array.h#L203

On Mon, Sep 30, 2019 at 4:08 PM Antoine Pitrou  wrote:
>
>
> FlatCC is still a dependency, with generated files etc.
> Perhaps you want to evaluate FlatCC on a schema-like example and see
> what the generated code and compile instructions look like?
>
> I'll point out again that the format string in my proposal uses an
> extremely simple mini-format, that should be parsable very easily by any
> developer, even in raw C:
> https://github.com/apache/arrow/blob/3806fa9ba3ddf95f0d09b865071bf19c5e912756/docs/source/format/CProtocol.rst#data-type-descriptionformat-strings
>
> The parent-child structure in the schema is represented as-is in the
> ArrowArray parent-child relationship, so it doesn't need any encoding.
> Using Flatbuffers for an enum-like field + (at most) a couple parameters
> sounds overkill.
>
> Another possibility would be to replace the format string with
> pre-parsed fields, for example:
>
>   int32_t type;
>   int32_t subtype;  // type-dependent (e.g. unit for temporal types)
>   int32_t type_width;   // for width-parametered types
>   const int8_t* child_ids;   // for unions
>   const char* auxiliary_type_param;  // e.g. timezone for timestamp
>
> The downside is that there are more fields to consider (including two
> optional pointers).
>
> Regards
>
> Antoine.
>
>
> Le 30/09/2019 à 22:48, Ben Kietzman a écrit :
> > FlatCC seems germane: https://github.com/dvidelabs/flatcc
> >
> > It compiles flatbuffer schemas down to (idiomatic?) C
> >
> > Perhaps the schema and batch serialization problems should be solved by
> > storing everything in the flatbuffer format.
> > Then the results of running flatcc plus a few simple helpers can be checked
> > in to provide an accessible C API.
> > With respect to lifetime, Antoine has already done good work on specifying
> > a move only contract which could probably be adapted.
> >
> >
> > On Sun, Sep 29, 2019 at 2:44 PM Antoine Pitrou  wrote:
> >
> >>
> >> One basic design point is to allow exchanging Arrow data with no
> >> mandatory dependency (the exception is JSON and base64 if you want to
> >> act on metadata - but that's highly optional, and those are extremely
> >> widespread formats).  I'm afraid that Flatbuffers may be a deterrent:
> >> not only it introduces a library, but it requires the use of a compiler
> >> to produce generated code.  It also requires familiarizing with, well,
> >> Flatbuffers :-)
> >>
> >> We can of course discuss this and feel it's not a problem.
> >>
> >> Regards
> >>
> >> Antoine.
> >>
> >>
> >> Le 29/09/2019 à 19:47, Wes McKinney a écrit :
> >>> There are two pieces of serialized data needed to communicate a record
> >>> batch from one library to another
> >>>
> >>> * Serialized schema (i.e. what's in Schema.fbs)
> >>> * Serialized "data header", i.e. the "RecordBatch" message in Message.fbs
> >>>
> >>> You _do_ need to use a Flatbuffers library to fully create these
> >>> message types to interact with any existing record batch disassembly /
> >>> reassembly.
> >>>
> >>> I think I'm most concerned about having a new way to serialize
> >>> schemas. We already have JSON-based schema serialization for
> >>> integration test purposes, so one possibility is to standardize that
> >>> and make it a more formalized part of the project specification.
> >>>
> >>> As far as a C protocol, I don't see an especial downside to using the
> >>> Flatbuffers schema to communicate types.
> >>>
> >>> 

Re: [DISCUSS] C-level in-process array protocol

2019-09-30 Thread Antoine Pitrou


FlatCC is still a dependency, with generated files etc.
Perhaps you want to evaluate FlatCC on a schema-like example and see
what the generated code and compile instructions look like?

I'll point out again that the format string in my proposal uses an
extremely simple mini-format, that should be parsable very easily by any
developer, even in raw C:
https://github.com/apache/arrow/blob/3806fa9ba3ddf95f0d09b865071bf19c5e912756/docs/source/format/CProtocol.rst#data-type-descriptionformat-strings

The parent-child structure in the schema is represented as-is in the
ArrowArray parent-child relationship, so it doesn't need any encoding.
Using Flatbuffers for an enum-like field + (at most) a couple parameters
sounds overkill.

Another possibility would be to replace the format string with
pre-parsed fields, for example:

  int32_t type;
  int32_t subtype;  // type-dependent (e.g. unit for temporal types)
  int32_t type_width;   // for width-parametered types
  const int8_t* child_ids;   // for unions
  const char* auxiliary_type_param;  // e.g. timezone for timestamp

The downside is that there are more fields to consider (including two
optional pointers).

Regards

Antoine.


Le 30/09/2019 à 22:48, Ben Kietzman a écrit :
> FlatCC seems germane: https://github.com/dvidelabs/flatcc
> 
> It compiles flatbuffer schemas down to (idiomatic?) C
> 
> Perhaps the schema and batch serialization problems should be solved by
> storing everything in the flatbuffer format.
> Then the results of running flatcc plus a few simple helpers can be checked
> in to provide an accessible C API.
> With respect to lifetime, Antoine has already done good work on specifying
> a move only contract which could probably be adapted.
> 
> 
> On Sun, Sep 29, 2019 at 2:44 PM Antoine Pitrou  wrote:
> 
>>
>> One basic design point is to allow exchanging Arrow data with no
>> mandatory dependency (the exception is JSON and base64 if you want to
>> act on metadata - but that's highly optional, and those are extremely
>> widespread formats).  I'm afraid that Flatbuffers may be a deterrent:
>> not only it introduces a library, but it requires the use of a compiler
>> to produce generated code.  It also requires familiarizing with, well,
>> Flatbuffers :-)
>>
>> We can of course discuss this and feel it's not a problem.
>>
>> Regards
>>
>> Antoine.
>>
>>
>> Le 29/09/2019 à 19:47, Wes McKinney a écrit :
>>> There are two pieces of serialized data needed to communicate a record
>>> batch from one library to another
>>>
>>> * Serialized schema (i.e. what's in Schema.fbs)
>>> * Serialized "data header", i.e. the "RecordBatch" message in Message.fbs
>>>
>>> You _do_ need to use a Flatbuffers library to fully create these
>>> message types to interact with any existing record batch disassembly /
>>> reassembly.
>>>
>>> I think I'm most concerned about having a new way to serialize
>>> schemas. We already have JSON-based schema serialization for
>>> integration test purposes, so one possibility is to standardize that
>>> and make it a more formalized part of the project specification.
>>>
>>> As far as a C protocol, I don't see an especial downside to using the
>>> Flatbuffers schema to communicate types.
>>>
>>> Another thought is to not deviate from the flattened
>>> Flatbuffers-styled representation but to translate the Flatbuffers
>>> types into C types: namely a C struct-based version of the
>>> "RecordBatch" message.
>>>
>>> Independent of the means to communicate the two pieces of serialized
>>> information above (respectively: schemas and record batch field memory
> >>> addresses and field lengths), having a C-based FFI where projects can
>>> drop in a header file containing the ABI they are supposed to
>>> implement, that seems pretty reasonable to me.
>>>
>>> If we don't define a standardized in-memory FFI (whether it uses the
> >>> Flatbuffers objects as inputs/outputs or not) then downstream projects
>>> will devise their own, and that will cause issues long term.
>>>
>>> On Sun, Sep 29, 2019 at 2:59 AM Antoine Pitrou 
>> wrote:


 Le 29/09/2019 à 06:10, Jacques Nadeau a écrit :
> * No dependency on Flatbuffers.
> * No buffer reassembly (data is already exposed in logical Arrow
>> format).
> * Zero-copy by design.
> * Easy to reimplement from scratch.
>
> I don't see how the flatbuffer pattern for data headers doesn't
>> accomplish
> all of these things. At its definition, it is a very simple
>> representation of
> data that could be worked with independently of the flatbuffers
>> codebase.
> It was designed so systems could map directly into that memory without
> interacting with a flatbuffers library.
>
> Specifically the following three structures were designed to already
>> allow
> what I think this proposal is trying to recreate. All three are very
>> simple
> to construct in a direct, non-flatbuffer dependent read/write pattern.

 Are they?  Personally, I wouldn't 

Re: [DISCUSS] C-level in-process array protocol

2019-09-30 Thread Ben Kietzman
FlatCC seems germane: https://github.com/dvidelabs/flatcc

It compiles flatbuffer schemas down to (idiomatic?) C

Perhaps the schema and batch serialization problems should be solved by
storing everything in the flatbuffer format.
Then the results of running flatcc plus a few simple helpers can be checked
in to provide an accessible C API.
With respect to lifetime, Antoine has already done good work on specifying
a move only contract which could probably be adapted.


On Sun, Sep 29, 2019 at 2:44 PM Antoine Pitrou  wrote:

>
> One basic design point is to allow exchanging Arrow data with no
> mandatory dependency (the exception is JSON and base64 if you want to
> act on metadata - but that's highly optional, and those are extremely
> widespread formats).  I'm afraid that Flatbuffers may be a deterrent:
> not only it introduces a library, but it requires the use of a compiler
> to produce generated code.  It also requires familiarizing with, well,
> Flatbuffers :-)
>
> We can of course discuss this and feel it's not a problem.
>
> Regards
>
> Antoine.
>
>
> Le 29/09/2019 à 19:47, Wes McKinney a écrit :
> > There are two pieces of serialized data needed to communicate a record
> > batch from one library to another
> >
> > * Serialized schema (i.e. what's in Schema.fbs)
> > * Serialized "data header", i.e. the "RecordBatch" message in Message.fbs
> >
> > You _do_ need to use a Flatbuffers library to fully create these
> > message types to interact with any existing record batch disassembly /
> > reassembly.
> >
> > I think I'm most concerned about having a new way to serialize
> > schemas. We already have JSON-based schema serialization for
> > integration test purposes, so one possibility is to standardize that
> > and make it a more formalized part of the project specification.
> >
> > As far as a C protocol, I don't see an especial downside to using the
> > Flatbuffers schema to communicate types.
> >
> > Another thought is to not deviate from the flattened
> > Flatbuffers-styled representation but to translate the Flatbuffers
> > types into C types: namely a C struct-based version of the
> > "RecordBatch" message.
> >
> > Independent of the means to communicate the two pieces of serialized
> > information above (respectively: schemas and record batch field memory
> > addresses and field lengths), having a C-based FFI where projects can
> > drop in a header file containing the ABI they are supposed to
> > implement, that seems pretty reasonable to me.
> >
> > If we don't define a standardized in-memory FFI (whether it uses the
> > Flatbuffers objects as inputs/outputs or not) then downstream projects
> > will devise their own, and that will cause issues long term.
> >
> > On Sun, Sep 29, 2019 at 2:59 AM Antoine Pitrou 
> wrote:
> >>
> >>
> >> Le 29/09/2019 à 06:10, Jacques Nadeau a écrit :
> >>> * No dependency on Flatbuffers.
> >>> * No buffer reassembly (data is already exposed in logical Arrow
> format).
> >>> * Zero-copy by design.
> >>> * Easy to reimplement from scratch.
> >>>
> >>> I don't see how the flatbuffer pattern for data headers doesn't
> accomplish
> >>> all of these things. At its definition, it is a very simple
> representation of
> >>> data that could be worked with independently of the flatbuffers
> codebase.
> >>> It was designed so systems could map directly into that memory without
> >>> interacting with a flatbuffers library.
> >>>
> >>> Specifically the following three structures were designed to already
> allow
> >>> what I think this proposal is trying to recreate. All three are very
> simple
> >>> to construct in a direct, non-flatbuffer dependent read/write pattern.
> >>
> >> Are they?  Personally, I wouldn't know how to do that.  I don't know
> >> which encoding Flatbuffers use, whether it's C ABI-compatible (how could
> >> it be? if it's portable accross different platforms, then it's probably
> >> not compatible with any particular platform's C ABI, or only as a
> >> conincidence), how I'm supposed to make use of the "offset" field, or
> >> what the lifetime / ownership of all this data is.
> >>
> >> I may be missing something, but if the answer is that it's easy to
> >> reimplement Flatbuffers' encoding without relying on the Flatbuffers
> >> project's source code, I'm a bit skeptical.
> >>
> >> Regards
> >>
> >> Antoine.
> >>
> >>
> >>>
> >>> struct FieldNode {
> >>>   length: long;
> >>>   null_count: long;
> >>> }
> >>>
> >>> struct Buffer {
> >>>   offset: long;
> >>>   length: long;
> >>> }
> >>>
> >>> table RecordBatch {
> >>>   length: long;
> >>>   nodes: [FieldNode];
> >>>   buffers: [Buffer];
> >>> }
> >>>
> >>> On Sat, Sep 28, 2019 at 9:02 PM Jacques Nadeau 
> wrote:
> >>>
>  I'm not clear on why we need to introduce something beyond what
>  flatbuffers already provides. Can someone explain that to me? I'm not
>  really a fan of introducing a second representation of the same data
> (as I
>  understand it).
> 
>  On Thu, Sep 19, 2019 at 1:15 

[jira] [Created] (ARROW-6747) [R] Bindings for Plasma object store

2019-09-30 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6747:
---

 Summary: [R] Bindings for Plasma object store
 Key: ARROW-6747
 URL: https://issues.apache.org/jira/browse/ARROW-6747
 Project: Apache Arrow
  Issue Type: New Feature
  Components: R
Reporter: Wes McKinney


Analogous to pyarrow.plasma



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6746) [CI] Run hadolint Dockerfile lint checks somewhere else

2019-09-30 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6746:
---

 Summary: [CI] Run hadolint Dockerfile lint checks somewhere else
 Key: ARROW-6746
 URL: https://issues.apache.org/jira/browse/ARROW-6746
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration
Reporter: Wes McKinney


These checks require Docker, so cannot be run _within_ a Docker container. This 
should not stand in the way of having a Docker container for the lint checks we 
run in CI



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6744) Export JsonEqual trait in the array module

2019-09-30 Thread Kyle McCarthy (Jira)
Kyle McCarthy created ARROW-6744:


 Summary: Export JsonEqual trait in the array module
 Key: ARROW-6744
 URL: https://issues.apache.org/jira/browse/ARROW-6744
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust
Reporter: Kyle McCarthy


ARROW-5901 added checking for array equality against JSON arrays. This added the 
JsonEqual trait bound to the Array trait, but the trait isn't exported, making it 
effectively private.

JsonEqual is a public trait, but the equal module is private and the 
JsonEqual trait isn't re-exported the way the ArrayEqual trait is.

AFAIK this makes it impossible to implement your own arrays that are bound by 
the Array trait.

I suggest that JsonEqual be exported from the array module with pub use, 
like the ArrayEqual trait.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Parquet file reading performance

2019-09-30 Thread Wes McKinney
On Sat, Sep 28, 2019 at 3:16 PM Maarten Ballintijn  wrote:
>
> Hi Joris,
>
> Thanks for your detailed analysis!
>
> We can leave the impact of the large DateTimeIndex on the performance for 
> another time.
> (Notes: my laptop has sufficient memory to support it, no error is thrown, the
> resulting DateTimeIndex from the expression is identical to your version or 
> the other version
> in the test. The large DateTimeIndex is released long before the tests 
> happen, yet it has a
> massive impact?? It feels like something is broken)
>
>
> Thanks for clearly demonstrating that the main issue is with to_pandas().
> That’s very unexpected, in the ’ns’ case I would expect no overhead.
> And even with the ‘us’ case it's only two vector compares and a factor 
> multiply, no?
> Also, Timestamps are quite ubiquitous :-)
>
>
> This leaves me with the following questions:
>
> - Who should I talk to to get this resolved in Pandas?
>
> - Where do I find out more about Parquet v2? And more generally is there an 
> RFC (or similar)
> document that defines the Parquet file format and API?

The one practical barrier to using Parquet V2 endogenously in Python
is resolving PARQUET-458, i.e. implementing the V2 data page encoding
correctly.

If you write V2 files, you may or may not be able to read them
everywhere. So if you are striving for compatibility across many
processing frameworks I would recommend sticking with V1.
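
As a point of reference (a minimal sketch only; the table contents and file
names here are placeholders), the format version is chosen at write time in
pyarrow, as Joris notes further down in this thread:

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.Table.from_pydict({"x": [1, 2, 3]})
    pq.write_table(table, "data_v1.parquet")                 # default: format version "1.0"
    pq.write_table(table, "data_v2.parquet", version="2.0")  # V2 files; may not be readable everywhere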

For other questions I direct you to d...@parquet.apache.org

> - Do you think it would be possible to take the DateTime column out of Arrow 
> into numpy
> and transform it to make it more amenable to Pandas? And possibly even 
> for the value columns?
>
> Thanks again and have a great weekend!
> Maarten.
>
>
>
>
> > On Sep 25, 2019, at 11:57 AM, Joris Van den Bossche 
> >  wrote:
> >
> > From looking a little bit further into this, it seems that it is mainly
> > pandas that is slower in creating a Series from an array of datetime64
> > compared from an array of ints.
> > And especially if it is not nanosecond resolution:
> >
> > In [29]: a_int = pa.array(np.arange(10))
> >
> > In [30]: %timeit a_int.to_pandas()
> > 56.7 µs ± 299 ns per loop (mean ± std. dev. of 7 runs, 1 loops each)
> >
> > In [31]: a_datetime = pa.array(pd.date_range("2012", periods=10,
> > freq='S'))
> >
> > In [32]: %timeit a_datetime.to_pandas()
> > 1.94 ms ± 17.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
> >
> > In [33]: a_datetime_us = pa.array(pd.date_range("2012", periods=10,
> > freq='S'), pa.timestamp('us'))
> >
> > In [34]: %timeit a_datetime_us.to_pandas()
> > 7.78 ms ± 46.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
> >
> > Creating the datetime64 array inside pyarrow is also a bit slower compared
> > to int (causing the slower conversion of a_datetime), but the above
> > difference between between nanosecond and microsecond resolution is largely
> > due to pandas, not pyarrow (because pandas needs to convert the
> > microseconds to nanoseconds, and during that conversion will also check
> > that no datetimes were out of bounds for this resolution).
> >
> > And in parquet, the datetime data of the index column will be stored in
> > microsecond resolution (even if the original pandas data was nanosecond
> > resolution). And the slower reading of the parquet file with datetime index
> > is thus almost entirely due to the above difference in timing of converting
> > the int or datetime index column to pandas.
> > Parquet nowadays actually supports storing nanosecond resolution, and this
> > can be triggered in pyarrow by passing version="2.0" to pq.write_table (but
> > last what I heard this version is not yet considered production ready).
> >
> > Joris
> >
> > On Wed, 25 Sep 2019 at 16:03, Joris Van den Bossche <
> > jorisvandenboss...@gmail.com> wrote:
> >
> >> Hi Maarten,
> >>
> >> Thanks for the reproducible script. I ran it on my laptop on pyarrow
> >> master, and not seeing the difference between both datetime indexes:
> >>
> >> Versions:
> >> Python:   3.7.3 | packaged by conda-forge | (default, Mar 27 2019,
> >> 23:01:00)
> >> [GCC 7.3.0] on linux
> >> numpy:1.16.4
> >> pandas:   0.26.0.dev0+447.gc168ecf26
> >> pyarrow:  0.14.1.dev642+g7f2d637db
> >>
> >> 1073741824 float64 8388608 16
> >> 0: make_dataframe :   1443.483 msec,  709 MB/s
> >> 0: write_arrow_parquet:   7685.426 msec,  133 MB/s
> >> 0: read_arrow_parquet :   1262.741 msec,  811 MB/s <<<
> >> 1: make_dataframe :   1412.575 msec,  725 MB/s
> >> 1: write_arrow_parquet:   7869.145 msec,  130 MB/s
> >> 1: read_arrow_parquet :   1947.896 msec,  526 MB/s <<<
> >> 2: make_dataframe :   1490.165 msec,  687 MB/s
> >> 2: write_arrow_parquet:   7040.507 msec,  145 MB/s
> >> 2: read_arrow_parquet :   1888.316 msec,  542 MB/s <<<
> >>
> >> The only change I needed to make in the script to get it 

Re: Unnesting ListArrays

2019-09-30 Thread Wes McKinney
hi Suhail -- well, unnesting produces an array of a different length.
I would think that unnesting would mainly occur in the context of
analytics, e.g.

list_values.flatten().unique()

We definitely would like to have APIs that help with doing analytics
on nested data. I had hoped to get to work on the DataFrames API in
C++ this year, but there have been other more pressing projects and
issues related to maintaining and scaling up the Arrow community, so it
looks more likely to be a project for 2020.
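
For illustration, a null-preserving variant of flatten() -- the kind of
helper mentioned in the quoted messages below, not an existing pyarrow
API -- could be sketched as:

    import pyarrow as pa

    def flatten_keep_nulls(list_arr):
        # Hypothetical helper: emit one null per null list value instead of
        # dropping it, unlike ListArray.flatten() which only returns the
        # child values.
        out = []
        for value in list_arr.to_pylist():  # e.g. [[0, 1], [0], None, None]
            if value is None:
                out.append(None)
            else:
                out.extend(value)
        return pa.array(out)

    arr = pa.array([[0, 1], [0], None, None])
    print(arr.flatten())            # values: 0, 1, 0
    print(flatten_keep_nulls(arr))  # values: 0, 1, 0, null, null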

- Wes

On Thu, Sep 26, 2019 at 6:09 AM Suhail Razzak  wrote:
>
> Thanks Wes, makes sense. I appreciate that there are use cases where both
> could be applicable.
>
> In my example, the most applicable I can think of is unnesting a ListArray
> column for a DataFrame (in the future C++ DataFrames API?) similar to the
> tidyr unnest function. I don't believe the current implementation wouldn't
> be able to align the flattened ListArray with the rest of the columns. I'll
> see if there's something I can do on this end.
>
> On Wed, Sep 25, 2019 at 6:27 PM Wes McKinney  wrote:
>
> > hi Suhail,
> >
> > This follows the columnar format closely. The List layout is composed
> > from a child array providing the "inner" values, which are given the
> > List interpretation by adding an offsets buffer, and a validity
> > buffer to distinguish null from 0-length list values. So flatten()
> > here just returns the child array, which has only 3 values in the
> > example you gave.
> >
> > A function could be written to insert "null" for List values that are
> > null, but someone would have to write it and give it a name =)
> >
> > - Wes
> >
> > On Wed, Sep 25, 2019 at 5:15 PM Suhail Razzak 
> > wrote:
> > >
> > > Hi,
> > >
> > > I'm working through a certain use case where I'm unnesting ListArrays,
> > but
> > > I noticed something peculiar - null ListValues are not retained in the
> > > unnested array.
> > >
> > > E.g.
> > > In [0]: arr = pa.array([[0, 1], [0], None, None])
> > > In [1]: arr.flatten()
> > > Out [1]: [0, 1, 0]
> > >
> > > While I would have expected [0, 1, 0, null, null].
> > >
> > > I should note that this works if the None is encapsulated in a list. So
> > I'm
> > > guessing this is expected logic and if so, what's the reasoning for that?
> > >
> > > Thanks,
> > > Suhail
> >


Re: Build issues on macOS [newbie]

2019-09-30 Thread Wes McKinney
Thanks for letting us know. If there are any improvements we can make
to the developer documentation, please feel free to open a JIRA or a
pull request to fix

On Mon, Sep 30, 2019 at 8:13 AM Tarek Allam Jr.  wrote:
>
>
> Hi Wes,
>
> Thank you very much, that indeed fixed things and allowed me to complete a 
> build.
>
> After running conda install --file ci/conda_env_cpp.yml I was able to get 
> past
> the above error, but then was faced with the error message akin to that found 
> at
> https://issues.apache.org/jira/browse/ARROW-4935
>
> But this was easily solved with running the suggested solution of
>
> $ cd /Library/Developer/CommandLineTools/Packages/
> $ open macOS_SDK_headers_for_macOS_10.14.pkg
>
> (Just putting links to the errors/issues for my reference)
>
> Thanks again for your help getting me started. I'll now go in search of 
> possible
> areas where I can contribute!
>
> Cheers,
> Tarek
>
>
> On 2019/09/26 19:23:08, Wes McKinney  wrote:
> > It looks like the development toolchain dependencies in
> > conda_env_cpp.yml aren't installed in your "main" conda environment,
> > e.g.
> >
> > https://github.com/apache/arrow/blob/master/ci/conda_env_cpp.yml#L42
> >
> > You can see what's installed by running "conda list"
> >
> > Note that most of these dependencies are optional, but we provide the
> > env files to simplify general development of the project so
> > contributors aren't struggling to produce comprehensive builds.
> >
> > On Wed, Sep 25, 2019 at 11:33 AM Tarek Allam Jr.  
> > wrote:
> > >
> > > Thanks for the advice Uwe and Neal. I tried your suggestion (as well as 
> > > turning many of the flags to off) but then ran into other errors 
> > > afterwards such as:
> > >
> > > -- Using ZSTD_ROOT: /usr/local/anaconda3/envs/main
> > > CMake Error at 
> > > /usr/local/Cellar/cmake/3.15.3/share/cmake/Modules/FindPackageHandleStandardArgs.cmake:137
> > >  (message):
> > >   Could NOT find ZSTD (missing: ZSTD_LIB ZSTD_INCLUDE_DIR)
> > >   
> > > /usr/local/Cellar/cmake/3.15.3/share/cmake/Modules/FindPackageHandleStandardArgs.cmake:378
> > >  (_FPHSA_FAILURE_MESSAGE)
> > >   cmake_modules/FindZSTD.cmake:61 (find_package_handle_standard_args)
> > >   cmake_modules/ThirdpartyToolchain.cmake:181 (find_package)
> > >   cmake_modules/ThirdpartyToolchain.cmake:2033 (resolve_dependency)
> > >   CMakeLists.txt:412 (include)
> > >
> > > I think I will spend some more time to understand CMAKE better and 
> > > familiarise myself with the codebase more before having another go. 
> > > Hopefully in this time conda-forge would have removed the SDK requirement 
> > > as well which like you say should make things much more similar.
> > >
> > > Thanks again,
> > >
> > > Regards,
> > > Tarek
> > >
> > > On 2019/09/19 16:00:09, "Uwe L. Korn"  wrote:
> > > > Hello Tarek,
> > > >
> > > > this error message is normally the one you get when CONDA_BUILD_SYSROOT 
> > > > doesn't point to your 10.9 SDK. Please delete your build folder again 
> > > > and do `export CONDA_BUILD_SYSROOT=..` immediately before running 
> > > > cmake. Running e.g. a conda install will sadly reset this variable to 
> > > > something different and break the build.
> > > >
> > > > As a sidenote: It looks like in 1-2 months that conda-forge will get 
> > > > rid of the SDK requirement, then this will be a bit simpler.
> > > >
> > > > Cheers
> > > > Uwe
> > > >
> > > > On Thu, Sep 19, 2019, at 5:24 PM, Tarek Allam Jr. wrote:
> > > > >
> > > > > Hi all,
> > > > >
> > > > > Firstly I must apologise if what I put here is extremely trivial, but 
> > > > > I am a
> > > > > complete newcomer to the Apache Arrow project and contributing to 
> > > > > Apache in
> > > > > general, but I am very keen to get involved.
> > > > >
> > > > > I'm hoping to help where I can so I recently attempted to complete a 
> > > > > build
> > > > > following the instructions laid out in the 'Python Development' 
> > > > > section of the
> > > > > documentation here:
> > > > >
> > > > > After completing the steps that specifically uses Conda I was able to 
> > > > > create an
> > > > > environment but when it comes to building I am unable to do so.
> > > > >
> > > > > I am on macOS -- 10.14.6 and as outlined in the docs and here
> > > > > (https://stackoverflow.com/a/55798942/4521950) I used 10.9.sdk
> > > > > instead
> > > > > of the latest. I have both added this manually using ccmake and also
> > > > > defining it
> > > > > like so:
> > > > >
> > > > > cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
> > > > >   -DCMAKE_INSTALL_LIBDIR=lib \
> > > > >   -DARROW_FLIGHT=ON \
> > > > >   -DARROW_GANDIVA=ON \
> > > > >   -DARROW_ORC=ON \
> > > > >   -DARROW_PARQUET=ON \
> > > > >   -DARROW_PYTHON=ON \
> > > > >   -DARROW_PLASMA=ON \
> > > > >   -DARROW_BUILD_TESTS=ON \
> > > > >   -DCONDA_BUILD_SYSROOT=/opt/MacOSX10.9.sdk \
> > > > >   -DARROW_DEPENDENCY_SOURCE=AUTO \
> > > > >   ..
> > > > >
> > > > > But it seems that 

[jira] [Created] (ARROW-6742) [C++] Remove usage of boost::filesystem::path from arrow/io/hdfs_internal.cc

2019-09-30 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6742:
---

 Summary: [C++] Remove usage of boost::filesystem::path from 
arrow/io/hdfs_internal.cc
 Key: ARROW-6742
 URL: https://issues.apache.org/jira/browse/ARROW-6742
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


This is the only usage of boost::filesystem in this file. Low priority



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6740) Unable to delete closed MemoryMappedFile on Windows

2019-09-30 Thread Sergey Mozharov (Jira)
Sergey Mozharov created ARROW-6740:
--

 Summary: Unable to delete closed MemoryMappedFile on Windows
 Key: ARROW-6740
 URL: https://issues.apache.org/jira/browse/ARROW-6740
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.14.1
 Environment: Windows 10
Reporter: Sergey Mozharov


{code:java}
import os
import pyarrow as pa

# Create a file and memory-map it
file_name = 'path-to-a-new-file'
mmap = pa.create_memory_map(file_name, 100)

# or open an existing file
# file_name = 'path-to-an-existing-file'
# mmap = pa.memory_map(file_name)

# close it
mmap.close()
mmap.closed  # True

# try to delete it (can't delete until the python interpreter is killed)
os.remove(file_name)  # PermissionError

# Note: opening an existing file as `pa.input_stream` works as expected{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [NIGHTLY] Arrow Build Report for Job nightly-2019-09-30-0

2019-09-30 Thread Krisztián Szűcs
wheel-osx-cp27m has failed with a Travis deployment error.
Created a JIRA to resolve it:
https://issues.apache.org/jira/browse/ARROW-6739

On Mon, Sep 30, 2019 at 3:32 PM Crossbow  wrote:

>
> Arrow Build Report for Job nightly-2019-09-30-0
>
> All tasks:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0
>
> Failed Tasks:
> - docker-r:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-circle-docker-r
> - docker-spark-integration:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-circle-docker-spark-integration
> - wheel-osx-cp27m:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-travis-wheel-osx-cp27m
>
> Succeeded Tasks:
> - ubuntu-bionic:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-azure-ubuntu-bionic
> - homebrew-cpp-autobrew:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-travis-homebrew-cpp-autobrew
> - wheel-osx-cp35m:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-travis-wheel-osx-cp35m
> - ubuntu-xenial:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-azure-ubuntu-xenial
> - docker-rust:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-circle-docker-rust
> - conda-linux-gcc-py36:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-azure-conda-linux-gcc-py36
> - docker-java:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-circle-docker-java
> - centos-6:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-azure-centos-6
> - wheel-osx-cp37m:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-travis-wheel-osx-cp37m
> - ubuntu-bionic-arm64:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-azure-ubuntu-bionic-arm64
> - conda-win-vs2015-py36:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-azure-conda-win-vs2015-py36
> - docker-python-3.7:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-circle-docker-python-3.7
> - wheel-win-cp35m:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-appveyor-wheel-win-cp35m
> - debian-stretch:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-azure-debian-stretch
> - homebrew-cpp:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-travis-homebrew-cpp
> - docker-r-conda:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-circle-docker-r-conda
> - docker-r-sanitizer:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-circle-docker-r-sanitizer
> - ubuntu-disco:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-azure-ubuntu-disco
> - docker-pandas-master:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-circle-docker-pandas-master
> - conda-linux-gcc-py37:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-azure-conda-linux-gcc-py37
> - docker-python-3.6:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-circle-docker-python-3.6
> - docker-docs:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-circle-docker-docs
> - ubuntu-disco-arm64:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-azure-ubuntu-disco-arm64
> - wheel-manylinux1-cp27m:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-travis-wheel-manylinux1-cp27m
> - wheel-manylinux2010-cp37m:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-travis-wheel-manylinux2010-cp37m
> - wheel-manylinux2010-cp27mu:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-travis-wheel-manylinux2010-cp27mu
> - conda-win-vs2015-py37:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-azure-conda-win-vs2015-py37
> - wheel-manylinux2010-cp35m:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-travis-wheel-manylinux2010-cp35m
> - wheel-manylinux1-cp36m:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-travis-wheel-manylinux1-cp36m
> - centos-7-aarch64:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-azure-centos-7-aarch64
> - docker-go:

[jira] [Created] (ARROW-6739) [Packaging][Crossbow] Use crossbow.py upload-artifacts across all CI providers

2019-09-30 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-6739:
--

 Summary: [Packaging][Crossbow] Use crossbow.py upload-artifacts 
across all CI providers
 Key: ARROW-6739
 URL: https://issues.apache.org/jira/browse/ARROW-6739
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging
Reporter: Krisztian Szucs


Appveyor and Travis occasionally time out and fail during artifact uploading 
at the end of the builds. Crossbow has been enhanced with the same functionality to 
work properly on Azure Pipelines.
We should use the same crossbow script uniformly on all CI providers to prevent 
upload timeouts.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[NIGHTLY] Arrow Build Report for Job nightly-2019-09-30-0

2019-09-30 Thread Crossbow


Arrow Build Report for Job nightly-2019-09-30-0

All tasks: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0

Failed Tasks:
- docker-r:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-circle-docker-r
- docker-spark-integration:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-circle-docker-spark-integration
- wheel-osx-cp27m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-travis-wheel-osx-cp27m

Succeeded Tasks:
- ubuntu-bionic:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-azure-ubuntu-bionic
- homebrew-cpp-autobrew:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-travis-homebrew-cpp-autobrew
- wheel-osx-cp35m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-travis-wheel-osx-cp35m
- ubuntu-xenial:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-azure-ubuntu-xenial
- docker-rust:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-circle-docker-rust
- conda-linux-gcc-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-azure-conda-linux-gcc-py36
- docker-java:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-circle-docker-java
- centos-6:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-azure-centos-6
- wheel-osx-cp37m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-travis-wheel-osx-cp37m
- ubuntu-bionic-arm64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-azure-ubuntu-bionic-arm64
- conda-win-vs2015-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-azure-conda-win-vs2015-py36
- docker-python-3.7:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-circle-docker-python-3.7
- wheel-win-cp35m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-appveyor-wheel-win-cp35m
- debian-stretch:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-azure-debian-stretch
- homebrew-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-travis-homebrew-cpp
- docker-r-conda:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-circle-docker-r-conda
- docker-r-sanitizer:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-circle-docker-r-sanitizer
- ubuntu-disco:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-azure-ubuntu-disco
- docker-pandas-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-circle-docker-pandas-master
- conda-linux-gcc-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-azure-conda-linux-gcc-py37
- docker-python-3.6:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-circle-docker-python-3.6
- docker-docs:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-circle-docker-docs
- ubuntu-disco-arm64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-azure-ubuntu-disco-arm64
- wheel-manylinux1-cp27m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-travis-wheel-manylinux1-cp27m
- wheel-manylinux2010-cp37m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-travis-wheel-manylinux2010-cp37m
- wheel-manylinux2010-cp27mu:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-travis-wheel-manylinux2010-cp27mu
- conda-win-vs2015-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-azure-conda-win-vs2015-py37
- wheel-manylinux2010-cp35m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-travis-wheel-manylinux2010-cp35m
- wheel-manylinux1-cp36m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-travis-wheel-manylinux1-cp36m
- centos-7-aarch64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-azure-centos-7-aarch64
- docker-go:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-circle-docker-go
- docker-js:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-circle-docker-js
- wheel-manylinux1-cp27mu:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-30-0-travis-wheel-manylinux1-cp27mu
- centos-7:
  URL: 

Re: Build issues on macOS [newbie]

2019-09-30 Thread Tarek Allam Jr .


Hi Wes,

Thank you very much, that indeed fixed things and allowed me to complete a 
build.

After running conda install --file ci/conda_env_cpp.yml I was able to get past
the above error, but then was faced with the error message akin to that found at
https://issues.apache.org/jira/browse/ARROW-4935

But this was easily solved with running the suggested solution of

$ cd /Library/Developer/CommandLineTools/Packages/
$ open macOS_SDK_headers_for_macOS_10.14.pkg

(Just putting links to the errors/issues for my reference)

Thanks again for your help getting me started. I'll now go in search of possible
areas where I can contribute!

Cheers, 
Tarek


On 2019/09/26 19:23:08, Wes McKinney  wrote: 
> It looks like the development toolchain dependencies in
> conda_env_cpp.yml aren't installed in your "main" conda environment,
> e.g.
> 
> https://github.com/apache/arrow/blob/master/ci/conda_env_cpp.yml#L42
> 
> You can see what's installed by running "conda list"
> 
> Note that most of these dependencies are optional, but we provide the
> env files to simplify general development of the project so
> contributors aren't struggling to produce comprehensive builds.
> 
> On Wed, Sep 25, 2019 at 11:33 AM Tarek Allam Jr.  wrote:
> >
> > Thanks for the advice Uwe and Neal. I tried your suggestion (as well as 
> > turning many of the flags to off) but then ran into other errors afterwards 
> > such as:
> >
> > -- Using ZSTD_ROOT: /usr/local/anaconda3/envs/main
> > CMake Error at 
> > /usr/local/Cellar/cmake/3.15.3/share/cmake/Modules/FindPackageHandleStandardArgs.cmake:137
> >  (message):
> >   Could NOT find ZSTD (missing: ZSTD_LIB ZSTD_INCLUDE_DIR)
> >   
> > /usr/local/Cellar/cmake/3.15.3/share/cmake/Modules/FindPackageHandleStandardArgs.cmake:378
> >  (_FPHSA_FAILURE_MESSAGE)
> >   cmake_modules/FindZSTD.cmake:61 (find_package_handle_standard_args)
> >   cmake_modules/ThirdpartyToolchain.cmake:181 (find_package)
> >   cmake_modules/ThirdpartyToolchain.cmake:2033 (resolve_dependency)
> >   CMakeLists.txt:412 (include)
> >
> > I think I will spend some more time to understand CMAKE better and 
> > familiarise myself with the codebase more before having another go. 
> > Hopefully in this time conda-forge would have removed the SDK requirement 
> > as well which like you say should make things much more similar.
> >
> > Thanks again,
> >
> > Regards,
> > Tarek
> >
> > On 2019/09/19 16:00:09, "Uwe L. Korn"  wrote:
> > > Hello Tarek,
> > >
> > > this error message is normally the one you get when CONDA_BUILD_SYSROOT 
> > > doesn't point to your 10.9 SDK. Please delete your build folder again and 
> > > do `export CONDA_BUILD_SYSROOT=..` immediately before running cmake. 
> > > Running e.g. a conda install will sadly reset this variable to something 
> > > different and break the build.
> > >
> > > As a sidenote: It looks like in 1-2 months that conda-forge will get rid 
> > > of the SDK requirement, then this will be a bit simpler.
> > >
> > > Cheers
> > > Uwe
> > >
> > > On Thu, Sep 19, 2019, at 5:24 PM, Tarek Allam Jr. wrote:
> > > >
> > > > Hi all,
> > > >
> > > > Firstly I must apologise if what I put here is extremely trivial, but I 
> > > > am a
> > > > complete newcomer to the Apache Arrow project and contributing to 
> > > > Apache in
> > > > general, but I am very keen to get involved.
> > > >
> > > > I'm hoping to help where I can so I recently attempted to complete a 
> > > > build
> > > > following the instructions laid out in the 'Python Development' section 
> > > > of the
> > > > documentation here:
> > > >
> > > > After completing the steps that specifically uses Conda I was able to 
> > > > create an
> > > > environment but when it comes to building I am unable to do so.
> > > >
> > > > I am on macOS -- 10.14.6 and as outlined in the docs and here
> > > > (https://stackoverflow.com/a/55798942/4521950) I used 10.9.sdk
> > > > instead
> > > > of the latest. I have both added this manually using ccmake and also
> > > > defining it
> > > > like so:
> > > >
> > > > cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
> > > >   -DCMAKE_INSTALL_LIBDIR=lib \
> > > >   -DARROW_FLIGHT=ON \
> > > >   -DARROW_GANDIVA=ON \
> > > >   -DARROW_ORC=ON \
> > > >   -DARROW_PARQUET=ON \
> > > >   -DARROW_PYTHON=ON \
> > > >   -DARROW_PLASMA=ON \
> > > >   -DARROW_BUILD_TESTS=ON \
> > > >   -DCONDA_BUILD_SYSROOT=/opt/MacOSX10.9.sdk \
> > > >   -DARROW_DEPENDENCY_SOURCE=AUTO \
> > > >   ..
> > > >
> > > > But it seems that whatever I try, I seem to get errors, the main only 
> > > > tripping
> > > > me up at the moment is:
> > > >
> > > > -- Building using CMake version: 3.15.3
> > > > -- The C compiler identification is Clang 4.0.1
> > > > -- The CXX compiler identification is Clang 4.0.1
> > > > -- Check for working C compiler:
> > > > /usr/local/anaconda3/envs/pyarrow-dev/bin/clang
> > > > -- Check for working C compiler:
> > > > 

[jira] [Created] (ARROW-6738) [Java] Fix problems with current union comparison logic

2019-09-30 Thread Liya Fan (Jira)
Liya Fan created ARROW-6738:
---

 Summary: [Java] Fix problems with current union comparison logic
 Key: ARROW-6738
 URL: https://issues.apache.org/jira/browse/ARROW-6738
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


There are some problems with the current union comparison logic. For example:
1. For type check, we should not require fields to be equal. It is possible 
that two vectors' value ranges are equal but their fields are different.
2. We should not compare the number of sub vectors, as it is possible that two 
union vectors have different numbers of sub vectors, but have equal values in 
the range.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [DISCUSS][Java] Reduce the range of synchronized block when releasing an ArrowBuf

2019-09-30 Thread Antoine Pitrou


I will just point out that using an atomic counter or boolean /outside/
of a locked section is a common pattern in C++.  The benefit comes up if
the locked section is conditional and the condition is rarely true.

Regards

Antoine.


Le 30/09/2019 à 06:24, Jacques Nadeau a écrit :
> For others that don't realize, the discussion of this is happening on the
> pull request here:
> 
> https://github.com/apache/arrow/pull/5526
> 
> On Fri, Sep 27, 2019 at 4:52 AM Fan Liya  wrote:
> 
>> Dear all,
>>
>> When releasing an ArrowBuf, we will run the following piece of code:
>>
>> private int decrement(int decrement) {
>>   allocator.assertOpen();
>>   final int outcome;
>>   synchronized (allocationManager) {
>>     outcome = bufRefCnt.addAndGet(-decrement);
>>     if (outcome == 0) {
>>       lDestructionTime = System.nanoTime();
>>       allocationManager.release(this);
>>     }
>>   }
>>   return outcome;
>> }
>>
>> It can be seen that we need to acquire the lock for allocation manager
>> lock, no matter if we need to release the buffer. In addition, the
>> operation of decrementing refcount is only carried out after the lock is
>> acquired. This leads to unnecessary lock contention, and may degrade
>> performance.
>>
>> We propose to change the code like this:
>>
>> private int decrement(int decrement) {
>>   allocator.assertOpen();
>>   final int outcome;
>>   outcome = bufRefCnt.addAndGet(-decrement);
>>   if (outcome == 0) {
>>     lDestructionTime = System.nanoTime();
>>     synchronized (allocationManager) {
>>       allocationManager.release(this);
>>     }
>>   }
>>   return outcome;
>> }
>>
>> Note that this change can be dangerous, as it lies in the core of our code
>> base, so we should be careful with it. On the other hand, it may have
>> non-trivial performance implications. For example, when a distributed task
>> is getting closed, a large number of ArrowBuf will be closed
>> simultaneously. If we reduce the range of the synchronization block, we can
>> significantly improve the performance.
>>
>> Would you please give your valuable feedback?
>>
>>
>> Best,
>>
>> Liya Fan
>>
> 


Re: Subject: [VOTE] Release Apache Arrow 0.15.0 - RC1

2019-09-30 Thread Krisztián Szűcs
On Mon, Sep 30, 2019 at 12:27 AM Wes McKinney  wrote:

> OK. I think an RC2 can be based off of the current master branch to
> make things simple. Do any more patches need to be cherry-picked?
> There are some other C# protocol-related bug fixes but they seem
> incomplete
>
I'll use the master branch then. I'll need some JIRA gardening to update
the
version numbers for JIRAs merged after the RC1. I can start to cut RC2 later
today, after 12:00 UTC.

>
> On Sun, Sep 29, 2019 at 3:23 PM Micah Kornfield 
> wrote:
> >
> > Krisztián do you have availability to cut a new RC (I won't be able to
> sort
> > out my key issues for at least a week)?
> >
> > To answer Wes's question earlier in the thread it would be nice to
> include
> > the C# fix for file formats [1].
> >
> > Thanks,
> > Micah
> >
> > [1]
> >
> https://github.com/apache/arrow/commit/5fec051d93a0e647c2330564d6d9eb0c7c896e48
> >
> > On Sun, Sep 29, 2019 at 11:30 AM Micah Kornfield 
> > wrote:
> >
> > > I'm going to vote -1 on this RC.  I think we should cut a new RC with
> > > someone whose key has been verified (I think it is the best thing from a
> > > security standpoint).  I didn't realize the noise that would be caused
> > > by the lack of cross-signing.  I apologise for the churn.
> > >
> > > On Sunday, September 29, 2019, Andy Grove 
> wrote:
> > >
> > >> Just fyi on the rustfmt issue, the formatting was recently updated for
> > >> rust
> > >> 1.40 nightly and if you are using an older version the formatting
> check
> > >> will fail.
> > >>
> > >> On Sun, Sep 29, 2019, 5:56 AM Wes McKinney 
> wrote:
> > >>
> > >> > It's up to Micah as RM, but I think it would be good to fix the
> > >> sig-related
> > >> > issues or we may be dealing with "bug" reports until the next
> release.
> > >> I'll
> > >> > work on source verification later today in the meantime to see if
> any
> > >> other
> > >> > issues turn up
> > >> >
> > >> > On Sun, Sep 29, 2019, 1:19 AM Sutou Kouhei 
> wrote:
> > >> >
> > >> > > -0 (binding)
> > >> > >
> > >> > > I ran the followings on Debian GNU/Linux sid:
> > >> > >
> > >> > >   * TEST_CSHARP=0 \
> > >> > >   TEST_GLIB=0 \
> > >> > >   TEST_RUBY=0 \
> > >> > >   TEST_RUST=0 \
> > >> > >   JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 \
> > >> > >   CUDA_TOOLKIT_ROOT=/usr \
> > >> > > dev/release/verify-release-candidate.sh source 0.15.0 1
> > >> > >   * dev/release/verify-release-candidate.sh binaries 0.14.1 0
> > >> > >
> > >> > > with:
> > >> > >
> > >> > >   * gcc (Debian 9.2.1-7) 9.2.1
> > >> > >   * openjdk version "1.8.0_212"
> > >> > >   * Node.JS v12.1.0
> > >> > >   * go version go1.12.9 linux/amd64
> > >> > >   * nvidia-cuda-dev 10.1.105-3
> > >> > >
> > >> > >
> > >> > > I got the following failures:
> > >> > >
> > >> > >   * Not ignorable:
> > >> > > * Binary: Bad signature
> > >> > >   * centos-rc/6/Source/repodata/repomd.xml is failed
> > >> > >   * We can't ignore this if removing the file from
> > >> > > https://bintray.com/apache/arrow/centos-rc/0.15.0-rc1 and
> > >> > > re-uploading it doesn't solve this problem.
> > >> > >
> > >> > >   * Ignorable:
> > >> > > * C GLib and Ruby: Buildable but can't run test with GLib
> 2.62.0.
> > >> > >   * It's caused by gobject-introspection gem.
> > >> > >   * This is a known problem and not a C GLib problem.
> > >> > >   * We can ignore this. (I'm fixing gobject-introspection
> gem.)
> > >> > > * Rust: "cargo +stable fmt --all -- --check" is failed (*)
> > >> > >   * If I commented the command line out, Rust verification is
> > >> passed.
> > >> > >   * We can ignore this. Because this is just a lint error.
> > >> > > * C#: "sourcelink test" is failed
> > >> > >   * We can ignore this. This is happened when we release
> 0.14.1
> > >> too.
> > >> > > * APT and Yum: arm64 and aarch64 are broken
> > >> > >   * We can ignore this.
> > >> > >
> > >> > > (*)
> > >> > > 
> > >> > > + cargo +stable fmt --all -- --check
> > >> > > Diff in
> > >> /tmp/arrow-0.15.0.tGMnP/apache-arrow-0.15.0/rust/arrow/src/array/
> > >> > > builder.rs at line 1458:
> > >> > >  let mut builder = StructBuilder::new(fields,
> field_builders);
> > >> > >
> assert!(builder.field_builder::(0).is_none());
> > >> > >  }
> > >> > > -
> > >> > >  }
> > >> > >
> > >> > > Diff in
> /tmp/arrow-0.15.0.tGMnP/apache-arrow-0.15.0/rust/arrow/src/
> > >> > > bitmap.rs at line 126:
> > >> > >  assert_eq!(true, bitmap.is_set(6));
> > >> > >  assert_eq!(false, bitmap.is_set(7));
> > >> > >  }
> > >> > > -
> > >> > >  }
> > >> > >
> > >> > > Diff in
> > >> > >
> > >> >
> > >>
> /tmp/arrow-0.15.0.tGMnP/apache-arrow-0.15.0/rust/datafusion/src/execution/
> > >> > > aggregate.rs at line 1471:
> > >> > >  ds,
> > >> > >  )
> > >> > >  }
> > >> > > -
> > >> > >  }
> > >> > >
> > >> > > Diff in
> > >> > >
> > >> >
> > >>
> 

Re: Subject: [VOTE] Release Apache Arrow 0.15.0 - RC1

2019-09-30 Thread Krisztián Szűcs
Hey Micah!

On Sun, Sep 29, 2019 at 10:23 PM Micah Kornfield 
wrote:

> Krisztián do you have availability to cut a new RC (I won't be able to
> sort out my key issues for at least a week)?
>
Yes, I can cut RC2 later today.

>
> To answer Wes's question earlier in the thread it would be nice to include
> the C# fix for file formats [1].
>
> Thanks,
> Micah
>
> [1]
> https://github.com/apache/arrow/commit/5fec051d93a0e647c2330564d6d9eb0c7c896e48
>
> On Sun, Sep 29, 2019 at 11:30 AM Micah Kornfield 
> wrote:
>
>> I'm going to vote -1 on this RC.  I think we should cut a new RC with
>> someone whose key has been verified (I think it is the best thing from a
>> security standpoint).  I didn't realize the noise that would be caused by
>> the lack of cross-signing.  I apologise for the churn.
>>
>> On Sunday, September 29, 2019, Andy Grove  wrote:
>>
>>> Just fyi on the rustfmt issue, the formatting was recently updated for
>>> rust
>>> 1.40 nightly and if you are using an older version the formatting check
>>> will fail.
>>>
>>> On Sun, Sep 29, 2019, 5:56 AM Wes McKinney  wrote:
>>>
>>> > It's up to Micah as RM, but I think it would be good to fix the
>>> sig-related
>>> > issues or we may be dealing with "bug" reports until the next release.
>>> I'll
>>> > work on source verification later today in the meantime to see if any
>>> other
>>> > issues turn up
>>> >
>>> > On Sun, Sep 29, 2019, 1:19 AM Sutou Kouhei  wrote:
>>> >
>>> > > -0 (binding)
>>> > >
>>> > > I ran the followings on Debian GNU/Linux sid:
>>> > >
>>> > >   * TEST_CSHARP=0 \
>>> > >   TEST_GLIB=0 \
>>> > >   TEST_RUBY=0 \
>>> > >   TEST_RUST=0 \
>>> > >   JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 \
>>> > >   CUDA_TOOLKIT_ROOT=/usr \
>>> > > dev/release/verify-release-candidate.sh source 0.15.0 1
>>> > >   * dev/release/verify-release-candidate.sh binaries 0.14.1 0
>>> > >
>>> > > with:
>>> > >
>>> > >   * gcc (Debian 9.2.1-7) 9.2.1
>>> > >   * openjdk version "1.8.0_212"
>>> > >   * Node.JS v12.1.0
>>> > >   * go version go1.12.9 linux/amd64
>>> > >   * nvidia-cuda-dev 10.1.105-3
>>> > >
>>> > >
>>> > > I got the following failures:
>>> > >
>>> > >   * Not ignorable:
>>> > > * Binary: Bad signature
>>> > >   * centos-rc/6/Source/repodata/repomd.xml is failed
>>> > >   * We can't ignore this if removing the file from
>>> > > https://bintray.com/apache/arrow/centos-rc/0.15.0-rc1 and
>>> > > re-uploading it doesn't solve this problem.
>>> > >
>>> > >   * Ignorable:
>>> > > * C GLib and Ruby: Buildable but can't run test with GLib 2.62.0.
>>> > >   * It's caused by gobject-introspection gem.
>>> > >   * This is a known problem and not a C GLib problem.
>>> > >   * We can ignore this. (I'm fixing gobject-introspection gem.)
>>> > > * Rust: "cargo +stable fmt --all -- --check" is failed (*)
>>> > >   * If I commented the command line out, Rust verification is
>>> passed.
>>> > >   * We can ignore this. Because this is just a lint error.
>>> > > * C#: "sourcelink test" is failed
>>> > >   * We can ignore this. This is happened when we release 0.14.1
>>> too.
>>> > > * APT and Yum: arm64 and aarch64 are broken
>>> > >   * We can ignore this.
>>> > >
>>> > > (*)
>>> > > 
>>> > > + cargo +stable fmt --all -- --check
>>> > > Diff in
>>> /tmp/arrow-0.15.0.tGMnP/apache-arrow-0.15.0/rust/arrow/src/array/
>>> > > builder.rs at line 1458:
>>> > >  let mut builder = StructBuilder::new(fields,
>>> field_builders);
>>> > >
>>> assert!(builder.field_builder::(0).is_none());
>>> > >  }
>>> > > -
>>> > >  }
>>> > >
>>> > > Diff in /tmp/arrow-0.15.0.tGMnP/apache-arrow-0.15.0/rust/arrow/src/
>>> > > bitmap.rs at line 126:
>>> > >  assert_eq!(true, bitmap.is_set(6));
>>> > >  assert_eq!(false, bitmap.is_set(7));
>>> > >  }
>>> > > -
>>> > >  }
>>> > >
>>> > > Diff in
>>> > >
>>> >
>>> /tmp/arrow-0.15.0.tGMnP/apache-arrow-0.15.0/rust/datafusion/src/execution/
>>> > > aggregate.rs at line 1471:
>>> > >  ds,
>>> > >  )
>>> > >  }
>>> > > -
>>> > >  }
>>> > >
>>> > > Diff in
>>> > >
>>> >
>>> /tmp/arrow-0.15.0.tGMnP/apache-arrow-0.15.0/rust/datafusion/src/execution/
>>> > > context.rs at line 682:
>>> > >
>>> > >  Ok(ctx)
>>> > >  }
>>> > > -
>>> > >  }
>>> > >
>>> > > Diff in
>>> > >
>>> >
>>> /tmp/arrow-0.15.0.tGMnP/apache-arrow-0.15.0/rust/datafusion/src/execution/physical_plan/
>>> > > hash_aggregate.rs at line 720:
>>> > >
>>> > >  Ok(())
>>> > >  }
>>> > > -
>>> > >  }
>>> > >
>>> > > Diff in
>>> > >
>>> >
>>> /tmp/arrow-0.15.0.tGMnP/apache-arrow-0.15.0/rust/datafusion/src/execution/physical_plan/
>>> > > merge.rs at line 134:
>>> > >
>>> > >  Ok(())
>>> > >  }
>>> > > -
>>> > >  }
>>> > >
>>> > > Diff in
>>> > >
>>> >
>>> /tmp/arrow-0.15.0.tGMnP/apache-arrow-0.15.0/rust/datafusion/src/execution/physical_plan/
>>> > > projection.rs at line 171:
>>> > >
>>> > >  Ok(())
>>>