[jira] [Created] (ARROW-6641) Remove Deprecated WriteableFile warning

2019-09-19 Thread Karthikeyan Natarajan (Jira)
Karthikeyan Natarajan created ARROW-6641:


 Summary: Remove Deprecated WriteableFile warning
 Key: ARROW-6641
 URL: https://issues.apache.org/jira/browse/ARROW-6641
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 0.14.1, 0.14.0
Reporter: Karthikeyan Natarajan


The current version is 0.14.1. As per the comment below, the deprecated
`WriteableFile` alias should have been removed after 0.13.

 
{code:cpp}
// TODO(kszucs): remove this after 0.13
#ifndef _MSC_VER
using WriteableFile ARROW_DEPRECATED("Use WritableFile") = WritableFile;
using ReadableFileInterface ARROW_DEPRECATED("Use RandomAccessFile") = RandomAccessFile;
#else
// MSVC does not like using ARROW_DEPRECATED with using declarations
using WriteableFile = WritableFile;
using ReadableFileInterface = RandomAccessFile;
#endif
{code}
 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6640) [C++] Error when BufferedInputStream Peek more than bytes buffered

2019-09-19 Thread Zherui Cao (Jira)
Zherui Cao created ARROW-6640:
-

 Summary: [C++] Error when BufferedInputStream Peek more than bytes buffered
 Key: ARROW-6640
 URL: https://issues.apache.org/jira/browse/ARROW-6640
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Zherui Cao
Assignee: Zherui Cao


An example:

BufferedInputStream::Peek(10) is called when only 8 buffered bytes remain 
(buffer_pos is 2 at that point).

The stream then grows the buffer by 2 bytes. In the meantime, buffer_pos is 
reset to 0, although it should remain 2.

Resetting buffer_pos this way causes problems: the already-consumed bytes 
before it are exposed again.
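
A simplified model of the bookkeeping in question (an illustration only, not 
the actual BufferedInputStream code; names here are made up):

{code:cpp}
#include <cstdint>
#include <cstring>
#include <vector>

// Toy model of the rebuffering step for a Peek larger than the unread
// remainder. Invariant: bytes before buffer_pos were already consumed and
// must never be re-exposed, so buffer_pos may only be reset to 0 after the
// unread suffix has been moved to the front of the buffer.
struct ToyBufferedReader {
  std::vector<uint8_t> buffer;
  int64_t buffer_pos = 0;      // offset of the first unread byte
  int64_t bytes_buffered = 0;  // number of unread bytes

  void GrowForPeek(int64_t nbytes) {
    // Compact first; only then is it safe to reset buffer_pos.
    std::memmove(buffer.data(), buffer.data() + buffer_pos, bytes_buffered);
    buffer_pos = 0;
    if (nbytes > static_cast<int64_t>(buffer.size())) buffer.resize(nbytes);
    // The reported bug amounts to resetting buffer_pos without the
    // compaction step, so Peek sees the already-consumed prefix.
  }
};
{code}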

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6639) [Packaging] Improve i386 support with Yum task

2019-09-19 Thread Kentaro Hayashi (Jira)
Kentaro Hayashi created ARROW-6639:
--

 Summary: [Packaging] Improve i386 support with Yum task
 Key: ARROW-6639
 URL: https://issues.apache.org/jira/browse/ARROW-6639
 Project: Apache Arrow
  Issue Type: Bug
  Components: Packaging
Reporter: Kentaro Hayashi


The apt:build rake task supports specifying the architecture to run [1], but
 the same is not true for the yum task.

 [1] 
[https://github.com/apache/arrow/blob/master/dev/tasks/linux-packages/package-task.rb#L276]

It would be useful if the yum task also supported specifying an architecture 
(e.g. i386), even though CentOS 6 i386 reaches EOL in 2020/11.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6638) [C++] Set ARROW_JEMALLOC=off by default

2019-09-19 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6638:
---

 Summary: [C++] Set ARROW_JEMALLOC=off by default
 Key: ARROW-6638
 URL: https://issues.apache.org/jira/browse/ARROW-6638
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


Enabling jemalloc is relevant for developers and packagers, who will want to 
use this allocator to achieve much better performance. We should very clearly 
advise average users of Apache Arrow to build the core libraries with jemalloc 
included, but not necessarily force its use out of the box.
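
Concretely, the opt-in amounts to a build-time switch behind a stable 
allocation interface. A rough sketch only (AllocateAligned is a made-up 
name, not the actual memory_pool.cc code):

{code:cpp}
#include <cstdlib>
#ifdef ARROW_JEMALLOC
#include <jemalloc/jemalloc.h>
#endif

// Packagers building with -DARROW_JEMALLOC=ON get the jemalloc-backed path;
// the default build falls back to the system allocator and so carries no
// extra dependency.
void* AllocateAligned(std::size_t size) {
#ifdef ARROW_JEMALLOC
  return mallocx(size, MALLOCX_ALIGN(64));  // jemalloc aligned allocation
#else
  void* out = nullptr;
  return posix_memalign(&out, 64, size) == 0 ? out : nullptr;
#endif
}
{code}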



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [DISCUSS] Changing C++ build system default options to produce more barebones builds

2019-09-19 Thread Wes McKinney
hi Micah,


On Thu, Sep 19, 2019 at 12:41 AM Micah Kornfield  wrote:
>
> >
> > * Should optional components be "opt in", "opt out", or a mix?
> > Currently it's a mix, and that's confusing for people. I think we
> > should make them all "opt in".
>
> Agreed they should all be opt in by default.  I think active developers are
> quite adept at flipping the appropriate CMake flags.
>

Cool. I opened a tracking JIRA
https://issues.apache.org/jira/browse/ARROW-6637 and attached many
issues. Sorry for the new JIRA flood.

>
> > * Do we want to bring the out-of-the-box core build down to zero
> > dependencies, including not depending on boost::filesystem and
> > possibly checking in the compiled Flatbuffers files.
>
>  While it may be
> > slightly more maintenance work, I think the optics of a
> > "dependency-free" core build would be beneficial and help the project
> > marketing-wise.
>
> I'm -.5 on checking in generated artifacts but this is mostly stylistic.
> In the case of flatbuffers it seems like we might be able to get-away with
> vendoring since it should mostly be headers only.
>
> I would prefer to try to come up with more granular components and be
> very conservative on what is "core".  I think it should be possible to have a
> zero-dependency build if only MemoryPool, Buffers, Arrays and ArrayBuilders
> are in a core package [1].  This, combined with the discussion Antoine started
> on an ABI-compatible C layer, would make basic inter-op within a process
> reasonable.  Moving up the stack to IPC and files, there is probably a way
> to package headers separately from implementations.  This would allow other
> projects wishing to integrate with Arrow to bring their own implementations
> without the baggage of boost::filesystem. Would this leave anything besides
> "flatbuffers" as a hard dependency to support IPC?
>

We could indeed split up libarrow into more shared libraries. This
would mean accepting a lot more maintenance effort though, on a team
that is already overburdened. I'm not too keen on that in the short
term.

> Thanks,
> Micah
>
>
> [1] It probably makes sense to go even further and separate out MemoryPool
> and Buffer, so we can break the circular relationship between parquet and
> arrow.

I don't think this is possible even then, particularly in light of my
recent work reading and writing Arrow columnar data "closer to the
metal" inside Parquet, which yielded beneficial performance improvements.

>
> On Wed, Sep 18, 2019 at 8:03 AM Wes McKinney  wrote:
>
> > To be clear I think we should make these changes right after 0.15.0 is
> > released so we aren't playing whackamole with our packaging scripts.
> > I'm happy to take the lead on the work...
> >
> > On Wed, Sep 18, 2019 at 9:54 AM Antoine Pitrou 
> > wrote:
> > >
> > > On Wed, 18 Sep 2019 09:46:54 -0500
> > > Wes McKinney  wrote:
> > > > I think these are both interesting areas to explore further. I'd like
> > > > to focus on the couple of immediate items I think we should address
> > > >
> > > > * Should optional components be "opt in", "opt out", or a mix?
> > > > Currently it's a mix, and that's confusing for people. I think we
> > > > should make them all "opt in".
> > > > * Do we want to bring the out-of-the-box core build down to zero
> > > > dependencies, including not depending on boost::filesystem and
> > > > possibly checking in the compiled Flatbuffers files. While it may be
> > > > slightly more maintenance work, I think the optics of a
> > > > "dependency-free" core build would be beneficial and help the project
> > > > marketing-wise.
> > > >
> > > > Both of these issues must be addressed whether we undertake a Bazel
> > > > implementation or some other refactor of the C++ build system.
> > >
> > > I think checking in the Flatbuffers files (and also Protobuf and Thrift
> > > where applicable :-)) would be fine.
> > >
> > > As for boost::filesystem, getting rid of it wouldn't be a huge task.
> > > Still worth deciding whether we want to prioritize development time for
> > > it, because it's not entirely trivial either.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> >


[jira] [Created] (ARROW-6637) [C++] Zero-dependency default core build

2019-09-19 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6637:
---

 Summary: [C++] Zero-dependency default core build
 Key: ARROW-6637
 URL: https://issues.apache.org/jira/browse/ARROW-6637
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


This is a tracking JIRA for items relating to having few or no dependencies for 
minimal out-of-the-box builds.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6636) [C++] Do not build C++ command line utilities by default

2019-09-19 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6636:
---

 Summary: [C++] Do not build C++ command line utilities by default
 Key: ARROW-6636
 URL: https://issues.apache.org/jira/browse/ARROW-6636
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


This means changing {{ARROW_BUILD_UTILITIES}} to be off by default. The 
utilities are mostly used for integration testing, so building unit or 
integration tests should toggle the option on automatically.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6635) [C++] Do not require glog for default build

2019-09-19 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6635:
---

 Summary: [C++] Do not require glog for default build
 Key: ARROW-6635
 URL: https://issues.apache.org/jira/browse/ARROW-6635
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


We should change the default for {{ARROW_USE_GLOG}} to be off.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6634) [C++] Do not require flatbuffers or flatbuffers_ep to build

2019-09-19 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6634:
---

 Summary: [C++] Do not require flatbuffers or flatbuffers_ep to 
build
 Key: ARROW-6634
 URL: https://issues.apache.org/jira/browse/ARROW-6634
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


Flatbuffers is small enough that we can vendor {{flatbuffers/flatbuffers.h}} 
and check in the compiled files to make flatbuffers_ep unnecessary.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6633) [C++] Do not require double-conversion for default build

2019-09-19 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6633:
---

 Summary: [C++] Do not require double-conversion for default build
 Key: ARROW-6633
 URL: https://issues.apache.org/jira/browse/ARROW-6633
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


This library is only needed in core builds if

* ARROW_JSON=on or
* ARROW_CSV=on (option to be added) or
* ARROW_BUILD_TESTS=on 

The double-conversion headers currently leak into:

* arrow/util/decimal.h
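
One common remedy, sketched here with illustrative names (not the planned 
patch): confine the double-conversion include to the implementation file so 
the public header carries no third-party dependency.

{code:cpp}
// decimal.h (public): no double-conversion include required.
#include <cstdint>
#include <string>

namespace arrow {
// Declared here; defined in decimal.cc, which would be the only translation
// unit that includes <double-conversion/double-conversion.h>.
bool ParseDecimalDigits(const std::string& s, int64_t* out);
}  // namespace arrow
{code}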



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6632) [C++] Do not build with ARROW_COMPUTE=on and ARROW_DATASET=on by default

2019-09-19 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6632:
---

 Summary: [C++] Do not build with ARROW_COMPUTE=on and 
ARROW_DATASET=on by default
 Key: ARROW-6632
 URL: https://issues.apache.org/jira/browse/ARROW-6632
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.15.0


These components make the build more time-consuming, and some "core" users 
will not need them, so it would be better to make them opt-in.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6631) [C++] Do not build with any compression library dependencies by default

2019-09-19 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6631:
---

 Summary: [C++] Do not build with any compression library 
dependencies by default
 Key: ARROW-6631
 URL: https://issues.apache.org/jira/browse/ARROW-6631
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


Numerous packaging scripts will have to be updated if we decide to do this. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6630) [Doc][C++] Document the file readers (CSV, JSON, Parquet, etc.)

2019-09-19 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-6630:
--

 Summary: [Doc][C++] Document the file readers (CSV, JSON, Parquet, 
etc.)
 Key: ARROW-6630
 URL: https://issues.apache.org/jira/browse/ARROW-6630
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Documentation
Reporter: Neal Richardson
 Fix For: 1.0.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6629) [Doc][C++] Document the FileSystem API

2019-09-19 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-6629:
--

 Summary: [Doc][C++] Document the FileSystem API
 Key: ARROW-6629
 URL: https://issues.apache.org/jira/browse/ARROW-6629
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Documentation
Reporter: Neal Richardson
 Fix For: 1.0.0


In ARROW-6622, I was looking for a place in the docs to add a note about path 
normalization, and I couldn't find filesystem docs at all.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6628) [C++] Support dictionary unification on dictionaries having nulls

2019-09-19 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6628:
---

 Summary: [C++] Support dictionary unification on dictionaries 
having nulls
 Key: ARROW-6628
 URL: https://issues.apache.org/jira/browse/ARROW-6628
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


Follow up to ARROW-5343



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6626) [Python] Handle "set" values as lists when converting to Arrow

2019-09-19 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6626:
---

 Summary: [Python] Handle "set" values as lists when converting to 
Arrow
 Key: ARROW-6626
 URL: https://issues.apache.org/jira/browse/ARROW-6626
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney


See the current behavior:

{code}
In [1]: pa.array([{1,2, 3}])
---
ArrowInvalid  Traceback (most recent call last)
<ipython-input-1-...> in <module>()
> 1 pa.array([{1,2, 3}])

~/code/arrow/python/pyarrow/array.pxi in pyarrow.lib.array()

~/code/arrow/python/pyarrow/array.pxi in pyarrow.lib._sequence_to_array()

~/code/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Could not convert {1, 2, 3} with type set: did not recognize 
Python value type when inferring an Arrow data type
In ../src/arrow/python/iterators.h, line 70, code: func(value, static_cast<int64_t>(i), &keep_going)
In ../src/arrow/python/inference.cc, line 621, code: inferrer.VisitSequence(obj, mask)
In ../src/arrow/python/python_to_arrow.cc, line 1074, code: InferArrowType(seq, mask, options.from_pandas, &real_type)
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6625) Allow concat_tables to null or default fill missing columns

2019-09-19 Thread Daniel Nugent (Jira)
Daniel Nugent created ARROW-6625:


 Summary: Allow concat_tables to null or default fill missing 
columns
 Key: ARROW-6625
 URL: https://issues.apache.org/jira/browse/ARROW-6625
 Project: Apache Arrow
  Issue Type: Wish
  Components: Python
Reporter: Daniel Nugent


The concat_tables function currently requires schemas to be identical across 
all tables being concatenated. However, tables occasionally conform on type 
where columns are present, but a given column may be entirely absent from 
some tables.

In this case, allowing for null filling (or default filling) would be ideal.

I imagine this feature would be an optional parameter on the concat_tables 
function. Presumably the argument could be either a boolean in the case of 
blanket null filling, or a mapping type for default filling. If a user wanted 
to default fill some columns but null fill others, they could use None as the 
value (a defaultdict would make it simple to provide a blanket null fill if 
only a few default-value columns were desired).

If no mapping entry is present for an absent column, the function should 
probably raise an error.

The default behavior would be the current behavior, so the default value of 
the parameter should be False or None.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [DISCUSS] C-level in-process array protocol

2019-09-19 Thread Wes McKinney
This is helpful, I will leave some comments on the proposal when I
can, sometime in the next week.

I agree that it would likely be opening a can of worms to create a
semantic mapping between a generalized type grammar and Arrow's
specific logical types defined in Schema.fbs. If we go down this
route, we should probably utilize the simplest possible grammar that
is capable of encoding the Type Flatbuffers union values.
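
To make that concrete, the simplest such grammar could be little more than a
switch over single format characters. A sketch only; the characters below are
arbitrary placeholders, not taken from any proposal:

enum class TypeId { INT32, INT64, FLOAT64, UTF8, UNKNOWN };

TypeId ParseFormat(const char* format) {
  switch (format[0]) {
    case 'i': return TypeId::INT32;    // hypothetical format character
    case 'l': return TypeId::INT64;    // hypothetical
    case 'g': return TypeId::FLOAT64;  // hypothetical
    case 'u': return TypeId::UTF8;     // hypothetical
    default:  return TypeId::UNKNOWN;  // nested types need a richer grammar
  }
}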

On Thu, Sep 19, 2019 at 2:49 PM Antoine Pitrou  wrote:
>
>
> I've posted a draft specification PR here, this should help orient the
> discussion a bit:
> https://github.com/apache/arrow/pull/5442
>
> Regards
>
> Antoine.
>
>
>
> On Wed, 18 Sep 2019 19:52:38 +0200
> Antoine Pitrou  wrote:
> > Hello,
> >
> > One thing that was discussed in the sync call is the ability to easily
> > pass arrays at runtime between Arrow implementations or Arrow-supporting
> > libraries in the same process, without bearing the cost of linking to
> > e.g. the C++ Arrow library.
> >
> > (for example: "Duckdb wants to provide an option to return Arrow data of
> > result sets, but they don't like having Arrow as a dependency")
> >
> > One possibility would be to define a C-level protocol similar in spirit
> > to the Python buffer protocol, which some people may be familiar with (*).
> >
> > The basic idea is to define a simple C struct, which is ABI-stable and
> > describes an Arrow array adequately.  The struct can be stack-allocated.
> > Its definition can also be copied in another project (or interfaced with
> > using a C FFI layer, depending on the language).
> >
> > There is no formal proposal, this message is meant to stir the discussion.
> >
> > Issues to work out:
> >
> > * Memory lifetime issues: where Python simply associates the Py_buffer
> > with a PyObject owner (a garbage-collected Python object), we need
> > another means to control lifetime of pointed areas.  One simple
> > possibility is to include a destructor function pointer in the protocol
> > struct.
> >
> > * Arrow type representation.  We probably need some kind of "format"
> > mini-language to represent Arrow types, so that a type can be described
> > using a `const char*`.  Ideally, primitive types at least should be
> > trivially parsable.  We may take inspiration from Python here (`struct`
> > module format characters, PEP 3118 format additions).
> >
> > Example C struct definition (not a formal proposal!):
> >
> > struct ArrowBuffer {
> >   void* data;
> >   int64_t nbytes;
> >   // Called by the consumer when it doesn't need the buffer anymore
> >   void (*release)(struct ArrowBuffer*);
> >   // Opaque user data (for e.g. the release callback)
> >   void* user_data;
> > };
> >
> > struct ArrowArray {
> >   // Type description
> >   const char* format;
> >   // Data description
> >   int64_t length;
> >   int64_t null_count;
> >   int64_t n_buffers;
> >   // Note: these pointers are probably owned by the ArrowArray struct
> >   // and will be released and free()ed by the release callback.
> >   struct ArrowBuffer* buffers;
> >   struct ArrowArray* dictionary;
> >   // Called by the consumer when it doesn't need the array anymore
> >   void (*release)(struct ArrowArray*);
> >   // Opaque user data (for e.g. the release callback)
> >   void* user_data;
> > };
> >
> > Thoughts?
> >
> > (*) For the record, the reference for the Python buffer protocol:
> > https://docs.python.org/3/c-api/buffer.html#buffer-structure
> > and its C struct definition:
> > https://github.com/python/cpython/blob/v3.7.4/Include/object.h#L181-L195
> >
> > Regards
> >
> > Antoine.
> >
>
>
>


Re: [DISCUSS] C-level in-process array protocol

2019-09-19 Thread Antoine Pitrou


I've posted a draft specification PR here, this should help orient the
discussion a bit:
https://github.com/apache/arrow/pull/5442

Regards

Antoine.



On Wed, 18 Sep 2019 19:52:38 +0200
Antoine Pitrou  wrote:
> Hello,
> 
> One thing that was discussed in the sync call is the ability to easily
> pass arrays at runtime between Arrow implementations or Arrow-supporting
> libraries in the same process, without bearing the cost of linking to
> e.g. the C++ Arrow library.
> 
> (for example: "Duckdb wants to provide an option to return Arrow data of
> result sets, but they don't like having Arrow as a dependency")
> 
> One possibility would be to define a C-level protocol similar in spirit
> to the Python buffer protocol, which some people may be familiar with (*).
> 
> The basic idea is to define a simple C struct, which is ABI-stable and
> describes an Arrow array adequately.  The struct can be stack-allocated.
> Its definition can also be copied in another project (or interfaced with
> using a C FFI layer, depending on the language).
> 
> There is no formal proposal, this message is meant to stir the discussion.
> 
> Issues to work out:
> 
> * Memory lifetime issues: where Python simply associates the Py_buffer
> with a PyObject owner (a garbage-collected Python object), we need
> another means to control lifetime of pointed areas.  One simple
> possibility is to include a destructor function pointer in the protocol
> struct.
> 
> * Arrow type representation.  We probably need some kind of "format"
> mini-language to represent Arrow types, so that a type can be described
> using a `const char*`.  Ideally, primitive types at least should be
> trivially parsable.  We may take inspiration from Python here (`struct`
> module format characters, PEP 3118 format additions).
> 
> Example C struct definition (not a formal proposal!):
> 
> struct ArrowBuffer {
>   void* data;
>   int64_t nbytes;
>   // Called by the consumer when it doesn't need the buffer anymore
>   void (*release)(struct ArrowBuffer*);
>   // Opaque user data (for e.g. the release callback)
>   void* user_data;
> };
> 
> struct ArrowArray {
>   // Type description
>   const char* format;
>   // Data description
>   int64_t length;
>   int64_t null_count;
>   int64_t n_buffers;
>   // Note: these pointers are probably owned by the ArrowArray struct
>   // and will be released and free()ed by the release callback.
>   struct ArrowBuffer* buffers;
>   struct ArrowArray* dictionary;
>   // Called by the consumer when it doesn't need the array anymore
>   void (*release)(struct ArrowArray*);
>   // Opaque user data (for e.g. the release callback)
>   void* user_data;
> };
> 
> Thoughts?
> 
> (*) For the record, the reference for the Python buffer protocol:
> https://docs.python.org/3/c-api/buffer.html#buffer-structure
> and its C struct definition:
> https://github.com/python/cpython/blob/v3.7.4/Include/object.h#L181-L195
> 
> Regards
> 
> Antoine.
> 





Re: [DISCUSS] C-level in-process array protocol

2019-09-19 Thread Antoine Pitrou


I suppose it could be possible for an Arrow array to describe itself
using the ndtypes vocabulary at some point.  However, this is
non-trivial, both on the producer and consumer side.  Moreover, both
sides must ensure they use the same ndtypes description.

The idea here is to have a C data protocol, without any need for a
helper C library, that's as simple as possible and directly expresses the
Arrow data without needing any semantic mapping.  It should also allow
transmission via FFI layers with as little complication as possible.

Which is why it most probably needs to be Arrow-specific.
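
To illustrate the intended simplicity, here is how a producer might fill the
example structs from my earlier message (purely a sketch against a non-formal
example; the "i" format string for int32 is a made-up choice):

#include <cstdint>
#include <cstdlib>
#include <cstring>

static void ReleaseInt32Array(struct ArrowArray* array) {
  // Producer-defined cleanup, invoked by the consumer when done.
  std::free(array->buffers[0].data);
  std::free(array->buffers);
  array->buffers = nullptr;
}

void ExportInt32(const int32_t* values, int64_t length,
                 struct ArrowArray* out) {
  auto* buf = static_cast<ArrowBuffer*>(std::malloc(sizeof(ArrowBuffer)));
  buf->nbytes = length * static_cast<int64_t>(sizeof(int32_t));
  buf->data = std::malloc(static_cast<size_t>(buf->nbytes));
  std::memcpy(buf->data, values, static_cast<size_t>(buf->nbytes));
  buf->release = nullptr;   // lifetime handled by the array-level callback
  buf->user_data = nullptr;
  out->format = "i";        // hypothetical format string for int32
  out->length = length;
  out->null_count = 0;
  out->n_buffers = 1;       // no validity bitmap since null_count == 0
  out->buffers = buf;
  out->dictionary = nullptr;
  out->release = &ReleaseInt32Array;  // consumer calls this when finished
  out->user_data = nullptr;
}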

Regards

Antoine.


On 19/09/2019 at 21:14, Travis Oliphant wrote:
> I know some on this list are familiar, but many may not have seen ndtypes
> in xnd:  https://github.com/xnd-project/ndtypes
> 
> It generalizes PEP 3118 for cross-language data-structure handling.
> 
> Either a dependency on the small C-library libndtypes or using the concepts
> could be done.
> 
> -Travis
> 
> 
> On Wed, Sep 18, 2019 at 10:52 AM Antoine Pitrou  wrote:
> 
>>
>> Hello,
>>
>> One thing that was discussed in the sync call is the ability to easily
>> pass arrays at runtime between Arrow implementations or Arrow-supporting
>> libraries in the same process, without bearing the cost of linking to
>> e.g. the C++ Arrow library.
>>
>> (for example: "Duckdb wants to provide an option to return Arrow data of
>> result sets, but they don't like having Arrow as a dependency")
>>
>> One possibility would be to define a C-level protocol similar in spirit
>> to the Python buffer protocol, which some people may be familiar with (*).
>>
>> The basic idea is to define a simple C struct, which is ABI-stable and
>> describes an Arrow array adequately.  The struct can be stack-allocated.
>> Its definition can also be copied in another project (or interfaced with
>> using a C FFI layer, depending on the language).
>>
>> There is no formal proposal, this message is meant to stir the discussion.
>>
>> Issues to work out:
>>
>> * Memory lifetime issues: where Python simply associates the Py_buffer
>> with a PyObject owner (a garbage-collected Python object), we need
>> another means to control lifetime of pointed areas.  One simple
>> possibility is to include a destructor function pointer in the protocol
>> struct.
>>
>> * Arrow type representation.  We probably need some kind of "format"
>> mini-language to represent Arrow types, so that a type can be described
>> using a `const char*`.  Ideally, primitive types at least should be
>> trivially parsable.  We may take inspiration from Python here (`struct`
>> module format characters, PEP 3118 format additions).
>>
>> Example C struct definition (not a formal proposal!):
>>
>> struct ArrowBuffer {
>>   void* data;
>>   int64_t nbytes;
>>   // Called by the consumer when it doesn't need the buffer anymore
>>   void (*release)(struct ArrowBuffer*);
>>   // Opaque user data (for e.g. the release callback)
>>   void* user_data;
>> };
>>
>> struct ArrowArray {
>>   // Type description
>>   const char* format;
>>   // Data description
>>   int64_t length;
>>   int64_t null_count;
>>   int64_t n_buffers;
>>   // Note: these pointers are probably owned by the ArrowArray struct
>>   // and will be released and free()ed by the release callback.
>>   struct ArrowBuffer* buffers;
>>   struct ArrowArray* dictionary;
>>   // Called by the consumer when it doesn't need the array anymore
>>   void (*release)(struct ArrowArray*);
>>   // Opaque user data (for e.g. the release callback)
>>   void* user_data;
>> };
>>
>> Thoughts?
>>
>> (*) For the record, the reference for the Python buffer protocol:
>> https://docs.python.org/3/c-api/buffer.html#buffer-structure
>> and its C struct definition:
>> https://github.com/python/cpython/blob/v3.7.4/Include/object.h#L181-L195
>>
>> Regards
>>
>> Antoine.
>>
> 
> 


Re: [DISCUSS] IPC buffer layout for Null type

2019-09-19 Thread Wes McKinney
OK, my preference, therefore, would be to rebase and merge my patch
without bothering with backwards compatibility code. The situations
where there would be an issue are fairly esoteric.

https://github.com/apache/arrow/pull/5287
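
For reference, the difference between the two encodings for a length-N Null
column shows up only in the data header (my summary of the thread below, not
normative):

// Placeholder layout (what C++/Go/JS currently write, per the thread):
//   FieldNode{length: N, null_count: N}
//   Buffer{offset: X, length: 0}   // placeholder "validity" buffer
//   Buffer{offset: X, length: 0}   // placeholder "data" buffer
//
// No-buffers layout (the patch): only the FieldNode is written, and the
// array is reconstructed from its length alone.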

On Thu, Sep 19, 2019 at 2:29 PM Antoine Pitrou  wrote:
>
>
> Well, this is an incompatible IPC change, so ideally it should be done
> now, not later.
>
> Regards
>
> Antoine.
>
>
> On Thu, 19 Sep 2019 14:08:37 -0500
> Wes McKinney  wrote:
>
> > I'm concerned about rushing through any patch for this for 0.15.0, but
> > each release with the status quo increases the risk of making changes.
> > Thoughts?
> >
> > On Fri, Sep 6, 2019 at 12:59 PM Wes McKinney  wrote:
> > >
> > > On Fri, Sep 6, 2019 at 12:57 PM Micah Kornfield  
> > > wrote:
> > > >
> > > > >
> > > > > We can't because the buffer layout is not transmitted -- 
> > > > > implementations
> > > > > make assumptions about what Buffer values correspond to each field. 
> > > > > The
> > > > > only thing we could do to signal the change would be to increase the
> > > > > metadata version from V4 to V5.
> > > >
> > > > If we do this within 0.15.0 we could infer from the padding of messages.
> > > >
> > >
> > > That's true. I'd be OK adding backward compatibility code (that we can
> > > probably remove later) to my patch...
> > >
> > > I'm not sure about the other implementations. I think that for non-C++
> > > implementations, because they don't have much application code that can
> > > produce Null arrays, they should simply use the no-buffers layout.
> > >
> > > > On Fri, Sep 6, 2019 at 10:16 AM Wes McKinney  
> > > > wrote:
> > > >
> > > > > On Fri, Sep 6, 2019, 12:08 PM Antoine Pitrou  
> > > > > wrote:
> > > > >
> > > > > >
> > > > > > Null can also come up when converting a column with only NA values 
> > > > > > in a
> > > > > > CSV file.  I don't remember for sure, but I think the same can 
> > > > > > happen
> > > > > > with JSON files as well.
> > > > > >
> > > > > > Can't we accept both forms when reading?  It sounds like it should 
> > > > > > be
> > > > > > reasonably easy.
> > > > > >
> > > > >
> > > > > We can't because the buffer layout is not transmitted -- 
> > > > > implementations
> > > > > make assumptions about what Buffer values correspond to each field. 
> > > > > The
> > > > > only thing we could do to signal the change would be to increase the
> > > > > metadata version from V4 to V5.
> > > > >
> > > > >
> > > > > > Regards
> > > > > >
> > > > > > Antoine.
> > > > > >
> > > > > >
> > > > > > On 06/09/2019 at 17:36, Wes McKinney wrote:
> > > > > > > hi Micah,
> > > > > > >
> > > > > > > Null wouldn't come up that often in practice. It could happen when
> > > > > > > converting from pandas, for example
> > > > > > >
> > > > > > > In [8]: df = pd.DataFrame({'col1': np.array([np.nan] * 10,
> > > > > > dtype=object)})
> > > > > > >
> > > > > > > In [9]: t = pa.table(df)
> > > > > > >
> > > > > > > In [10]: t
> > > > > > > Out[10]:
> > > > > > > pyarrow.Table
> > > > > > > col1: null
> > > > > > > metadata
> > > > > > > 
> > > > > > > {b'pandas': b'{"index_columns": [{"kind": "range", "name": null,
> > > > > > "start": 0, "'
> > > > > > > b'stop": 10, "step": 1}], "column_indexes": [{"name": 
> > > > > > > null,
> > > > > > "field'
> > > > > > > b'_name": null, "pandas_type": "unicode", 
> > > > > > > "numpy_type":
> > > > > > "object", '
> > > > > > > b'"metadata": {"encoding": "UTF-8"}}], "columns": 
> > > > > > > [{"name":
> > > > > > "col1"'
> > > > > > > b', "field_name": "col1", "pandas_type": "empty",
> > > > > > "numpy_type": "o'
> > > > > > > b'bject", "metadata": null}], "creator": {"library":
> > > > > > "pyarrow", "v'
> > > > > > > b'ersion": "0.14.1.dev464+g40d08a751"}, 
> > > > > > > "pandas_version":
> > > > > > "0.24.2"'
> > > > > > > b'}'}
> > > > > > >
> > > > > > > I'm inclined to make the change without worrying about backwards
> > > > > > > compatibility. If people have been persisting data against the
> > > > > > > recommendations of the project, the remedy is to use an older 
> > > > > > > version
> > > > > > > of the library to read the files and write them to something else
> > > > > > > (like Parquet format) in the meantime.
> > > > > > >
> > > > > > > Obviously come 1.0.0 we'll begin to make compatibility guarantees 
> > > > > > > so
> > > > > > > this will be less of an issue.
> > > > > > >
> > > > > > > - Wes
> > > > > > >
> > > > > > > On Thu, Sep 5, 2019 at 11:14 PM Micah Kornfield 
> > > > > > >  > > > > >
> > > > > > wrote:
> > > > > > >>
> > > > > > >> Hi Wes and others,
> > > > > > >> I don't have a sense of where Null arrays get created in the 
> > > > > > >> existing
> > > > > > code
> > > > > > >> base?
> > > > > > >>
> > > > > > >> Also, do you think it is worth the effort to make this backwards
> > > > > > compatible.
> > > > > > >> We could in theory tie the buffer count to having 

[jira] [Created] (ARROW-6624) [C++] Add SparseTensor.ToTensor() method

2019-09-19 Thread Rok Mihevc (Jira)
Rok Mihevc created ARROW-6624:
-

 Summary: [C++] Add SparseTensor.ToTensor() method
 Key: ARROW-6624
 URL: https://issues.apache.org/jira/browse/ARROW-6624
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Rok Mihevc
Assignee: Rok Mihevc


We have functionality to convert (dense) tensors to sparse tensors, but not the 
other way around. Also [see 
discussion|https://github.com/apache/arrow/pull/4446#issuecomment-503792308].
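
The COO case is a straightforward scatter; a standalone sketch of the 
conversion (not the actual Arrow Tensor/SparseTensor API, which would also 
need to handle strides, value types and validation):

{code:cpp}
#include <cstdint>
#include <vector>

// Scatter COO (row, col, value) triplets into a zero-initialized dense
// row-major buffer.
std::vector<double> CooToDense(const std::vector<int64_t>& row,
                               const std::vector<int64_t>& col,
                               const std::vector<double>& values,
                               int64_t nrows, int64_t ncols) {
  std::vector<double> dense(static_cast<size_t>(nrows * ncols), 0.0);
  for (size_t i = 0; i < values.size(); ++i) {
    dense[static_cast<size_t>(row[i] * ncols + col[i])] = values[i];
  }
  return dense;
}
{code}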



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [DISCUSS] IPC buffer layout for Null type

2019-09-19 Thread Antoine Pitrou


Well, this is an incompatible IPC change, so ideally it should be done
now, not later.

Regards

Antoine.


On Thu, 19 Sep 2019 14:08:37 -0500
Wes McKinney  wrote:

> I'm concerned about rushing through any patch for this for 0.15.0, but
> each release with the status quo increases the risk of making changes.
> Thoughts?
> 
> On Fri, Sep 6, 2019 at 12:59 PM Wes McKinney  wrote:
> >
> > On Fri, Sep 6, 2019 at 12:57 PM Micah Kornfield  
> > wrote:  
> > >  
> > > >
> > > > We can't because the buffer layout is not transmitted -- implementations
> > > > make assumptions about what Buffer values correspond to each field. The
> > > > only thing we could do to signal the change would be to increase the
> > > > metadata version from V4 to V5.  
> > >
> > > If we do this within 0.15.0 we could infer from the padding of messages.
> > >  
> >
> > That's true. I'd be OK adding backward compatibility code (that we can
> > probably remove later) to my patch...
> >
> > I'm not sure about the other implementations. I think that for non-C++
> > implementations, because they don't have much application code that can
> > produce Null arrays, they should simply use the no-buffers layout.
> >  
> > > On Fri, Sep 6, 2019 at 10:16 AM Wes McKinney  wrote:
> > >  
> > > > On Fri, Sep 6, 2019, 12:08 PM Antoine Pitrou  wrote:
> > > >  
> > > > >
> > > > > Null can also come up when converting a column with only NA values in 
> > > > > a
> > > > > CSV file.  I don't remember for sure, but I think the same can happen
> > > > > with JSON files as well.
> > > > >
> > > > > Can't we accept both forms when reading?  It sounds like it should be
> > > > > reasonably easy.
> > > > >  
> > > >
> > > > We can't because the buffer layout is not transmitted -- implementations
> > > > make assumptions about what Buffer values correspond to each field. The
> > > > only thing we could do to signal the change would be to increase the
> > > > metadata version from V4 to V5.
> > > >
> > > >  
> > > > > Regards
> > > > >
> > > > > Antoine.
> > > > >
> > > > >
> > > > > On 06/09/2019 at 17:36, Wes McKinney wrote:
> > > > > > hi Micah,
> > > > > >
> > > > > > Null wouldn't come up that often in practice. It could happen when
> > > > > > converting from pandas, for example
> > > > > >
> > > > > > In [8]: df = pd.DataFrame({'col1': np.array([np.nan] * 10,  
> > > > > dtype=object)})  
> > > > > >
> > > > > > In [9]: t = pa.table(df)
> > > > > >
> > > > > > In [10]: t
> > > > > > Out[10]:
> > > > > > pyarrow.Table
> > > > > > col1: null
> > > > > > metadata
> > > > > > 
> > > > > > {b'pandas': b'{"index_columns": [{"kind": "range", "name": null,  
> > > > > "start": 0, "'  
> > > > > > b'stop": 10, "step": 1}], "column_indexes": [{"name": 
> > > > > > null,  
> > > > > "field'  
> > > > > > b'_name": null, "pandas_type": "unicode", "numpy_type": 
> > > > > >  
> > > > > "object", '  
> > > > > > b'"metadata": {"encoding": "UTF-8"}}], "columns": 
> > > > > > [{"name":  
> > > > > "col1"'  
> > > > > > b', "field_name": "col1", "pandas_type": "empty",  
> > > > > "numpy_type": "o'  
> > > > > > b'bject", "metadata": null}], "creator": {"library":  
> > > > > "pyarrow", "v'  
> > > > > > b'ersion": "0.14.1.dev464+g40d08a751"}, 
> > > > > > "pandas_version":  
> > > > > "0.24.2"'  
> > > > > > b'}'}
> > > > > >
> > > > > > I'm inclined to make the change without worrying about backwards
> > > > > > compatibility. If people have been persisting data against the
> > > > > > recommendations of the project, the remedy is to use an older 
> > > > > > version
> > > > > > of the library to read the files and write them to something else
> > > > > > (like Parquet format) in the meantime.
> > > > > >
> > > > > > Obviously come 1.0.0 we'll begin to make compatibility guarantees so
> > > > > > this will be less of an issue.
> > > > > >
> > > > > > - Wes
> > > > > >
> > > > > > On Thu, Sep 5, 2019 at 11:14 PM Micah Kornfield 
> > > > > >  > > > >
> > > > > wrote:  
> > > > > >>
> > > > > >> Hi Wes and others,
> > > > > >> I don't have a sense of where Null arrays get created in the 
> > > > > >> existing  
> > > > > code  
> > > > > >> base?
> > > > > >>
> > > > > >> Also, do you think it is worth the effort to make this backwards
> > > > > compatible.  
> > > > > >> We could in theory tie the buffer count to having the continuation 
> > > > > >>  
> > > > value  
> > > > > >> for alignment.
> > > > > >>
> > > > > >> The one area where I'm slightly concerned is we seem to have users
> > > > > >> in  
> > > > the  
> > > > > >> wild who are depending on backwards compatibility, and I'm trying to
> > > > better  
> > > > > >> understand the odds that we break them.
> > > > > >>
> > > > > >> Thanks,
> > > > > >> Micah
> > > > > >>
> > > > > >> On Thu, Sep 5, 2019 at 7:25 AM Wes McKinney   
> > > > > wrote:  
> > > > > >>  
> > > > > >>> hi folks,
> > > > > >>>
> > > > > >>> One of 

Re: [DISCUSS] C-level in-process array protocol

2019-09-19 Thread Travis Oliphant
I know some on this list are familiar, but many may not have seen ndtypes
in xnd:  https://github.com/xnd-project/ndtypes

It generalizes PEP 3118 for cross-language data-structure handling.

Either a dependency on the small C-library libndtypes or using the concepts
could be done.

-Travis


On Wed, Sep 18, 2019 at 10:52 AM Antoine Pitrou  wrote:

>
> Hello,
>
> One thing that was discussed in the sync call is the ability to easily
> pass arrays at runtime between Arrow implementations or Arrow-supporting
> libraries in the same process, without bearing the cost of linking to
> e.g. the C++ Arrow library.
>
> (for example: "Duckdb wants to provide an option to return Arrow data of
> result sets, but they don't like having Arrow as a dependency")
>
> One possibility would be to define a C-level protocol similar in spirit
> to the Python buffer protocol, which some people may be familiar with (*).
>
> The basic idea is to define a simple C struct, which is ABI-stable and
> describes an Arrow array adequately.  The struct can be stack-allocated.
> Its definition can also be copied in another project (or interfaced with
> using a C FFI layer, depending on the language).
>
> There is no formal proposal, this message is meant to stir the discussion.
>
> Issues to work out:
>
> * Memory lifetime issues: where Python simply associates the Py_buffer
> with a PyObject owner (a garbage-collected Python object), we need
> another means to control lifetime of pointed areas.  One simple
> possibility is to include a destructor function pointer in the protocol
> struct.
>
> * Arrow type representation.  We probably need some kind of "format"
> mini-language to represent Arrow types, so that a type can be described
> using a `const char*`.  Ideally, primitive types at least should be
> trivially parsable.  We may take inspiration from Python here (`struct`
> module format characters, PEP 3118 format additions).
>
> Example C struct definition (not a formal proposal!):
>
> struct ArrowBuffer {
>   void* data;
>   int64_t nbytes;
>   // Called by the consumer when it doesn't need the buffer anymore
>   void (*release)(struct ArrowBuffer*);
>   // Opaque user data (for e.g. the release callback)
>   void* user_data;
> };
>
> struct ArrowArray {
>   // Type description
>   const char* format;
>   // Data description
>   int64_t length;
>   int64_t null_count;
>   int64_t n_buffers;
>   // Note: these pointers are probably owned by the ArrowArray struct
>   // and will be released and free()ed by the release callback.
>   struct ArrowBuffer* buffers;
>   struct ArrowArray* dictionary;
>   // Called by the consumer when it doesn't need the array anymore
>   void (*release)(struct ArrowArray*);
>   // Opaque user data (for e.g. the release callback)
>   void* user_data;
> };
>
> Thoughts?
>
> (*) For the record, the reference for the Python buffer protocol:
> https://docs.python.org/3/c-api/buffer.html#buffer-structure
> and its C struct definition:
> https://github.com/python/cpython/blob/v3.7.4/Include/object.h#L181-L195
>
> Regards
>
> Antoine.
>


-- 

*Travis Oliphant*
CEO
512 826 7480




Re: [NIGHTLY] Arrow Build Report for Job nightly-2019-09-19-0

2019-09-19 Thread Wes McKinney
I just created https://issues.apache.org/jira/browse/ARROW-6623 about
the Dask integration test failure

On Thu, Sep 19, 2019 at 8:35 AM Crossbow  wrote:
>
>
> Arrow Build Report for Job nightly-2019-09-19-0
>
> All tasks: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0
>
> Failed Tasks:
> - docker-dask-integration:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-circle-docker-dask-integration
> - wheel-osx-cp27m:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-travis-wheel-osx-cp27m
> - docker-cpp-fuzzit:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-circle-docker-cpp-fuzzit
> - docker-spark-integration:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-circle-docker-spark-integration
> - docker-pandas-master:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-circle-docker-pandas-master
>
> Succeeded Tasks:
> - docker-r:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-circle-docker-r
> - docker-lint:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-circle-docker-lint
> - wheel-manylinux1-cp27mu:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-travis-wheel-manylinux1-cp27mu
> - wheel-manylinux2010-cp36m:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-travis-wheel-manylinux2010-cp36m
> - wheel-manylinux1-cp37m:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-travis-wheel-manylinux1-cp37m
> - wheel-manylinux2010-cp27mu:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-travis-wheel-manylinux2010-cp27mu
> - wheel-manylinux2010-cp27m:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-travis-wheel-manylinux2010-cp27m
> - wheel-osx-cp35m:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-travis-wheel-osx-cp35m
> - ubuntu-xenial-arm64:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-azure-ubuntu-xenial-arm64
> - conda-linux-gcc-py36:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-azure-conda-linux-gcc-py36
> - docker-rust:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-circle-docker-rust
> - docker-r-conda:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-circle-docker-r-conda
> - docker-cpp-release:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-circle-docker-cpp-release
> - conda-linux-gcc-py27:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-azure-conda-linux-gcc-py27
> - gandiva-jar-osx:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-travis-gandiva-jar-osx
> - wheel-win-cp35m:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-appveyor-wheel-win-cp35m
> - wheel-manylinux2010-cp37m:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-travis-wheel-manylinux2010-cp37m
> - debian-stretch:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-azure-debian-stretch
> - wheel-win-cp36m:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-appveyor-wheel-win-cp36m
> - docker-python-2.7-nopandas:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-circle-docker-python-2.7-nopandas
> - docker-clang-format:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-circle-docker-clang-format
> - wheel-osx-cp37m:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-travis-wheel-osx-cp37m
> - wheel-manylinux1-cp36m:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-travis-wheel-manylinux1-cp36m
> - docker-cpp-static-only:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-circle-docker-cpp-static-only
> - docker-python-3.6:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-circle-docker-python-3.6
> - docker-hdfs-integration:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-circle-docker-hdfs-integration
> - wheel-manylinux1-cp35m:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-travis-wheel-manylinux1-cp35m
> - ubuntu-xenial:
>   URL: 
> 

[jira] [Created] (ARROW-6623) [CI][Python] Dask docker integration test broken perhaps by statistics-related change

2019-09-19 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6623:
---

 Summary: [CI][Python] Dask docker integration test broken perhaps 
by statistics-related change
 Key: ARROW-6623
 URL: https://issues.apache.org/jira/browse/ARROW-6623
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Wes McKinney
 Fix For: 0.15.0


See the new failure:

https://circleci.com/gh/ursa-labs/crossbow/3027?utm_campaign=vcs-integration-link_medium=referral_source=github-build-link

{code}
=== FAILURES ===
___ test_timeseries_nulls_in_schema[pyarrow] ___

tmpdir = local('/tmp/pytest-of-root/pytest-0/test_timeseries_nulls_in_schem0')
engine = 'pyarrow'

def test_timeseries_nulls_in_schema(tmpdir, engine):
tmp_path = str(tmpdir)
ddf2 = (
dask.datasets.timeseries(start="2000-01-01", end="2000-01-03", freq="1h")
.reset_index()
.map_partitions(lambda x: x.loc[:5])
)
ddf2 = ddf2.set_index("x").reset_index().persist()
ddf2.name = ddf2.name.where(ddf2.timestamp == "2000-01-01", None)

ddf2.to_parquet(tmp_path, engine=engine)
ddf_read = dd.read_parquet(tmp_path, engine=engine)

assert_eq(ddf_read, ddf2, check_divisions=False, check_index=False)

# Can force schema validation on each partition in pyarrow
if engine == "pyarrow":
# The schema mismatch should raise an error
with pytest.raises(ValueError):
ddf_read = dd.read_parquet(
tmp_path, dataset={"validate_schema": True}, engine=engine
)
# There should be no error if you specify a schema on write
schema = pa.schema(
[
("x", pa.float64()),
("timestamp", pa.timestamp("ns")),
("id", pa.int64()),
("name", pa.string()),
("y", pa.float64()),
]
)
ddf2.to_parquet(tmp_path, schema=schema, engine=engine)
assert_eq(
>   dd.read_parquet(tmp_path, dataset={"validate_schema": True}, engine=engine),
ddf2,
check_divisions=False,
check_index=False,
)

opt/conda/lib/python3.6/site-packages/dask/dataframe/io/tests/test_parquet.py:1964:
 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
opt/conda/lib/python3.6/site-packages/dask/dataframe/io/parquet/core.py:190: in read_parquet
out = sorted_columns(statistics)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

statistics = ({'columns': [{'max': -0.25838390663957256, 'min': 
-0.979681447427093, 'name': 'x', 'null_count': 0}, {'max': 
Timestam...ull_count': 0}, {'max': 0.8978352477516438, 'min': 
-0.7218571212693894, 'name': 'y', 'null_count': 0}], 'num-rows': 7})

def sorted_columns(statistics):
""" Find sorted columns given row-group statistics

This finds all columns that are sorted, along with appropriate divisions
values for those columns

Returns
---
out: List of {'name': str, 'divisions': List[str]} dictionaries
"""
if not statistics:
return []

out = []
for i, c in enumerate(statistics[0]["columns"]):
if not all(
"min" in s["columns"][i] and "max" in s["columns"][i] for s in 
statistics
):
continue
divisions = [c["min"]]
max = c["max"]
success = True
for stats in statistics[1:]:
c = stats["columns"][i]
>   if c["min"] >= max:
E   TypeError: '>=' not supported between instances of 'numpy.ndarray' and 'str'

opt/conda/lib/python3.6/site-packages/dask/dataframe/io/parquet/core.py:570: 
TypeError
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [DISCUSS] IPC buffer layout for Null type

2019-09-19 Thread Wes McKinney
I'm concerned about rushing through any patch for this for 0.15.0, but
each release with the status quo increases the risk of making changes.
Thoughts?

On Fri, Sep 6, 2019 at 12:59 PM Wes McKinney  wrote:
>
> On Fri, Sep 6, 2019 at 12:57 PM Micah Kornfield  wrote:
> >
> > >
> > > We can't because the buffer layout is not transmitted -- implementations
> > > make assumptions about what Buffer values correspond to each field. The
> > > only thing we could do to signal the change would be to increase the
> > > metadata version from V4 to V5.
> >
> > If we do this within 0.15.0 we could infer from the padding of messages.
> >
>
> That's true. I'd be OK adding backward compatibility code (that we can
> probably remove later) to my patch...
>
> I'm not sure about the other implementations. I think that for non-C++
> implementations, because they don't have much application code that can
> produce Null arrays, they should simply use the no-buffers layout.
>
> > On Fri, Sep 6, 2019 at 10:16 AM Wes McKinney  wrote:
> >
> > > On Fri, Sep 6, 2019, 12:08 PM Antoine Pitrou  wrote:
> > >
> > > >
> > > > Null can also come up when converting a column with only NA values in a
> > > > CSV file.  I don't remember for sure, but I think the same can happen
> > > > with JSON files as well.
> > > >
> > > > Can't we accept both forms when reading?  It sounds like it should be
> > > > reasonably easy.
> > > >
> > >
> > > We can't because the buffer layout is not transmitted -- implementations
> > > make assumptions about what Buffer values correspond to each field. The
> > > only thing we could do to signal the change would be to increase the
> > > metadata version from V4 to V5.
> > >
> > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > >
> > > > On 06/09/2019 at 17:36, Wes McKinney wrote:
> > > > > hi Micah,
> > > > >
> > > > > Null wouldn't come up that often in practice. It could happen when
> > > > > converting from pandas, for example
> > > > >
> > > > > In [8]: df = pd.DataFrame({'col1': np.array([np.nan] * 10,
> > > > dtype=object)})
> > > > >
> > > > > In [9]: t = pa.table(df)
> > > > >
> > > > > In [10]: t
> > > > > Out[10]:
> > > > > pyarrow.Table
> > > > > col1: null
> > > > > metadata
> > > > > 
> > > > > {b'pandas': b'{"index_columns": [{"kind": "range", "name": null,
> > > > "start": 0, "'
> > > > > b'stop": 10, "step": 1}], "column_indexes": [{"name": 
> > > > > null,
> > > > "field'
> > > > > b'_name": null, "pandas_type": "unicode", "numpy_type":
> > > > "object", '
> > > > > b'"metadata": {"encoding": "UTF-8"}}], "columns": 
> > > > > [{"name":
> > > > "col1"'
> > > > > b', "field_name": "col1", "pandas_type": "empty",
> > > > "numpy_type": "o'
> > > > > b'bject", "metadata": null}], "creator": {"library":
> > > > "pyarrow", "v'
> > > > > b'ersion": "0.14.1.dev464+g40d08a751"}, "pandas_version":
> > > > "0.24.2"'
> > > > > b'}'}
> > > > >
> > > > > I'm inclined to make the change without worrying about backwards
> > > > > compatibility. If people have been persisting data against the
> > > > > recommendations of the project, the remedy is to use an older version
> > > > > of the library to read the files and write them to something else
> > > > > (like Parquet format) in the meantime.
> > > > >
> > > > > Obviously come 1.0.0 we'll begin to make compatibility guarantees so
> > > > > this will be less of an issue.
> > > > >
> > > > > - Wes
> > > > >
> > > > > On Thu, Sep 5, 2019 at 11:14 PM Micah Kornfield  > > >
> > > > wrote:
> > > > >>
> > > > >> Hi Wes and others,
> > > > >> I don't have a sense of where Null arrays get created in the existing
> > > > code
> > > > >> base?
> > > > >>
> > > > >> Also, do you think it is worth the effort to make this backwards
> > > > compatible.
> > > > >> We could in theory tie the buffer count to having the continuation
> > > value
> > > > >> for alignment.
> > > > >>
> > > > >> The one area where I'm slightly concerned is we seem to have users in
> > > the
> > > > >> wild who are depending on backwards compatibility, and I'm trying to
> > > better
> > > > >> understand the odds that we break them.
> > > > >>
> > > > >> Thanks,
> > > > >> Micah
> > > > >>
> > > > >> On Thu, Sep 5, 2019 at 7:25 AM Wes McKinney 
> > > > wrote:
> > > > >>
> > > > >>> hi folks,
> > > > >>>
> > > > >>> One of the as-yet-untested (in integration tests) parts of the
> > > > >>> columnar specification is the Null layout. In C++ we additionally
> > > > >>> implemented this by writing two length-0 "placeholder" buffers in 
> > > > >>> the
> > > > >>> RecordBatch data header, but since the Null layout has no memory
> > > > >>> allocated nor any buffers in-memory it may be more proper to write 
> > > > >>> no
> > > > >>> buffers (since the length of the Null layout is all you need to
> > > > >>> reconstruct it). There are 3 implementations of the placeholder
> > > > >>> version (C++, Go, JS, maybe also C#) but 

[jira] [Created] (ARROW-6622) [C++][R] SubTreeFileSystem path error on Windows

2019-09-19 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-6622:
--

 Summary: [C++][R] SubTreeFileSystem path error on Windows
 Key: ARROW-6622
 URL: https://issues.apache.org/jira/browse/ARROW-6622
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, R
Reporter: Neal Richardson
 Fix For: 1.0.0


On ARROW-6438, we got this error on Windows testing out the subtree:

{code}
> test_check("arrow")
  -- 1. Error: SubTreeFilesystem (@test-filesystem.R#86)  

  Unknown error: Underlying filesystem returned path 
'C:/Users/appveyor/AppData/Local/Temp/1/RtmpqWFbxi/working_dir/Rtmp2Dfa6d/file2904934312d/DESCRIPTION',
 which is not a subpath of 
'C:/Users/appveyor/AppData/Local/Temp/1\RtmpqWFbxi/working_dir\Rtmp2Dfa6d\file2904934312d/'
  1: st_fs$GetTargetStats(c("DESCRIPTION", "test", "nope", "DESC.txt")) at 
testthat/test-filesystem.R:86
  2: map(fs___FileSystem__GetTargetStats_Paths(self, x), shared_ptr, class = 
FileStats)
  3: fs___FileSystem__GetTargetStats_Paths(self, x)
  
  == testthat results  
===
  [ OK: 992 | SKIPPED: 2 | WARNINGS: 0 | FAILED: 1 ]
{code}

Notice the mixture of forward slashes and backslashes in the paths, which 
prevents them from matching.

I'm not sure which layer is doing the wrong thing.
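
Whichever layer owns the fix, the check itself comes down to normalizing the 
separators before comparing prefixes; a rough sketch (not the actual 
SubTreeFileSystem code):

{code:cpp}
#include <algorithm>
#include <string>

// Unify Windows separators so prefix comparisons compare like with like.
std::string NormalizeSlashes(std::string path) {
  std::replace(path.begin(), path.end(), '\\', '/');
  return path;
}

bool IsSubPath(const std::string& path, const std::string& base) {
  const std::string p = NormalizeSlashes(path);
  const std::string b = NormalizeSlashes(base);
  return p.compare(0, b.size(), b) == 0;  // p starts with b
}
{code}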



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Collecting Arrow critique and our roadmap on that

2019-09-19 Thread Neal Richardson
Uwe, I think this is an excellent idea. I've started
https://docs.google.com/document/d/1cgN7mYzH30URDTaioHsCP2d80wKKHDNs9f5s7vdb2mA/edit?usp=sharing
to collect some ideas and notes. Once we have gathered our thoughts
there, we can put them in the appropriate places.

I think that some of the results will go into the FAQ, some into
documentation (maybe more "how-to" and "getting started" guides in the
respective language docs, as well as some "how to share Arrow data
from X to Y" guides), and other things that we haven't yet done should go
into a sort of Roadmap document on the main website. We have some very
outdated roadmap-related content on the confluence wiki that should be
folded in as appropriate too.

Neal

On Thu, Sep 19, 2019 at 10:26 AM Uwe L. Korn  wrote:
>
> Hello,
>
> there has been a lot of public discussion lately that includes some genuinely 
> informed, valid critiques of things in the Arrow project. From my 
> perspective, these include "there is no STL-native C++ Arrow API", 
> "the base build requires too many dependencies", and "the pyarrow package is 
> really huge and you cannot select single components". These are things we 
> cannot tackle at the moment due to the lack of contributors to the project. 
> But we can use them as a basis to point people who critique the project on 
> these grounds to the fact that this is not intentional but a lack of 
> resources, and they provide another point of entry for new contributors 
> looking for work.
>
> Thus I would like to start a document (possibly on the website) where we list 
> the major critiques of Arrow, mention our long-term solution to each, and 
> note which JIRAs need to be done for it.
>
> Would that be something others would also see as valuable?
>
> There has also been a lot of uninformed criticism; I think that is best 
> combated by documentation, blog posts and public appearances at conferences, 
> and it is not covered by this proposal.
>
> Uwe


Re: [DISCUSS] C-level in-process array protocol

2019-09-19 Thread Zhuo Peng
On Thu, Sep 19, 2019 at 10:56 Antoine Pitrou  wrote:

>
> Le 19/09/2019 à 19:52, Zhuo Peng a écrit :
> >
> > The problems are only potential and theoretical, and won't bite anyone
> > until they occur, though, and they're more likely to happen with pip/wheel
> > than with conda.
> >
> > But anyway, this idea is still nice. I could imagine that, at least in
> > arrow's Python C API, there would be a
> >
> > PyObject* pyarrow_array_from_c_protocol(ArrowArray*);
> >
> > this way the C++ APIs can be avoided while still allowing arrays to be
> > created in C/C++ and used in Python.
>
> Adding a Python C API function is a nice idea.
> However, I *still* don't understand how it will solve your problem.  The
> Cython modules comprising PyArrow will still call the C++ APIs, with the
> ABI problems that entails.

Those calls are internal to libarrow.so and libarrow_python.so, which always
agree on the ABI.

It’s different from the client library having to create an arrow::Array
which may contain, say, a std::vector from gcc5, then pass it to an
Arrow C++ API exposed by libarrow.so, whose definition of std::vector
is from gcc7.

>
>
> Regards
>
> Antoine.
>


Re: [DISCUSS] C-level in-process array protocol

2019-09-19 Thread Antoine Pitrou


Le 19/09/2019 à 19:52, Zhuo Peng a écrit :
> 
> The problems are only potential and theoretical, and won't bite anyone
> until they occur, though, and they're more likely to happen with pip/wheel
> than with conda.
> 
> But anyway, this idea is still nice. I could imagine that, at least in
> arrow's Python C API, there would be a
> 
> PyObject* pyarrow_array_from_c_protocol(ArrowArray*);
> 
> this way the C++ APIs can be avoided while still allowing arrays to be
> created in C/C++ and used in Python.

Adding a Python C API function is a nice idea.
However, I *still* don't understand how it will solve your problem.  The
Cython modules comprising PyArrow will still call the C++ APIs, with the
ABI problems that entails.

Regards

Antoine.


Re: [DISCUSS] C-level in-process array protocol

2019-09-19 Thread Zhuo Peng
On Thu, Sep 19, 2019 at 10:18 AM Antoine Pitrou  wrote:

>
> No, the plan for this proposal is to avoid providing a C API.  Each
> Arrow implementation could produce and consume the C data protocol, for
> example the C++ Array class could add these methods:
>
> class Array {
>   // ...
>
>  public:
>   // Export array to the C data protocol
>   void Share(ArrowArray* out);
>   // Import a C data protocol array
>   static Status FromShared(ArrowArray* input,
>std::shared_ptr* out);
> };
>
> Also, I don't know why a C API exposed by the C++ library would solve
> your problem.  You would still have a problem with bundling the .so,
> symbol conflicts if several libraries load libarrow.so, etc.

The problem is mainly about C++ not being able to provide a stable ABI for
templates (thus STL). If the Arrow C++ library's public headers contain
templates or definitions from STL, the only way to guarantee safety is to
force the client library to use the same toolchain and the same flags with
which the Arrow DSO was built. (Yes, distribution methods like Conda help
mitigate that issue by enforcing a uniform toolchain (almost), but problems
can still occur if, say, a client is built with --std=c++17 while
libarrow.so is built with --std=gnu11; example at [1].)

The problems are only potential and theoretical, and won't bite anyone
until they occur, though, and they're more likely to happen with pip/wheel
than with conda.

But anyway, this idea is still nice. I could imagine that, at least in
arrow's Python C API, there would be a

PyObject* pyarrow_array_from_c_protocol(ArrowArray*);

this way the C++ APIs can be avoided while still allowing arrays to be
created in C/C++ and used in Python.
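
For instance, inside libarrow_python.so such a helper might be shaped roughly 
like the sketch below. The function name comes from this thread and 
Array::FromShared is the draft API being discussed, so both are hypothetical; 
arrow::py::wrap_array is the existing conversion helper in libarrow_python.

{code}
#include <Python.h>

#include "arrow/api.h"
#include "arrow/python/pyarrow.h"

// Hypothetical helper: wrap a C-protocol array into a pyarrow object.
extern "C" PyObject* pyarrow_array_from_c_protocol(ArrowArray* c_array) {
  std::shared_ptr<arrow::Array> array;
  arrow::Status st = arrow::Array::FromShared(c_array, &array);  // draft API
  if (!st.ok()) {
    PyErr_SetString(PyExc_ValueError, st.ToString().c_str());
    return nullptr;
  }
  // Existing libarrow_python helper producing a pyarrow.Array.
  return arrow::py::wrap_array(array);
}
{code}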

[1] https://github.com/tensorflow/tensorflow/issues/23561

Regards
>
> Antoine.
>
>
> Le 19/09/2019 à 18:21, Zhuo Peng a écrit :
> > Hi Antoine,
> >
> > I'm also interested in a stable ABI (previously I posted on this mailing
> > list about the ABI issues I had [1]). Does having such an ABI-stable
> > C-struct imply that there will be a set of C-APIs exposed by the Arrow
> > (C++) library (which I think would lead to a solution to all the inherent
> > ABI issues caused by C++)?
> >
> > [1]
> >
> https://lists.apache.org/thread.html/27b6e2a30cf93c5f5f78de970c68c7d996f538d94ab61431fa342f41@%3Cdev.arrow.apache.org%3E
> >
> > On Thu, Sep 19, 2019 at 1:07 AM Antoine Pitrou 
> wrote:
> >
> >>
> >> Le 19/09/2019 à 09:39, Micah Kornfield a écrit :
> >>> I like the idea of a stable ABI that can be used for in-process
> >>> communication.  For instance, there was a recent question on
> >>> stack-overflow on how to solve this [1].
> >>>
> >>> A couple of thoughts/questions:
> >>> * Would ArrowArray also need a self reference for children arrays?
> >>
> >> Yes, I forgot that.  I also think we don't need a separate Buffer
> >> struct, instead the Array struct should own all its buffers.
> >>
> >>> * Should transferring key-value metadata be in scope?
> >>
> >> Yes.  It could either be in the format string or a separate string.  The
> >> upside of a separate string is that a consumer may ignore it trivially
> >> if it doesn't need the information.
> >>
> >> Another open question is for nested types: does the format string
> >> represent the entire type including children?  Or must child types be
> >> read in the child arrays?  If we mimic ArrayData, then the format
> >> string should represent the entire type; it will then be more complex to
> >> parse.
> >>
> >> We should also make sure that extension types fit in the protocol.
> >>
> >>> * Should the API more closely align the IPC spec (pass a schema
> >> separately
> >>> and list of buffers instead of individual arrays)?
> >>
> >> Then you have something that's not immediately usable (you have to do some
> >> processing to reconstitute the individual arrays).  One goal here is to
> >> minimize implementation costs for producers and consumers.  The
> >> assumption is a data model similar to the C++ ArrayData model; do we
> >> have implementations that use an entirely different model?  Perhaps I
> >> should take a look :-)
> >>
> >> Note that the draft I posted only concerns arrays.  We may also want to
> >> have a C struct for batches or tables.
> >>
> >> Regards
> >>
> >> Antoine.
> >>
> >>
> >>>
> >>> Thanks,
> >>> Micah
> >>>
> >>> [1]
> >>>
> >>
> https://stackoverflow.com/questions/57966032/how-does-apache-arrow-facilitate-no-overhead-for-cross-system-communication/57967220#57967220
> >>>
> >>> On Wed, Sep 18, 2019 at 10:52 AM Antoine Pitrou 
> >> wrote:
> >>>
> 
>  Hello,
> 
>  One thing that was discussed in the sync call is the ability to easily
>  pass arrays at runtime between Arrow implementations or
> Arrow-supporting
>  libraries in the same process, without bearing the cost of linking to
>  e.g. the C++ Arrow library.
> 
>  (for example: "Duckdb wants to provide an option to return Arrow data
> of
>  result sets, 

Collecting Arrow critique and our roadmap on that

2019-09-19 Thread Uwe L. Korn
Hello,

there have been a lot of public discussions lately with some mentions of 
actually informed, valid critique of things in the Arrow project. From my 
perspective, these things include "there is no STL-native C++ Arrow API", "the 
base build requires too many dependencies", "the pyarrow package is really huge 
and you cannot select single components". These are things we cannot tackle at 
the moment due to the lack of contributors to the project. But we can use this 
as a basis to point out to people who critique the project on these grounds 
that this is not intentional but a lack of resources, and it also provides 
another point of entry for new contributors looking for work.

Thus I would like to start a document (possibly on the website) where we list 
the major critiques of Arrow, describe our long-term solution to each, and note 
which JIRAs need to be done for it.

Would that be something others would also see as valuable?

There has also been a lot of uninformed criticism; I think that is best 
combated by documentation, blog posts and public appearances at conferences, 
and it is not covered by this proposal.

Uwe


Re: [DISCUSS] C-level in-process array protocol

2019-09-19 Thread Antoine Pitrou


Le 19/09/2019 à 19:11, Uwe L. Korn a écrit :
> Hello,
> 
> I like this proposal as it will make interfacing inside a process between 
> various Arrow implementations much easier. I'm a bit critical, though, of 
> using a string as the format representation, as one needs to parse it 
> correctly. Couldn't we use the enums we already have and reimplement them as 
> C-defines instead?

We could, but then we need to represent type parameters separately, as
some types are parametric (such as Time-related types).  So we would
still have some kind of encoded representation for those parameters.

So it may be as easy to represent everything inside the format string:
the type class (a single character perhaps) and optionally the type
instance parameters (if necessary).

Note that for non-parametric primitive types such as int64_t, double,
utf8... the format string will be a single character anyway.
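
As a rough illustration of how cheap such parsing could be, here is a sketch 
with invented single-character codes; no format grammar has been agreed on 
yet, so every code below ('l', 'g', 'u', 't') is hypothetical:

{code}
#include <string>

enum class TypeClass { Int64, Float64, Utf8, Timestamp, Unknown };

struct ParsedFormat {
  TypeClass type = TypeClass::Unknown;
  std::string params;  // e.g. unit/timezone for parametric types
};

// Parse a hypothetical format string: one character for the type class,
// any remaining characters hold the type instance parameters.
ParsedFormat ParseFormat(const char* format) {
  ParsedFormat out;
  switch (format[0]) {
    case 'l': out.type = TypeClass::Int64; break;
    case 'g': out.type = TypeClass::Float64; break;
    case 'u': out.type = TypeClass::Utf8; break;
    case 't':
      out.type = TypeClass::Timestamp;
      out.params = format + 1;  // e.g. "ts:UTC" -> params "s:UTC"
      break;
    default: break;
  }
  return out;
}
{code}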

Regards

Antoine.


> 
> Uwe
> 
> On Thu, Sep 19, 2019, at 6:21 PM, Zhuo Peng wrote:
>> Hi Antoine,
>>
>> I'm also interested in a stable ABI (previously I posted on this mailing
>> list about the ABI issues I had [1]). Does having such an ABI-stable
>> C-struct imply that there will be a set of C-APIs exposed by the Arrow
>> (C++) library (which I think would lead to a solution to all the inherent
>> ABI issues caused by C++)?
>>
>> [1]
>> https://lists.apache.org/thread.html/27b6e2a30cf93c5f5f78de970c68c7d996f538d94ab61431fa342f41@%3Cdev.arrow.apache.org%3E
>>
>> On Thu, Sep 19, 2019 at 1:07 AM Antoine Pitrou  wrote:
>>
>>>
>>> Le 19/09/2019 à 09:39, Micah Kornfield a écrit :
 I like the idea of a stable ABI that can be used for in-process
 communication.  For instance, there was a recent question on
 stack-overflow on how to solve this [1].

 A couple of thoughts/questions:
 * Would ArrowArray also need a self reference for children arrays?
>>>
>>> Yes, I forgot that.  I also think we don't need a separate Buffer
>>> struct, instead the Array struct should own all its buffers.
>>>
 * Should transferring key-value metadata be in scope?
>>>
>>> Yes.  It could either be in the format string or a separate string.  The
>>> upside of a separate string is that a consumer may ignore it trivially
>>> if it doesn't need the information.
>>>
>>> Another open question is for nested types: does the format string
>>> represent the entire type including children?  Or must child types be
>>> read in the child arrays?  If we mimic ArrayData, then the format
>>> string should represent the entire type; it will then be more complex to
>>> parse.
>>>
>>> We should also make sure that extension types fit in the protocol.
>>>
 * Should the API more closely align the IPC spec (pass a schema
>>> separately
 and list of buffers instead of individual arrays)?
>>>
>>> Then you have something that's not immediately usable (you have to do some
>>> processing to reconstitute the individual arrays).  One goal here is to
>>> minimize implementation costs for producers and consumers.  The
>>> assumption is a data model similar to the C++ ArrayData model; do we
>>> have implementations that use an entirely different model?  Perhaps I
>>> should take a look :-)
>>>
>>> Note that the draft I posted only concerns arrays.  We may also want to
>>> have a C struct for batches or tables.
>>>
>>> Regards
>>>
>>> Antoine.
>>>
>>>

 Thanks,
 Micah

 [1]

>>> https://stackoverflow.com/questions/57966032/how-does-apache-arrow-facilitate-no-overhead-for-cross-system-communication/57967220#57967220

 On Wed, Sep 18, 2019 at 10:52 AM Antoine Pitrou 
>>> wrote:

>
> Hello,
>
> One thing that was discussed in the sync call is the ability to easily
> pass arrays at runtime between Arrow implementations or Arrow-supporting
> libraries in the same process, without bearing the cost of linking to
> e.g. the C++ Arrow library.
>
> (for example: "Duckdb wants to provide an option to return Arrow data of
> result sets, but they don't like having Arrow as a dependency")
>
> One possibility would be to define a C-level protocol similar in spirit
> to the Python buffer protocol, which some people may be familiar with
>>> (*).
>
> The basic idea is to define a simple C struct, which is ABI-stable and
> describes an Arrow array adequately.  The struct can be stack-allocated.
> Its definition can also be copied in another project (or interfaced with
> using a C FFI layer, depending on the language).
>
> There is no formal proposal, this message is meant to stir the
>>> discussion.
>
> Issues to work out:
>
> * Memory lifetime issues: where Python simply associates the Py_buffer
> with a PyObject owner (a garbage-collected Python object), we need
> another means to control lifetime of pointed areas.  One simple
> possibility is to include a destructor function pointer in the protocol
> struct.
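
To make the destructor-callback idea concrete, here is a minimal sketch 
assuming the draft ArrowBuffer layout from this thread (the field names 
follow the draft and are not final):

{code}
#include <cstdint>
#include <cstdlib>

struct ArrowBuffer {
  void* data;
  int64_t nbytes;
  void (*release)(struct ArrowBuffer*);  // called by the consumer
  void* user_data;                       // opaque, for the producer
};

// Producer side: hand out malloc'ed memory with a matching release callback.
static void ReleaseMalloced(ArrowBuffer* buf) {
  std::free(buf->data);
  buf->data = nullptr;
  buf->release = nullptr;  // mark as released
}

void ExportBuffer(ArrowBuffer* out, int64_t nbytes) {
  out->data = std::malloc(static_cast<size_t>(nbytes));
  out->nbytes = nbytes;
  out->release = ReleaseMalloced;
  out->user_data = nullptr;
}

// Consumer side: use the data, then call release exactly once.
void ConsumeBuffer(ArrowBuffer* buf) {
  // ... read buf->data ...
  if (buf->release != nullptr) buf->release(buf);
}
{code}

This keeps ownership with the producer: however the memory was allocated, the 
matching release logic travels with the buffer.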

Re: [DISCUSS] C-level in-process array protocol

2019-09-19 Thread Antoine Pitrou


No, the plan for this proposal is to avoid providing a C API.  Each
Arrow implementation could produce and consume the C data protocol, for
example the C++ Array class could add these methods:

class Array {
  // ...

 public:
  // Export array to the C data protocol
  void Share(ArrowArray* out);
  // Import a C data protocol array
  static Status FromShared(ArrowArray* input,
   std::shared_ptr<Array>* out);
};
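
If methods like these were added, round-tripping an array could look like the 
following sketch (Share/FromShared are the draft names above, not a committed 
interface; producer_array stands in for any std::shared_ptr<arrow::Array>):

{code}
// Producer side: export into a stack-allocated protocol struct.
ArrowArray c_array;
producer_array->Share(&c_array);

// The plain ArrowArray* can now cross a library boundary without any
// C++ ABI concerns. Consumer side: re-import as a C++ arrow::Array.
std::shared_ptr<arrow::Array> imported;
arrow::Status st = arrow::Array::FromShared(&c_array, &imported);
{code}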

Also, I don't know why a C API exposed by the C++ library would solve
your problem.  You would still have a problem with bundling the .so,
symbol conflicts if several libraries load libarrow.so, etc.

Regards

Antoine.


Le 19/09/2019 à 18:21, Zhuo Peng a écrit :
> Hi Antoine,
> 
> I'm also interested in a stable ABI (previously I posted on this mailing
> list about the ABI issues I had [1]). Does having such an ABI-stable
> C-struct imply that there will be a set of C-APIs exposed by the Arrow
> (C++) library (which I think would lead to a solution to all the inherent
> ABI issues caused by C++)?
> 
> [1]
> https://lists.apache.org/thread.html/27b6e2a30cf93c5f5f78de970c68c7d996f538d94ab61431fa342f41@%3Cdev.arrow.apache.org%3E
> 
> On Thu, Sep 19, 2019 at 1:07 AM Antoine Pitrou  wrote:
> 
>>
>> Le 19/09/2019 à 09:39, Micah Kornfield a écrit :
>>> I like the idea of a stable ABI that can be used for in-process
>>> communication.  For instance, there was a recent question on
>>> stack-overflow on how to solve this [1].
>>>
>>> A couple of thoughts/questions:
>>> * Would ArrowArray also need a self reference for children arrays?
>>
>> Yes, I forgot that.  I also think we don't need a separate Buffer
>> struct, instead the Array struct should own all its buffers.
>>
>>> * Should transferring key-value metadata be in scope?
>>
>> Yes.  It could either be in the format string or a separate string.  The
>> upside of a separate string is that a consumer may ignore it trivially
>> if it doesn't need the information.
>>
>> Another open question is for nested types: does the format string
>> represent the entire type including children?  Or must child types be
>> read in the child arrays?  If we mimic ArrayData, then the format
>> string should represent the entire type; it will then be more complex to
>> parse.
>>
>> We should also make sure that extension types fit in the protocol.
>>
>>> * Should the API more closely align the IPC spec (pass a schema
>> separately
>>> and list of buffers instead of individual arrays)?
>>
>> Then you have something that's not immediately usable (you have to do some
>> processing to reconstitute the individual arrays).  One goal here is to
>> minimize implementation costs for producers and consumers.  The
>> assumption is a data model similar to the C++ ArrayData model; do we
>> have implementations that use an entirely different model?  Perhaps I
>> should take a look :-)
>>
>> Note that the draft I posted only concerns arrays.  We may also want to
>> have a C struct for batches or tables.
>>
>> Regards
>>
>> Antoine.
>>
>>
>>>
>>> Thanks,
>>> Micah
>>>
>>> [1]
>>>
>> https://stackoverflow.com/questions/57966032/how-does-apache-arrow-facilitate-no-overhead-for-cross-system-communication/57967220#57967220
>>>
>>> On Wed, Sep 18, 2019 at 10:52 AM Antoine Pitrou 
>> wrote:
>>>

 Hello,

 One thing that was discussed in the sync call is the ability to easily
 pass arrays at runtime between Arrow implementations or Arrow-supporting
 libraries in the same process, without bearing the cost of linking to
 e.g. the C++ Arrow library.

 (for example: "Duckdb wants to provide an option to return Arrow data of
 result sets, but they don't like having Arrow as a dependency")

 One possibility would be to define a C-level protocol similar in spirit
 to the Python buffer protocol, which some people may be familiar with
>> (*).

 The basic idea is to define a simple C struct, which is ABI-stable and
 describes an Arrow array adequately.  The struct can be stack-allocated.
 Its definition can also be copied in another project (or interfaced with
 using a C FFI layer, depending on the language).

 There is no formal proposal, this message is meant to stir the
>> discussion.

 Issues to work out:

 * Memory lifetime issues: where Python simply associates the Py_buffer
 with a PyObject owner (a garbage-collected Python object), we need
 another means to control lifetime of pointed areas.  One simple
 possibility is to include a destructor function pointer in the protocol
 struct.

 * Arrow type representation.  We probably need some kind of "format"
 mini-language to represent Arrow types, so that a type can be described
 using a `const char*`.  Ideally, primitive types at least should be
 trivially parsable.  We may take inspiration from Python here (`struct`
 module format characters, PEP 3118 format additions).

 

Re: [DISCUSS] C-level in-process array protocol

2019-09-19 Thread Uwe L. Korn
Hello,

I like this proposal as it will make interfacing inside a process between 
various Arrow implementations much easier. I'm a bit critical, though, of using 
a string as the format representation, as one needs to parse it correctly. 
Couldn't we use the enums we already have and reimplement them as C-defines 
instead?

Uwe

On Thu, Sep 19, 2019, at 6:21 PM, Zhuo Peng wrote:
> Hi Antoine,
> 
> I'm also interested in a stable ABI (previously I posted on this mailing
> list about the ABI issues I had [1]). Does having such an ABI-stable
> C-struct imply that there will be a set of C-APIs exposed by the Arrow
> (C++) library (which I think would lead to a solution to all the inherent
> ABI issues caused by C++)?
> 
> [1]
> https://lists.apache.org/thread.html/27b6e2a30cf93c5f5f78de970c68c7d996f538d94ab61431fa342f41@%3Cdev.arrow.apache.org%3E
> 
> On Thu, Sep 19, 2019 at 1:07 AM Antoine Pitrou  wrote:
> 
> >
> > Le 19/09/2019 à 09:39, Micah Kornfield a écrit :
> > > I like the idea of a stable ABI that can be used for in-process
> > > communication.  For instance, there was a recent question on
> > > stack-overflow on how to solve this [1].
> > >
> > > A couple of thoughts/questions:
> > > * Would ArrowArray also need a self reference for children arrays?
> >
> > Yes, I forgot that.  I also think we don't need a separate Buffer
> > struct, instead the Array struct should own all its buffers.
> >
> > > * Should transferring key-value metadata be in scope?
> >
> > Yes.  It could either be in the format string or a separate string.  The
> > upside of a separate string is that a consumer may ignore it trivially
> > if it doesn't need the information.
> >
> > Another open question is for nested types: does the format string
> > represent the entire type including children?  Or must child types be
> > read in the child arrays?  If we mimic ArrayData, then the format
> > string should represent the entire type; it will then be more complex to
> > parse.
> >
> > We should also make sure that extension types fit in the protocol.
> >
> > > * Should the API more closely align the IPC spec (pass a schema
> > separately
> > > and list of buffers instead of individual arrays)?
> >
> > Then you have something that's not immediately usable (you have to do some
> > processing to reconstitute the individual arrays).  One goal here is to
> > minimize implementation costs for producers and consumers.  The
> > assumption is a data model similar to the C++ ArrayData model; do we
> > have implementations that use an entirely different model?  Perhaps I
> > should take a look :-)
> >
> > Note that the draft I posted only concerns arrays.  We may also want to
> > have a C struct for batches or tables.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > >
> > > Thanks,
> > > Micah
> > >
> > > [1]
> > >
> > https://stackoverflow.com/questions/57966032/how-does-apache-arrow-facilitate-no-overhead-for-cross-system-communication/57967220#57967220
> > >
> > > On Wed, Sep 18, 2019 at 10:52 AM Antoine Pitrou 
> > wrote:
> > >
> > >>
> > >> Hello,
> > >>
> > >> One thing that was discussed in the sync call is the ability to easily
> > >> pass arrays at runtime between Arrow implementations or Arrow-supporting
> > >> libraries in the same process, without bearing the cost of linking to
> > >> e.g. the C++ Arrow library.
> > >>
> > >> (for example: "Duckdb wants to provide an option to return Arrow data of
> > >> result sets, but they don't like having Arrow as a dependency")
> > >>
> > >> One possibility would be to define a C-level protocol similar in spirit
> > >> to the Python buffer protocol, which some people may be familiar with
> > (*).
> > >>
> > >> The basic idea is to define a simple C struct, which is ABI-stable and
> > >> describes an Arrow array adequately.  The struct can be stack-allocated.
> > >> Its definition can also be copied in another project (or interfaced with
> > >> using a C FFI layer, depending on the language).
> > >>
> > >> There is no formal proposal, this message is meant to stir the
> > discussion.
> > >>
> > >> Issues to work out:
> > >>
> > >> * Memory lifetime issues: where Python simply associates the Py_buffer
> > >> with a PyObject owner (a garbage-collected Python object), we need
> > >> another means to control lifetime of pointed areas.  One simple
> > >> possibility is to include a destructor function pointer in the protocol
> > >> struct.
> > >>
> > >> * Arrow type representation.  We probably need some kind of "format"
> > >> mini-language to represent Arrow types, so that a type can be described
> > >> using a `const char*`.  Ideally, primitive types at least should be
> > >> trivially parsable.  We may take inspiration from Python here (`struct`
> > >> module format characters, PEP 3118 format additions).
> > >>
> > >> Example C struct definition (not a formal proposal!):
> > >>
> > >> struct ArrowBuffer {
> > >>   void* data;
> > >>   int64_t nbytes;
> > >>   // Called by the consumer when it 

[jira] [Created] (ARROW-6621) [Rust][DataFusion] Examples for DataFusion are not executed in CI

2019-09-19 Thread Paddy Horan (Jira)
Paddy Horan created ARROW-6621:
--

 Summary: [Rust][DataFusion] Examples for DataFusion are not 
executed in CI
 Key: ARROW-6621
 URL: https://issues.apache.org/jira/browse/ARROW-6621
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust, Rust - DataFusion
Affects Versions: 0.14.1
Reporter: Paddy Horan


See the CI scripts; we already test the examples for the Arrow sub-crate.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [DISCUSS] C-level in-process array protocol

2019-09-19 Thread Zhuo Peng
Hi Antoine,

I'm also interested in a stable ABI (previously I posted on this mailing
list about the ABI issues I had [1]). Does having such an ABI-stable
C-struct imply that there will be a set of C-APIs exposed by the Arrow
(C++) library (which I think would lead to a solution to all the inherent
ABI issues caused by C++)?

[1]
https://lists.apache.org/thread.html/27b6e2a30cf93c5f5f78de970c68c7d996f538d94ab61431fa342f41@%3Cdev.arrow.apache.org%3E

On Thu, Sep 19, 2019 at 1:07 AM Antoine Pitrou  wrote:

>
> Le 19/09/2019 à 09:39, Micah Kornfield a écrit :
> > I like the idea of a stable ABI that can be used for in-process
> > communication.  For instance, there was a recent question on
> > stack-overflow on how to solve this [1].
> >
> > A couple of thoughts/questions:
> > * Would ArrowArray also need a self reference for children arrays?
>
> Yes, I forgot that.  I also think we don't need a separate Buffer
> struct, instead the Array struct should own all its buffers.
>
> > * Should transferring key-value metadata be in scope?
>
> Yes.  It could either be in the format string or a separate string.  The
> upside of a separate string is that a consumer may ignore it trivially
> if it doesn't need the information.
>
> Another open question is for nested types: does the format string
> represent the entire type including children?  Or must child types be
> read in the child arrays?  If we mimic ArrayData, then the format
> string should represent the entire type; it will then be more complex to
> parse.
>
> We should also make sure that extension types fit in the protocol.
>
> > * Should the API more closely align the IPC spec (pass a schema
> separately
> > and list of buffers instead of individual arrays)?
>
> Then you have something that's not immediately usable (you have to do some
> processing to reconstitute the individual arrays).  One goal here is to
> minimize implementation costs for producers and consumers.  The
> assumption is a data model similar to the C++ ArrayData model; do we
> have implementations that use an entirely different model?  Perhaps I
> should take a look :-)
>
> Note that the draft I posted only concerns arrays.  We may also want to
> have a C struct for batches or tables.
>
> Regards
>
> Antoine.
>
>
> >
> > Thanks,
> > Micah
> >
> > [1]
> >
> https://stackoverflow.com/questions/57966032/how-does-apache-arrow-facilitate-no-overhead-for-cross-system-communication/57967220#57967220
> >
> > On Wed, Sep 18, 2019 at 10:52 AM Antoine Pitrou 
> wrote:
> >
> >>
> >> Hello,
> >>
> >> One thing that was discussed in the sync call is the ability to easily
> >> pass arrays at runtime between Arrow implementations or Arrow-supporting
> >> libraries in the same process, without bearing the cost of linking to
> >> e.g. the C++ Arrow library.
> >>
> >> (for example: "Duckdb wants to provide an option to return Arrow data of
> >> result sets, but they don't like having Arrow as a dependency")
> >>
> >> One possibility would be to define a C-level protocol similar in spirit
> >> to the Python buffer protocol, which some people may be familiar with
> (*).
> >>
> >> The basic idea is to define a simple C struct, which is ABI-stable and
> >> describes an Arrow array adequately.  The struct can be stack-allocated.
> >> Its definition can also be copied in another project (or interfaced with
> >> using a C FFI layer, depending on the language).
> >>
> >> There is no formal proposal, this message is meant to stir the
> discussion.
> >>
> >> Issues to work out:
> >>
> >> * Memory lifetime issues: where Python simply associates the Py_buffer
> >> with a PyObject owner (a garbage-collected Python object), we need
> >> another means to control lifetime of pointed areas.  One simple
> >> possibility is to include a destructor function pointer in the protocol
> >> struct.
> >>
> >> * Arrow type representation.  We probably need some kind of "format"
> >> mini-language to represent Arrow types, so that a type can be described
> >> using a `const char*`.  Ideally, primitive types at least should be
> >> trivially parsable.  We may take inspiration from Python here (`struct`
> >> module format characters, PEP 3118 format additions).
> >>
> >> Example C struct definition (not a formal proposal!):
> >>
> >> struct ArrowBuffer {
> >>   void* data;
> >>   int64_t nbytes;
> >>   // Called by the consumer when it doesn't need the buffer anymore
> >>   void (*release)(struct ArrowBuffer*);
> >>   // Opaque user data (for e.g. the release callback)
> >>   void* user_data;
> >> };
> >>
> >> struct ArrowArray {
> >>   // Type description
> >>   const char* format;
> >>   // Data description
> >>   int64_t length;
> >>   int64_t null_count;
> >>   int64_t n_buffers;
> >>   // Note: this pointers are probably owned by the ArrowArray struct
> >>   // and will be released and free()ed by the release callback.
> >>   struct BufferDescriptor* buffers;
> >>   struct ArrowDescriptor* dictionary;
> >>   // Called by 

Re: [DISCUSS] Improving Arrow columnar implementation guidelines for third parties

2019-09-19 Thread Antoine Pitrou


Le 19/09/2019 à 17:33, Wes McKinney a écrit :
> On Thu, Sep 19, 2019 at 2:01 AM Micah Kornfield  wrote:
>>
>> Wes,
>> Let me see if I understand; I think there are two issues:
>> 1.  Ensuring conformance and interoperability, and actually having people
>> understand what Arrow actually is and what it is not.
>> 2.  Having users adopt reference implementations and surrounding libraries.
>>
>> For 1, I agree we should have a way of measuring things here.  I think
>> being able to document the requirements of our test-suite and have it
>> generate a report on features supported would go a long way to letting
>> users understand the quality of both internal/external implementations.  It
>> seems like there is still a lot of misunderstanding of what Arrow is and
>> how it relates to other technologies.  An example of this is a recent Julia
>> thread [1], which seems to have both some misinformed commentary and
>> potentially some points that we could improve upon as a community.
>> Hopefully, some of this will be helped by separately versioning the
>> specification and the libraries post 1.0.0.
> 
> Thanks for the pointer to the thread. I've been trying for a couple of
> years to engage with the Julia community.
> 
> The bottom line is that I think it's important to highlight that
> compatibility or interoperability will not be achieved by hand-waving.
> There's a couple of things we can do

What we discussed in the sync call is that by providing a C-level data
protocol (see discussion thread), we can allow any runtime with a C FFI
facility to easily experiment and interface with Arrow data (as a
producer and/or as a consumer).

This would have a reasonable implementation cost for us and hopefully
also for users of this data protocol.  Also, it is effectively a
zero-dependency solution, since the C struct definition can be pasted in
the target project's source code (or translated in the preferred local
form, e.g. ctypes definitions in Python).
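
To make the zero-dependency point concrete: a consuming project could paste 
only the struct declaration and expose a plain C entry point, roughly like 
the sketch below (the fields follow the draft in the protocol thread and may 
well change):

{code}
#include <cstdint>

extern "C" {

// Pasted locally; no Arrow headers or libraries are required.
struct ArrowArray {
  const char* format;                   // type description
  int64_t length;
  int64_t null_count;
  int64_t n_buffers;
  struct ArrowBuffer* buffers;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);  // lifetime callback
  void* user_data;
};

// Entry point callable from any runtime with a C FFI.
int64_t arrow_array_length(struct ArrowArray* a) { return a->length; }

}  // extern "C"
{code}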

C FFIs do not always have the best performance (depending on the
impedance mismatch between static C data and the runtime's own data
model), but that would still be a good starting point, and in many cases
it might be good enough.

Regards

Antoine.


Re: Build issues on macOS [newbie]

2019-09-19 Thread Uwe L. Korn
Hello Tarek,

this error message is normally the one you get when CONDA_BUILD_SYSROOT doesn't 
point to your 10.9 SDK. Please delete your build folder again and do `export 
CONDA_BUILD_SYSROOT=..` immediately before running cmake. Running e.g. a conda 
install will sadly reset this variable to something different and break the 
build.

As a side note: it looks like conda-forge will get rid of the SDK requirement 
in 1-2 months; then this will be a bit simpler.

Cheers
Uwe

On Thu, Sep 19, 2019, at 5:24 PM, Tarek Allam Jr. wrote:
> 
> Hi all,
> 
> Firstly I must apologise if what I put here is extremely trivial, but I am a
> complete newcomer to the Apache Arrow project and contributing to Apache in
> general, but I am very keen to get involved.
> 
> I'm hoping to help where I can so I recently attempted to complete a build
> following the instructions laid out in the 'Python Development' section of the
> documentation here:
> 
> After completing the steps that specifically use Conda I was able to create 
> an
> environment but when it comes to building I am unable to do so.
> 
> I am on macOS -- 10.14.6 and as outlined in the docs and here 
> (https://stackoverflow.com/a/55798942/4521950) I used 10.9.sdk 
> instead
> of the latest. I have both added this manually using ccmake and also 
> defined it
> like so:
> 
> cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
>   -DCMAKE_INSTALL_LIBDIR=lib \
>   -DARROW_FLIGHT=ON \
>   -DARROW_GANDIVA=ON \
>   -DARROW_ORC=ON \
>   -DARROW_PARQUET=ON \
>   -DARROW_PYTHON=ON \
>   -DARROW_PLASMA=ON \
>   -DARROW_BUILD_TESTS=ON \
>   -DCONDA_BUILD_SYSROOT=/opt/MacOSX10.9.sdk \
>   -DARROW_DEPENDENCY_SOURCE=AUTO \
>   ..
> 
> But it seems that whatever I try, I get errors; the main one tripping
> me up at the moment is:
> 
> -- Building using CMake version: 3.15.3
> -- The C compiler identification is Clang 4.0.1
> -- The CXX compiler identification is Clang 4.0.1
> -- Check for working C compiler: 
> /usr/local/anaconda3/envs/pyarrow-dev/bin/clang
> -- Check for working C compiler: 
> /usr/local/anaconda3/envs/pyarrow-dev/bin/clang -- broken
> CMake Error at 
> /usr/local/anaconda3/envs/pyarrow-dev/share/cmake-3.15/Modules/CMakeTestCCompiler.cmake:60
>  (message):
>   The C compiler
> 
> "/usr/local/anaconda3/envs/pyarrow-dev/bin/clang"
> 
>   is not able to compile a simple test program.
> 
>   It fails with the following output:
> 
> Change Dir: /Users/tallamjr/Github/arrow/cpp/build/CMakeFiles/CMakeTmp
> 
> Run Build Command(s):/usr/local/bin/gmake cmTC_b252c/fast && 
> /usr/local/bin/gmake -f CMakeFiles/cmTC_b252c.dir/build.make 
> CMakeFiles/cmTC_b252c.dir/build
> gmake[1]: Entering directory 
> '/Users/tallamjr/Github/arrow/cpp/build/CMakeFiles/CMakeTmp'
> Building C object CMakeFiles/cmTC_b252c.dir/testCCompiler.c.o
> /usr/local/anaconda3/envs/pyarrow-dev/bin/clang   -march=core2 
> -mtune=haswell -mssse3 -ftree-vectorize -fPIC -fPIE 
> -fstack-protector-strong -O2 -pipe  -isysroot /opt/MacOSX10.9.sdk   -o 
> CMakeFiles/cmTC_b252c.dir/testCCompiler.c.o   -c 
> /Users/tallamjr/Github/arrow/cpp/build/CMakeFiles/CMakeTmp/testCCompiler.c
> Linking C executable cmTC_b252c
> /usr/local/anaconda3/envs/pyarrow-dev/bin/cmake -E 
> cmake_link_script CMakeFiles/cmTC_b252c.dir/link.txt --verbose=1
> /usr/local/anaconda3/envs/pyarrow-dev/bin/clang -march=core2 
> -mtune=haswell -mssse3 -ftree-vectorize -fPIC -fPIE 
> -fstack-protector-strong -O2 -pipe  -isysroot /opt/MacOSX10.9.sdk 
> -Wl,-search_paths_first -Wl,-headerpad_max_install_names -Wl,-pie 
> -Wl,-headerpad_max_install_names -Wl,-dead_strip_dylibs  
> CMakeFiles/cmTC_b252c.dir/testCCompiler.c.o  -o cmTC_b252c
> ld: warning: ignoring file 
> /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/usr/lib/libSystem.tbd,
>  file was built for unsupported file format ( 0x2D 0x2D 0x2D 0x20 0x21 0x74 
> 0x61 0x70 0x69 0x2D 0x74 0x62 0x64 0x2D 0x76 0x33 ) which is not the 
> architecture being linked (x86_64): 
> /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/usr/lib/libSystem.tbd
> ld: dynamic main executables must link with libSystem.dylib for 
> architecture x86_64
> clang-4.0: error: linker command failed with exit code 1 (use -v to 
> see invocation)
> gmake[1]: *** [CMakeFiles/cmTC_b252c.dir/build.make:87: cmTC_b252c] 
> Error 1
> gmake[1]: Leaving directory 
> '/Users/tallamjr/Github/arrow/cpp/build/CMakeFiles/CMakeTmp'
> gmake: *** [Makefile:121: cmTC_b252c/fast] Error 2
> 
> 
>   CMake will not be able to correctly generate this project.
> Call Stack (most recent call first):
>   CMakeLists.txt:32 (project)
> 
> -- Configuring incomplete, errors occurred!
> See also "/Users/tallamjr/Github/arrow/cpp/build/CMakeFiles/CMakeOutput.log".
> See also 

Re: Build issues on macOS [newbie]

2019-09-19 Thread Neal Richardson
Others have had different experiences, but I ultimately had better luck
using Homebrew for system dependencies rather than Conda.

You might also want to start with more cmake flags OFF, just to keep
things simpler while you work out the basic environment. Though your
current error is happening before any of that would matter.

Neal


On Thu, Sep 19, 2019 at 8:39 AM Tarek Allam Jr.  wrote:
>
>
> Hi all,
>
> Firstly I must apologise if what I put here is extremely trivial, but I am a
> complete newcomer to the Apache Arrow project and contributing to Apache in
> general, but I am very keen to get involved.
>
> I'm hoping to help where I can so I recently attempted to complete a build
> following the instructions laid out in the 'Python Development' section of the
> documentation here:
>
> After completing the steps that specifically use Conda I was able to create 
> an
> environment but when it comes to building I am unable to do so.
>
> I am on macOS -- 10.14.6 and as outlined in the docs and here 
> (https://stackoverflow.com/a/55798942/4521950) I used 10.9.sdk instead
> of the latest. I have both added this manually using ccmake and also defined 
> it
> like so:
>
> cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
>   -DCMAKE_INSTALL_LIBDIR=lib \
>   -DARROW_FLIGHT=ON \
>   -DARROW_GANDIVA=ON \
>   -DARROW_ORC=ON \
>   -DARROW_PARQUET=ON \
>   -DARROW_PYTHON=ON \
>   -DARROW_PLASMA=ON \
>   -DARROW_BUILD_TESTS=ON \
>   -DCONDA_BUILD_SYSROOT=/opt/MacOSX10.9.sdk \
>   -DARROW_DEPENDENCY_SOURCE=AUTO \
>   ..
>
> But it seems that whatever I try, I get errors; the main one tripping
> me up at the moment is:
>
> -- Building using CMake version: 3.15.3
> -- The C compiler identification is Clang 4.0.1
> -- The CXX compiler identification is Clang 4.0.1
> -- Check for working C compiler: 
> /usr/local/anaconda3/envs/pyarrow-dev/bin/clang
> -- Check for working C compiler: 
> /usr/local/anaconda3/envs/pyarrow-dev/bin/clang -- broken
> CMake Error at 
> /usr/local/anaconda3/envs/pyarrow-dev/share/cmake-3.15/Modules/CMakeTestCCompiler.cmake:60
>  (message):
>   The C compiler
>
> "/usr/local/anaconda3/envs/pyarrow-dev/bin/clang"
>
>   is not able to compile a simple test program.
>
>   It fails with the following output:
>
> Change Dir: /Users/tallamjr/Github/arrow/cpp/build/CMakeFiles/CMakeTmp
>
> Run Build Command(s):/usr/local/bin/gmake cmTC_b252c/fast && 
> /usr/local/bin/gmake -f CMakeFiles/cmTC_b252c.dir/build.make 
> CMakeFiles/cmTC_b252c.dir/build
> gmake[1]: Entering directory 
> '/Users/tallamjr/Github/arrow/cpp/build/CMakeFiles/CMakeTmp'
> Building C object CMakeFiles/cmTC_b252c.dir/testCCompiler.c.o
> /usr/local/anaconda3/envs/pyarrow-dev/bin/clang   -march=core2 
> -mtune=haswell -mssse3 -ftree-vectorize -fPIC -fPIE -fstack-protector-strong 
> -O2 -pipe  -isysroot /opt/MacOSX10.9.sdk   -o 
> CMakeFiles/cmTC_b252c.dir/testCCompiler.c.o   -c 
> /Users/tallamjr/Github/arrow/cpp/build/CMakeFiles/CMakeTmp/testCCompiler.c
> Linking C executable cmTC_b252c
> /usr/local/anaconda3/envs/pyarrow-dev/bin/cmake -E cmake_link_script 
> CMakeFiles/cmTC_b252c.dir/link.txt --verbose=1
> /usr/local/anaconda3/envs/pyarrow-dev/bin/clang -march=core2 
> -mtune=haswell -mssse3 -ftree-vectorize -fPIC -fPIE -fstack-protector-strong 
> -O2 -pipe  -isysroot /opt/MacOSX10.9.sdk -Wl,-search_paths_first 
> -Wl,-headerpad_max_install_names -Wl,-pie -Wl,-headerpad_max_install_names 
> -Wl,-dead_strip_dylibs  CMakeFiles/cmTC_b252c.dir/testCCompiler.c.o  -o 
> cmTC_b252c
> ld: warning: ignoring file 
> /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/usr/lib/libSystem.tbd,
>  file was built for unsupported file format ( 0x2D 0x2D 0x2D 0x20 0x21 0x74 
> 0x61 0x70 0x69 0x2D 0x74 0x62 0x64 0x2D 0x76 0x33 ) which is not the 
> architecture being linked (x86_64): 
> /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/usr/lib/libSystem.tbd
> ld: dynamic main executables must link with libSystem.dylib for 
> architecture x86_64
> clang-4.0: error: linker command failed with exit code 1 (use -v to see 
> invocation)
> gmake[1]: *** [CMakeFiles/cmTC_b252c.dir/build.make:87: cmTC_b252c] Error 
> 1
> gmake[1]: Leaving directory 
> '/Users/tallamjr/Github/arrow/cpp/build/CMakeFiles/CMakeTmp'
> gmake: *** [Makefile:121: cmTC_b252c/fast] Error 2
>
>
>   CMake will not be able to correctly generate this project.
> Call Stack (most recent call first):
>   CMakeLists.txt:32 (project)
>
> -- Configuring incomplete, errors occurred!
> See also "/Users/tallamjr/Github/arrow/cpp/build/CMakeFiles/CMakeOutput.log".
> See also "/Users/tallamjr/Github/arrow/cpp/build/CMakeFiles/CMakeError.log".
>
> Does anyone have any insight as to what might be happening and causing this to
> fail? I notice that even though I set CONDA_BUILD_SYSROOT to
> 

Re: Timeline for 0.15.0 release

2019-09-19 Thread Wes McKinney
On Thu, Sep 19, 2019 at 12:13 AM Micah Kornfield  wrote:
>>
>> The process should be well documented at this point but there are a
>> number of steps.
>
> Is [1] the up-to-date documentation for the release?  Are there instructions 
> for adding the code signing key to SVN?
>
> I will make a go of it.  I will try to mitigate any internet issues by doing 
> the process from a cloud instance (I assume that isn't a problem?).
>

Setting up a new cloud environment suitable for producing an RC may be
time consuming, but you are welcome to try. Krisztian -- are you
available next week to help Micah and potentially take over producing
the RC if there are issues?

> Thanks,
> Micah
>
> [1] https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide
>
> On Wed, Sep 18, 2019 at 8:29 AM Wes McKinney  wrote:
>>
>> The process should be well documented at this point but there are a
>> number of steps. Note that you need to add your code signing key to
>> the KEYS file in SVN (that's not very hard to do). I think it's fine
>> to hand off the process to others after the VOTE but it would be
>> tricky to have multiple RMs involved with producing the source and
>> binary artifacts for the vote
>>
>> On Tue, Sep 17, 2019 at 10:55 PM Micah Kornfield  
>> wrote:
>> >
>> > SGTM, as well.
>> >
>> > I should have a little bit of time next week if I can help as RM but I have
>> > a couple of concerns:
>> > 1.  In the past I've had trouble downloading and validating releases. I'm a
>> > bit worried that I might have similar problems doing the necessary 
>> > uploads.
>> > 2.  My internet connection will likely be not great, I don't know if this
>> > would make it even less likely to be successful.
>> >
>> > Does it become problematic if somehow I would have to abandon the process
>> > mid-release?  Is there anyone who could serve as a backup?  Are the steps
>> > well documented?
>> >
>> > Thanks,
>> > Micah
>> >
>> > On Tue, Sep 17, 2019 at 4:25 PM Neal Richardson 
>> > 
>> > wrote:
>> >
>> > > Sounds good to me.
>> > >
>> > > Do we have a release manager yet? Any volunteers?
>> > >
>> > > Neal
>> > >
>> > > On Tue, Sep 17, 2019 at 4:06 PM Wes McKinney  wrote:
>> > >
>> > > > hi all,
>> > > >
>> > > > It looks like we're drawing close to be able to make the 0.15.0
>> > > > release. I would suggest "pencils down" at the end of this week and
>> > > > see if a release candidate can be produced next Monday September 23.
>> > > > Any thoughts or objections?
>> > > >
>> > > > Thanks,
>> > > > Wes
>> > > >
>> > > > On Wed, Sep 11, 2019 at 11:23 AM Wes McKinney 
>> > > wrote:
>> > > > >
>> > > > > hi Eric -- yes, that's correct. I'm planning to amend the Format docs
>> > > > > today regarding the EOS issue and also update the C++ library
>> > > > >
>> > > > > On Wed, Sep 11, 2019 at 11:21 AM Eric Erhardt
>> > > > >  wrote:
>> > > > > >
>> > > > > > I assume the plan is to merge the ARROW-6313-flatbuffer-alignment
>> > > > branch into master before the 0.15 release, correct?
>> > > > > >
>> > > > > > BTW - I believe the C# alignment changes are ready to be merged 
>> > > > > > into
>> > > > the alignment branch -  https://github.com/apache/arrow/pull/5280/
>> > > > > >
>> > > > > > Eric
>> > > > > >
>> > > > > > -Original Message-
>> > > > > > From: Micah Kornfield 
>> > > > > > Sent: Tuesday, September 10, 2019 10:24 PM
>> > > > > > To: Wes McKinney 
>> > > > > > Cc: dev ; niki.lj 
>> > > > > > Subject: Re: Timeline for 0.15.0 release
>> > > > > >
>> > > > > > I should have a little more bandwidth to help with some of the
>> > > > packaging starting tomorrow and going into the weekend.
>> > > > > >
>> > > > > > On Tuesday, September 10, 2019, Wes McKinney 
>> > > > wrote:
>> > > > > >
>> > > > > > > Hi folks,
>> > > > > > >
>> > > > > > > With the state of nightly packaging and integration builds things
>> > > > > > > aren't looking too good for being in release readiness by the end
>> > > of
>> > > > > > > this week but maybe I'm wrong. I'm planning to be working to 
>> > > > > > > close
>> > > as
>> > > > > > > many issues as I can and also to help with the ongoing alignment
>> > > > fixes.
>> > > > > > >
>> > > > > > > Wes
>> > > > > > >
>> > > > > > > On Thu, Sep 5, 2019, 11:07 PM Micah Kornfield <
>> > > emkornfi...@gmail.com
>> > > > >
>> > > > > > > wrote:
>> > > > > > >
>> > > > > > >> Just for reference [1] has a dashboard of the current issues:
>> > > > > > >>
>> > > > > > >>
>> > > > > > >> https://cwiki.apache.org/confluence/display/ARROW/Arrow+0.15.0+Release
>> > > > > > >>
>> > > > > > >> On Thu, Sep 5, 2019 

[jira] [Created] (ARROW-6620) [Python][CI] pandas-master build failing due to removal of "to_sparse" method

2019-09-19 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6620:
---

 Summary: [Python][CI] pandas-master build failing due to removal 
of "to_sparse" method
 Key: ARROW-6620
 URL: https://issues.apache.org/jira/browse/ARROW-6620
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Wes McKinney
 Fix For: 0.15.0


See nightly build failure

https://circleci.com/gh/ursa-labs/crossbow/3046?utm_campaign=vcs-integration-link_medium=referral_source=github-build-link



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [DISCUSS] Improving Arrow columnar implementation guidelines for third parties

2019-09-19 Thread Wes McKinney
On Thu, Sep 19, 2019 at 2:01 AM Micah Kornfield  wrote:
>
> Wes,
> Let me see if I understand; I think there are two issues:
> 1.  Ensuring conformance and interoperability, and actually having people
> understand what Arrow actually is and what it is not.
> 2.  Having users adopt reference implementations and surrounding libraries.
>
> For 1, I agree we should have a way of measuring things here.  I think
> being able to document the requirements of our test-suite and have it
> generate a report on features supported would go a long way to letting
> users understand the quality of both internal/external implementations.  It
> seems like there is still a lot of misunderstanding of what Arrow is and
> how it relates to other technologies.  An example of this is a recent Julia
> thread [1], which seems to have both some misinformed commentary and
> potentially some points that we could improve upon as a community.
> Hopefully, some of this will be helped by separately versioning the
> specification and the libraries post 1.0.0.

Thanks for the pointer to the thread. I've been trying for a couple of
years to engage with the Julia community.

The bottom line is that I think it's important to highlight that
compatibility or interoperability will not be achieved by hand-waving.
There are a couple of things we can do:

* In our "implementation guidelines", indicate the procedure for third
party implementations to validate themselves against the reference
implementations
* Recommend that third party implementations advertise their feature
coverage and degree of compatibility/integration testing

> For 2, I agree that having people adopt our code (and hopefully contribute
> back) is the ideal situation.  I think there are likely a few challenges
> here:
> *  How amenable are existing libraries to embedding. Wes, you've started
> other threads on how to make this adoption easier on the C++ side.

Yes, I think unfortunately some people have gotten the mistaken
impression that the complexity involved with creating and deploying a
comprehensive build of everything we have in the project is being
foisted onto each developer or user of the project, no matter how
limited their use of the Arrow format. If that becomes the rationale
for creating third-party implementations, that is very sad indeed.

This can be corrected by better developer documentation to clarify the
different "routes"

* Minimal builds, for people who just want to use the columnar format
and protocol. I think having a "zero dependency" out of the box C++
build would help address this (we can discuss this more in the
separate thread). The current out of the box experience may be a bit
off-putting to some users because a number of optional components are
being built by default, see https://github.com/apache/arrow/pull/5431
* Comprehensive builds, for people who are contributing to the Apache
project and need to be able to build everything

I think we have invested our time in documenting the latter at the
expense of the former.

> * How much of a value proposition there is in the reference libraries.
> Arrow has seen good adoption in Python due to its support for Parquet and
> Feather.  I assume as the dataset and other projects get flushed out this
> will lead to further adoption.  Conformance to the specification is a
> feature as well, but I would guess its less important to many of the end
> users of pyarrow who see it as a way of integrating with other non-arrow
> technologies.

We're at a significant documentation and communication deficit
relative to the development we've completed in the project. As much as
I'd like to push personally on new features, I'm going to make time to
write more documentation and blog posts to help communicate the value
of the work we've done.

> * Technical limitations of the specification (for example some processing
> engines do need alternative encodings like RLE).

I agree, and so I think it's important that we pursue the "encoded
record batch" proposal and get something codified in the near future.
Once the 0.15.0 release is behind us I hope to take a closer look and
help drive that forward.

> Am I understanding your points?
>
> Thanks,
> Micah
>
> [1] https://discourse.julialang.org/t/arrow-feather-and-parquet/28739
>
> On Tue, Sep 17, 2019 at 6:00 PM Wes McKinney  wrote:
>
> > On Tue, Sep 17, 2019 at 7:09 PM Jacques Nadeau  wrote:
> > >
> > > >
> > > > Let's take an example:
> > > >
> > > > * Dremio can execute SQL and uses Arrow as its native runtime format
> > > > * Apache Spark can execute SQL and offers UDF support with Arrow
> > > > format, i.e. so using Arrow for IO
> > > >
> > > > Both of these projects can say that they "use Apache Arrow", but the
> > > > extent to which Arrow is a key ingredient may not be obvious to the
> > > > average onlooker. To have more "Arrow-native" systems seems like one
> > > > of the missions of the project.
> > > >
> > >
> > > I'm not following you here. Are you suggesting that 

Build issues on macOS [newbie]

2019-09-19 Thread Tarek Allam Jr .


Hi all,

Firstly I must apologise if what I put here is extremely trivial, but I am a
complete newcomer to the Apache Arrow project and contributing to Apache in
general, but I am very keen to get involved.

I'm hoping to help where I can so I recently attempted to complete a build
following the instructions laid out in the 'Python Development' section of the
documentation here:

After completing the steps that specifically use Conda I was able to create an
environment but when it comes to building I am unable to do so.

I am on macOS -- 10.14.6 and as outlined in the docs and here 
(https://stackoverflow.com/a/55798942/4521950) I used 10.9.sdk instead
of the latest. I have both added this manually using ccmake and also defined it
like so:

cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
  -DCMAKE_INSTALL_LIBDIR=lib \
  -DARROW_FLIGHT=ON \
  -DARROW_GANDIVA=ON \
  -DARROW_ORC=ON \
  -DARROW_PARQUET=ON \
  -DARROW_PYTHON=ON \
  -DARROW_PLASMA=ON \
  -DARROW_BUILD_TESTS=ON \
  -DCONDA_BUILD_SYSROOT=/opt/MacOSX10.9.sdk \
  -DARROW_DEPENDENCY_SOURCE=AUTO \
  ..

But it seems that whatever I try, I get errors; the main one tripping
me up at the moment is:

-- Building using CMake version: 3.15.3
-- The C compiler identification is Clang 4.0.1
-- The CXX compiler identification is Clang 4.0.1
-- Check for working C compiler: /usr/local/anaconda3/envs/pyarrow-dev/bin/clang
-- Check for working C compiler: 
/usr/local/anaconda3/envs/pyarrow-dev/bin/clang -- broken
CMake Error at 
/usr/local/anaconda3/envs/pyarrow-dev/share/cmake-3.15/Modules/CMakeTestCCompiler.cmake:60
 (message):
  The C compiler

"/usr/local/anaconda3/envs/pyarrow-dev/bin/clang"

  is not able to compile a simple test program.

  It fails with the following output:

Change Dir: /Users/tallamjr/Github/arrow/cpp/build/CMakeFiles/CMakeTmp

Run Build Command(s):/usr/local/bin/gmake cmTC_b252c/fast && 
/usr/local/bin/gmake -f CMakeFiles/cmTC_b252c.dir/build.make 
CMakeFiles/cmTC_b252c.dir/build
gmake[1]: Entering directory 
'/Users/tallamjr/Github/arrow/cpp/build/CMakeFiles/CMakeTmp'
Building C object CMakeFiles/cmTC_b252c.dir/testCCompiler.c.o
/usr/local/anaconda3/envs/pyarrow-dev/bin/clang   -march=core2 
-mtune=haswell -mssse3 -ftree-vectorize -fPIC -fPIE -fstack-protector-strong 
-O2 -pipe  -isysroot /opt/MacOSX10.9.sdk   -o 
CMakeFiles/cmTC_b252c.dir/testCCompiler.c.o   -c 
/Users/tallamjr/Github/arrow/cpp/build/CMakeFiles/CMakeTmp/testCCompiler.c
Linking C executable cmTC_b252c
/usr/local/anaconda3/envs/pyarrow-dev/bin/cmake -E cmake_link_script 
CMakeFiles/cmTC_b252c.dir/link.txt --verbose=1
/usr/local/anaconda3/envs/pyarrow-dev/bin/clang -march=core2 -mtune=haswell 
-mssse3 -ftree-vectorize -fPIC -fPIE -fstack-protector-strong -O2 -pipe  
-isysroot /opt/MacOSX10.9.sdk -Wl,-search_paths_first 
-Wl,-headerpad_max_install_names -Wl,-pie -Wl,-headerpad_max_install_names 
-Wl,-dead_strip_dylibs  CMakeFiles/cmTC_b252c.dir/testCCompiler.c.o  -o 
cmTC_b252c
ld: warning: ignoring file 
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/usr/lib/libSystem.tbd,
 file was built for unsupported file format ( 0x2D 0x2D 0x2D 0x20 0x21 0x74 
0x61 0x70 0x69 0x2D 0x74 0x62 0x64 0x2D 0x76 0x33 ) which is not the 
architecture being linked (x86_64): 
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/usr/lib/libSystem.tbd
ld: dynamic main executables must link with libSystem.dylib for 
architecture x86_64
clang-4.0: error: linker command failed with exit code 1 (use -v to see 
invocation)
gmake[1]: *** [CMakeFiles/cmTC_b252c.dir/build.make:87: cmTC_b252c] Error 1
gmake[1]: Leaving directory 
'/Users/tallamjr/Github/arrow/cpp/build/CMakeFiles/CMakeTmp'
gmake: *** [Makefile:121: cmTC_b252c/fast] Error 2


  CMake will not be able to correctly generate this project.
Call Stack (most recent call first):
  CMakeLists.txt:32 (project)

-- Configuring incomplete, errors occurred!
See also "/Users/tallamjr/Github/arrow/cpp/build/CMakeFiles/CMakeOutput.log".
See also "/Users/tallamjr/Github/arrow/cpp/build/CMakeFiles/CMakeError.log".

Does anyone have any insight as to what might be happening and causing this to
fail? I notice that even though I set CONDA_BUILD_SYSROOT to
/opt/MacOSX10.9.sdk, ld is still looking in MacOSX10.14.sdk, which I assume
is not right.

I have tried to compare steps with ones outlined in
https://lists.apache.org/list.html?dev@arrow.apache.org:2019-8 and in other
corners of the internet but I feel very stuck at the moment. 

Any help would be greatly appreciated! Thank you



[jira] [Created] (ARROW-6619) [Ruby] Add support for building Gandiva::Expression by Arrow::Schema#build_expression

2019-09-19 Thread Yosuke Shiro (Jira)
Yosuke Shiro created ARROW-6619:
---

 Summary: [Ruby] Add support for building Gandiva::Expression by 
Arrow::Schema#build_expression
 Key: ARROW-6619
 URL: https://issues.apache.org/jira/browse/ARROW-6619
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Ruby
Reporter: Yosuke Shiro
Assignee: Yosuke Shiro


This is the first attempt to make the Red Gandiva API better.
This adds Arrow::Schema#build_expression, which aims to build 
Gandiva::Expression with FunctionNode or IfNode easily.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Arrow sync call September 19 at 12:00 US/Eastern, 16:00 UTC

2019-09-19 Thread Neal Richardson
A little late but here are the notes from the call.

Attendees:
Ben Kietzman
Micah Kornfield
Rok Mihevc
Prudhvi Porandla
Antoine Pitrou
Neal Richardson
François Saint-Jacques

Discussion:
* Arrow compatibility branding and minimal packaging discussion
threads: talked about problems and possible solutions, will follow up
on ML (some of this has already started)
* 0.15 release: reviewed status and outstanding issues

On Wed, Sep 18, 2019 at 8:14 AM Wes McKinney  wrote:
>
> I'm unable to join today but hope that participants can review the
> active DISCUSS threads
>
> On Tue, Sep 17, 2019 at 11:28 PM Neal Richardson
>  wrote:
> >
> > Hi all,
> > Belated reminder that the biweekly Arrow call is coming up in less than 12
> > hours at https://meet.google.com/vtm-teks-phx. All are welcome to join.
> > Notes will be sent out to the mailing list afterwards.
> >
> > Neal


[NIGHTLY] Arrow Build Report for Job nightly-2019-09-19-0

2019-09-19 Thread Crossbow


Arrow Build Report for Job nightly-2019-09-19-0

All tasks: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0

Failed Tasks:
- docker-dask-integration:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-circle-docker-dask-integration
- wheel-osx-cp27m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-travis-wheel-osx-cp27m
- docker-cpp-fuzzit:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-circle-docker-cpp-fuzzit
- docker-spark-integration:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-circle-docker-spark-integration
- docker-pandas-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-circle-docker-pandas-master

Succeeded Tasks:
- docker-r:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-circle-docker-r
- docker-lint:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-circle-docker-lint
- wheel-manylinux1-cp27mu:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-travis-wheel-manylinux1-cp27mu
- wheel-manylinux2010-cp36m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-travis-wheel-manylinux2010-cp36m
- wheel-manylinux1-cp37m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-travis-wheel-manylinux1-cp37m
- wheel-manylinux2010-cp27mu:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-travis-wheel-manylinux2010-cp27mu
- wheel-manylinux2010-cp27m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-travis-wheel-manylinux2010-cp27m
- wheel-osx-cp35m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-travis-wheel-osx-cp35m
- ubuntu-xenial-arm64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-azure-ubuntu-xenial-arm64
- conda-linux-gcc-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-azure-conda-linux-gcc-py36
- docker-rust:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-circle-docker-rust
- docker-r-conda:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-circle-docker-r-conda
- docker-cpp-release:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-circle-docker-cpp-release
- conda-linux-gcc-py27:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-azure-conda-linux-gcc-py27
- gandiva-jar-osx:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-travis-gandiva-jar-osx
- wheel-win-cp35m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-appveyor-wheel-win-cp35m
- wheel-manylinux2010-cp37m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-travis-wheel-manylinux2010-cp37m
- debian-stretch:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-azure-debian-stretch
- wheel-win-cp36m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-appveyor-wheel-win-cp36m
- docker-python-2.7-nopandas:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-circle-docker-python-2.7-nopandas
- docker-clang-format:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-circle-docker-clang-format
- wheel-osx-cp37m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-travis-wheel-osx-cp37m
- wheel-manylinux1-cp36m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-travis-wheel-manylinux1-cp36m
- docker-cpp-static-only:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-circle-docker-cpp-static-only
- docker-python-3.6:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-circle-docker-python-3.6
- docker-hdfs-integration:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-circle-docker-hdfs-integration
- wheel-manylinux1-cp35m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-travis-wheel-manylinux1-cp35m
- ubuntu-xenial:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-azure-ubuntu-xenial
- ubuntu-disco-arm64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-azure-ubuntu-disco-arm64
- ubuntu-bionic-arm64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-19-0-azure-ubuntu-bionic-arm64
- conda-osx-clang-py37:
  URL: 

[jira] [Created] (ARROW-6617) [Crossbow] Unify the version numbers generated by crossbow and rake

2019-09-19 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-6617:
--

 Summary: [Crossbow] Unify the version numbers generated by 
crossbow and rake
 Key: ARROW-6617
 URL: https://issues.apache.org/jira/browse/ARROW-6617
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging
Reporter: Krisztian Szucs


Crossbow's default package version (0.14.0.dev584) and rake apt:build/rake 
yum:build's default package version (0.15.0-dev20190918) are different. We need 
to unify them, and prefer the latter one.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [DISCUSS] C-level in-process array protocol

2019-09-19 Thread Antoine Pitrou


On 19/09/2019 at 09:39, Micah Kornfield wrote:
> I like the idea of a stable ABI for in-process data that can be used for
> in-process communication.  For instance, there was a recent question on
> Stack Overflow on how to solve this [1].
> 
> A couple of thoughts/questions:
> * Would ArrowArray also need a self reference for children arrays?

Yes, I forgot that.  I also think we don't need a separate Buffer
struct; instead, the Array struct should own all its buffers.

> * Should transferring key-value metadata be in scope?

Yes.  It could either be in the format string or a separate string.  The
upside of a separate string is that a consumer may ignore it trivially
if it doesn't need the information.
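
For concreteness, here is a very rough sketch of how buffers, children
and a metadata string could all be folded into the array struct (purely
illustrative, not a proposal; all field names below are invented):

struct ArrowArray {
  // Type and metadata description (format TBD)
  const char* format;
  const char* metadata;   // optional; trivially ignorable by consumers
  // Data description
  int64_t length;
  int64_t null_count;
  // Buffers owned by this struct (no separate Buffer struct)
  int64_t n_buffers;
  const void** buffers;
  // Child arrays for nested types (the struct references itself)
  int64_t n_children;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  // Called by the consumer when it doesn't need the array anymore;
  // would release children, dictionary and buffers recursively
  void (*release)(struct ArrowArray*);
  // Opaque producer data (for e.g. the release callback)
  void* private_data;
};

A consumer that is done with the data would then simply call
arr->release(arr), leaving the allocation strategy entirely up to the
producer.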

Another open question concerns nested types: does the format string
represent the entire type including children?  Or must child types be
read in the child arrays?  If we mimic ArrayData, then the format
string should represent the entire type; it will then be more complex to
parse.
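
To make the trade-off concrete for a list<int32> array (the format
characters, and the children field from the sketch above, are invented
for the example):

/* (a) the parent format string encodes the entire type,
   which needs a recursive parser: */
parent.format = "l(i)";

/* (b) each array describes only itself and child types are read
   from the child arrays; each piece is trivially parsable: */
parent.format = "l";
parent.children[0]->format = "i";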

We should also make sure that extension types fit in the protocol.

> * Should the API more closely align with the IPC spec (pass a schema separately
> and list of buffers instead of individual arrays)?

Then you have something that's not immediately usable (you have to do some
processing to reconstitute the individual arrays).  One goal here is to
minimize implementation costs for producers and consumers.  The
assumption is a data model similar to the C++ ArrayData model; do we
have implementations that use an entirely different model?  Perhaps I
should take a look :-)

Note that the draft I posted only concerns arrays.  We may also want to
have a C struct for batches or tables.
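
For instance, a record batch could either be exposed as a top-level
struct-typed array, or get a thin wrapper of its own.  A sketch of the
latter, again with invented names:

struct ArrowBatch {
  int64_t num_rows;
  int64_t n_columns;
  // Each column is an ArrowArray as sketched above
  struct ArrowArray** columns;
  // Called by the consumer when it doesn't need the batch anymore
  void (*release)(struct ArrowBatch*);
  // Opaque producer data (for e.g. the release callback)
  void* private_data;
};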

Regards

Antoine.


> 
> Thanks,
> Micah
> 
> [1]
> https://stackoverflow.com/questions/57966032/how-does-apache-arrow-facilitate-no-overhead-for-cross-system-communication/57967220#57967220
> 
> On Wed, Sep 18, 2019 at 10:52 AM Antoine Pitrou  wrote:
> 
>>
>> Hello,
>>
>> One thing that was discussed in the sync call is the ability to easily
>> pass arrays at runtime between Arrow implementations or Arrow-supporting
>> libraries in the same process, without bearing the cost of linking to
>> e.g. the C++ Arrow library.
>>
>> (for example: "Duckdb wants to provide an option to return Arrow data of
>> result sets, but they don't like having Arrow as a dependency")
>>
>> One possibility would be to define a C-level protocol similar in spirit
>> to the Python buffer protocol, which some people may be familiar with (*).
>>
>> The basic idea is to define a simple C struct, which is ABI-stable and
>> describes an Arrow array adequately.  The struct can be stack-allocated.
>> Its definition can also be copied in another project (or interfaced with
>> using a C FFI layer, depending on the language).
>>
>> There is no formal proposal, this message is meant to stir the discussion.
>>
>> Issues to work out:
>>
>> * Memory lifetime issues: where Python simply associates the Py_buffer
>> with a PyObject owner (a garbage-collected Python object), we need
>> another means to control lifetime of pointed areas.  One simple
>> possibility is to include a destructor function pointer in the protocol
>> struct.
>>
>> * Arrow type representation.  We probably need some kind of "format"
>> mini-language to represent Arrow types, so that a type can be described
>> using a `const char*`.  Ideally, primitive types at least should be
>> trivially parsable.  We may take inspiration from Python here (`struct`
>> module format characters, PEP 3118 format additions).
>>
>> Example C struct definition (not a formal proposal!):
>>
>> struct ArrowBuffer {
>>   void* data;
>>   int64_t nbytes;
>>   // Called by the consumer when it doesn't need the buffer anymore
>>   void (*release)(struct ArrowBuffer*);
>>   // Opaque user data (for e.g. the release callback)
>>   void* user_data;
>> };
>>
>> struct ArrowArray {
>>   // Type description
>>   const char* format;
>>   // Data description
>>   int64_t length;
>>   int64_t null_count;
>>   int64_t n_buffers;
>>   // Note: these pointers are probably owned by the ArrowArray struct
>>   // and will be released and free()ed by the release callback.
>>   struct ArrowBuffer* buffers;
>>   struct ArrowArray* dictionary;
>>   // Called by the consumer when it doesn't need the array anymore
>>   void (*release)(struct ArrowArray*);
>>   // Opaque user data (for e.g. the release callback)
>>   void* user_data;
>> };
>>
>> Thoughts?
>>
>> (*) For the record, the reference for the Python buffer protocol:
>> https://docs.python.org/3/c-api/buffer.html#buffer-structure
>> and its C struct definition:
>> https://github.com/python/cpython/blob/v3.7.4/Include/object.h#L181-L195
>>
>> Regards
>>
>> Antoine.
>>
> 


Re: [DISCUSS] C-level in-process array protocol

2019-09-19 Thread Micah Kornfield
I like the idea of a stable ABI for in-process data that can be used for
in-process communication.  For instance, there was a recent question on
Stack Overflow on how to solve this [1].

A couple of thoughts/questions:
* Would ArrowArray also need a self reference for children arrays?
* Should transferring key-value metadata be in scope?
* Should the API more closely align with the IPC spec (pass a schema separately
and list of buffers instead of individual arrays)?

Thanks,
Micah

[1]
https://stackoverflow.com/questions/57966032/how-does-apache-arrow-facilitate-no-overhead-for-cross-system-communication/57967220#57967220

On Wed, Sep 18, 2019 at 10:52 AM Antoine Pitrou  wrote:

>
> Hello,
>
> One thing that was discussed in the sync call is the ability to easily
> pass arrays at runtime between Arrow implementations or Arrow-supporting
> libraries in the same process, without bearing the cost of linking to
> e.g. the C++ Arrow library.
>
> (for example: "Duckdb wants to provide an option to return Arrow data of
> result sets, but they don't like having Arrow as a dependency")
>
> One possibility would be to define a C-level protocol similar in spirit
> to the Python buffer protocol, which some people may be familiar with (*).
>
> The basic idea is to define a simple C struct, which is ABI-stable and
> describes an Arrow array adequately.  The struct can be stack-allocated.
> Its definition can also be copied in another project (or interfaced with
> using a C FFI layer, depending on the language).
>
> There is no formal proposal, this message is meant to stir the discussion.
>
> Issues to work out:
>
> * Memory lifetime issues: where Python simply associates the Py_buffer
> with a PyObject owner (a garbage-collected Python object), we need
> another means to control lifetime of pointed areas.  One simple
> possibility is to include a destructor function pointer in the protocol
> struct.
>
> * Arrow type representation.  We probably need some kind of "format"
> mini-language to represent Arrow types, so that a type can be described
> using a `const char*`.  Ideally, primitive types at least should be
> trivially parsable.  We may take inspiration from Python here (`struct`
> module format characters, PEP 3118 format additions).
>
> Example C struct definition (not a formal proposal!):
>
> struct ArrowBuffer {
>   void* data;
>   int64_t nbytes;
>   // Called by the consumer when it doesn't need the buffer anymore
>   void (*release)(struct ArrowBuffer*);
>   // Opaque user data (for e.g. the release callback)
>   void* user_data;
> };
>
> struct ArrowArray {
>   // Type description
>   const char* format;
>   // Data description
>   int64_t length;
>   int64_t null_count;
>   int64_t n_buffers;
>   // Note: these pointers are probably owned by the ArrowArray struct
>   // and will be released and free()ed by the release callback.
>   struct ArrowBuffer* buffers;
>   struct ArrowArray* dictionary;
>   // Called by the consumer when it doesn't need the array anymore
>   void (*release)(struct ArrowArray*);
>   // Opaque user data (for e.g. the release callback)
>   void* user_data;
> };
>
> Thoughts?
>
> (*) For the record, the reference for the Python buffer protocol:
> https://docs.python.org/3/c-api/buffer.html#buffer-structure
> and its C struct definition:
> https://github.com/python/cpython/blob/v3.7.4/Include/object.h#L181-L195
>
> Regards
>
> Antoine.
>


Re: [DISCUSS] Improving Arrow columnar implementation guidelines for third parties

2019-09-19 Thread Micah Kornfield
Wes,
Let me see if I understand; I think there are two issues:
1.  Ensuring conformance and interoperability, and having people actually
understand what Arrow is and what it is not.
2.  Having users adopt reference implementations and surrounding libraries.

For 1, I agree we should have a way of measuring things here.  I think
being able to document the requirements of our test suite and have it
generate a report on supported features would go a long way toward letting
users understand the quality of both internal and external implementations.  It
seems like there is still a lot of misunderstanding of what Arrow is and
how it relates to other technologies.  An example of this is a recent Julia
thread [1], which seems to have both some misinformed commentary and
potentially some points that we could improve upon as a community.
Hopefully, some of this will be helped by separately versioning the
specification and the libraries post 1.0.0.

For 2, I agree that having people adopt our code (and hopefully contribute
back) is the ideal situation.  I think there are likely a few challenges
here:
* How amenable existing libraries are to embedding. Wes, you've started
other threads on how to make this adoption easier on the C++ side.
* How much of a value proposition there is in the reference libraries.
Arrow has seen good adoption in Python due to its support for Parquet and
Feather.  I assume that as the dataset and other projects get fleshed out this
will lead to further adoption.  Conformance to the specification is a
feature as well, but I would guess it's less important to many of the end
users of pyarrow who see it as a way of integrating with other non-Arrow
technologies.
* Technical limitations of the specification (for example, some processing
engines do need alternative encodings like RLE).

Am I understanding your points?

Thanks,
Micah

[1] https://discourse.julialang.org/t/arrow-feather-and-parquet/28739

On Tue, Sep 17, 2019 at 6:00 PM Wes McKinney  wrote:

> On Tue, Sep 17, 2019 at 7:09 PM Jacques Nadeau  wrote:
> >
> > >
> > > Let's take an example:
> > >
> > > * Dremio can execute SQL and uses Arrow as its native runtime format
> > > * Apache Spark can execute SQL and offers UDF support with Arrow
> > > format, i.e. so using Arrow for IO
> > >
> > > Both of these projects can say that they "use Apache Arrow", but the
> > > extent to which Arrow is a key ingredient may not be obvious to the
> > > average onlooker. To have more "Arrow-native" systems seems like one
> > > of the missions of the project.
> > >
> >
> > I'm not following you here. Are you suggesting that these systems are
> > Arrow-native or not Arrow-native? Or that one is and the other is not?
> > What does Arrow-native mean to you?
> >
> > Do you think there are enough problems around this right now that we need
> > to do something? It seems like you're concerned about people claiming they
> > are using Arrow when they aren't quite. Right now, it seems like the
> > community mostly benefits from people saying they are using Arrow. Have
> > you seen situations where users/consumers were frustrated because
> > something was Arrow but not really Arrow?
>
> I think it's good that using Arrow in some way has become a mark of
> quality for systems.
>
> My argument is mostly about brand quality control. Early on in Apache
> Arrow, some people who learned about the project asked me, essentially
> "what's the point of developing reference implementations if everyone
> 'just follows the specification'?". Even now people have said similar
> to me in the context of our occasional difficulties scaling our build
> and packaging, i.e. "why are you making your life so difficult
> building all this systems software, if the specification is all you
> really need to use Arrow?"
>
> In an extreme case, Apache Arrow could be a single Markdown document
> in a git repository describing the Arrow protocol and that's it.
>
> As a project insider who's been overseeing the development of the
> reference implementations, the prospect of a proliferation of
> implementations lacking in integration tests with each other terrifies
> me. This has already happened with the Parquet format in some ways.
>
> One of the raisons d'être of the project is interoperability. I would
> like for people to see "Arrow" and understand what they're getting, or
> at least be advised about where a project falls short of
> interoperability.
>
> - Wes
>