[jira] [Created] (ARROW-6911) [Java] Provide composite comparator

2019-10-16 Thread Liya Fan (Jira)
Liya Fan created ARROW-6911:
---

 Summary: [Java] Provide composite comparator
 Key: ARROW-6911
 URL: https://issues.apache.org/jira/browse/ARROW-6911
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


A composite comparator is a sub-class of VectorValueComparator that contains an 
array of inner comparators, with each comparator corresponding to one column 
for comparison. It can be used to support sort/comparison operations for 
VectorSchemaRoot/StructVector.

The composite comparator works like this: it first compares vector values with 
the first inner comparator (for the primary sort key). If that yields a 
non-zero value, it is returned directly; otherwise, the second comparator is 
used to break the tie, and so on, until some inner comparator produces a 
non-zero value or all inner comparators have been exhausted. 
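
For illustration only, a minimal Python sketch of the tie-breaking logic 
described above (the actual feature targets Java's VectorValueComparator; the 
names below are hypothetical):

```{python}
def composite_compare(comparators, left_index, right_index):
    """Compare two rows using a list of per-column comparators.

    Each comparator is a callable (left_index, right_index) -> int,
    ordered from the primary sort key to the least significant one.
    """
    for compare in comparators:
        result = compare(left_index, right_index)
        if result != 0:
            return result  # this key already breaks the tie
    return 0  # equal on every sort key


# Usage sketch with two columns held as plain lists
names = ["b", "a", "a"]
ages = [30, 25, 40]
comparators = [
    lambda l, r: (names[l] > names[r]) - (names[l] < names[r]),
    lambda l, r: (ages[l] > ages[r]) - (ages[l] < ages[r]),
]
print(composite_compare(comparators, 1, 2))  # names tie, ages break it -> -1
```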




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [Discuss] Streaming: Differentiate between length of RecordBatch and utilized portion-- common use-case?

2019-10-16 Thread Micah Kornfield
Hi John and Wes,

A few thoughts:
One issue we didn't get into in prior discussions is that the proposal
essentially changes the unit of exchange from RecordBatches to a segment of
a RecordBatch.

I think I brought this up earlier in the discussion: Trill [1], a columnar
streaming engine, illustrates an interesting idea.  If, over the desired
latency horizon, you aren't receiving enough messages to take advantage of
columnar analytics, the system probably has enough time to compact batches
after the fact for later analysis; conversely, if you are receiving many
events, you naturally get reasonably sized batches without having to do
further work.
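
As a rough sketch of that "compact after the fact" idea, assuming pyarrow and
made-up batch contents, many tiny batches can later be stitched into one
large, contiguous batch:

```{python}
import pyarrow as pa

# Hypothetical pile of tiny (1-row) batches accumulated at a low event rate
small_batches = [
    pa.RecordBatch.from_arrays([pa.array([i]), pa.array([float(i)])],
                               ["id", "value"])
    for i in range(1000)
]

# Compact after the fact: stitch the small batches together and rewrite the
# columns as single contiguous chunks for later columnar analysis.
table = pa.Table.from_batches(small_batches).combine_chunks()
compacted = table.to_batches()
print(len(compacted), compacted[0].num_rows)  # 1 batch of 1000 rows
```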


> I'm objecting to RecordBatch.length being inconsistent with the
> constituent field lengths, that's where the danger lies. If all of the
> lengths are consistent, no code changes are necessary.

John, is it a viable solution to keep all lengths in sync for the use case
you are imagining?

A solution I like less, but which might be viable: formally specify a
negative constant that signifies the length should be inherited from the
RecordBatch length (this could only be used on top-level fields).

> I contend that it can only be useful and will never be harmful.  What are
> the counter-examples of concrete harm?


I'm not sure there is anything obviously wrong; however, changes to
semantics are always dangerous.  One blemish on the current proposal is
that one can't easily determine whether a mismatch in row length is a
programming error or intentional.

[1]
https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/trill-vldb2015.pdf

On Wed, Oct 16, 2019 at 4:41 PM John Muehlhausen  wrote:

> "that's where the danger lies"
>
> What danger?  I have no idea what the specific danger is, assuming that all
> reference implementations have test cases that hedge around this.
>
> I contend that it can only be useful and will never be harmful.  What are
> the counter-examples of concrete harm?
>


Re: [Discuss][FlightRPC] Extensions to Flight: "DoBidirectional"

2019-10-16 Thread David Li
I was definitely considering having control messages without data, and
I thought that could be encoded by a FlightData with only app_metadata
set. I think I understand your position now: FlightData should always
carry (some) data (with optional metadata)?

That makes sense to me, and is consistent with the documentation on
FlightData in the Protobuf file. I was worried about having a
redundant metadata field, but oneof prevents that from happening, and
overall having a clear separation between data and control messages is
cleaner.

As for using Protobuf's Any: so far, we've refrained from exposing
Protobuf by using bytes; would we want to change that now?

Best,
David

On 10/16/19, Jacques Nadeau  wrote:
> Hey David,
>
> RE: Async: I was trying to match the pattern we use for DoGet/DoPut for
> async. Yes, I was thinking more of Java, given Java gRPC's async-always
> pattern.
>
> On the comment around FlightData, I think it is overloading the message to
> use metadata for this. If I want to send a control message independently
> of the data message, I would have to define something like an empty flight
> data message that has custom metadata. Why not support a container object
> with a oneof{FlightData, Any} in it instead, so users can add more data as
> desired? The default impl could be a no-op for the Any messages.
>
> On Tue, Oct 15, 2019 at 6:50 PM David Li  wrote:
>
>> Hi Jacques,
>>
>> Thanks for the comments.
>>
>> - I do agree DoExchange is a better name!
>> - FlightData already has metadata fields as a result of prior
>> proposals, so I don't think we need a new message to carry that kind
>> of information.
>> - I like the suggestion of an async handler to handle incoming
>> messages as the fundamental API; it would actually be quite natural to
>> implement in Flight/Java. I will note that it's not possible in
>> C++/Python without spawning a thread, though. (In essence, gRPC-Java
>> is async-always and gRPC-C++ is sync-always.) There are experimental
>> C++ APIs that would let us do something similar to Java, but those are
>> only in relatively recent gRPC versions and are still under
>> development (contrary to the interceptor APIs which have been around
>> for quite a while).
>>
>> Thanks,
>> David
>>
>> On 10/15/19, Jacques Nadeau  wrote:
> >> > I like it. Added some comments to the doc. Might be worth discussing
> >> > here depending on your thoughts.
>> >
>> > On Tue, Oct 15, 2019 at 7:11 AM David Li  wrote:
>> >
>> >> Hey Ryan,
>> >>
>> >> Thanks for the comments.
>> >>
>> >> Concrete example: I've edited the doc to provide a Python strawman.
>> >>
>> >> Sync vs async: while I don't touch on it, you could interleave uploads
>> >> and downloads if you were so inclined. Right now, synchronous APIs
>> >> make this error-prone, e.g. if both client and server wait for each
>> >> other due to an application logic bug. (gRPC doesn't give us the
>> >> ability to have per-read timeouts, only an overall timeout.) As an
>> >> example of this happening with DoPut, see ARROW-6063:
>> >> https://issues.apache.org/jira/browse/ARROW-6063
>> >>
>> >> This is mostly tangential though, eventually we will want to design
>> >> asynchronous APIs for Flight as a whole. A bidirectional stream like
>> >> this (and like DoPut) just makes these pitfalls easier to run into.
>> >>
>> >> Using DoPut+DoGet: I discussed this in the proposal, but the main
>> >> concern is that depending on how you deploy, two separate calls could
>> >> get routed to different instances. Additionally, gRPC has some
>> >> reconnection behaviors; if the server goes away in between the two
>> >> calls, but it then restarts or there is another instance available,
>> >> the client will happily reconnect to the new server without warning.
>> >>
>> >> Thanks,
>> >> David
>> >>
>> >> On 10/15/19, Ryan Murray  wrote:
>> >> > Hey David,
>> >> >
>> >> > I think this proposal makes a lot of sense. I like it and the
>> >> > possibility of remote compute via arrow buffers. One thing that would
>> >> > help me would be a concrete example of the API in a real-life use
>> >> > case. Also, what would the client experience be in terms of sync vs
>> >> > async? Would the client block until the bidirectional call returns,
>> >> > i.e. c = flight.vector_mult(a, b), or would the client wait to be
>> >> > signaled that computation was done? If the latter, how is that
>> >> > different from a DoPut then DoGet? I suppose that this could be
>> >> > implemented without extending the RPC interface but rather by a
>> >> > function/util?
>> >> >
>> >> >
>> >> > Best,
>> >> >
>> >> > Ryan
>> >> >
>> >> > On Sun, Oct 13, 2019 at 9:24 PM David Li 
>> wrote:
>> >> >
>> >> >> Hi all,
>> >> >>
>> >> >> We've been using Flight quite successfully so far, but we have
>> >> >> identified a new use case on the horizon: being able to both send
>> >> >> and
>> >> >> retrieve Arrow data within a single RPC call. To that end, I've
>> >> >> written 

[jira] [Created] (ARROW-6910) pyarrow.parquet.read_table(...) takes up lots of memory which is not released until program exits

2019-10-16 Thread V Luong (Jira)
V Luong created ARROW-6910:
--

 Summary: pyarrow.parquet.read_table(...) takes up lots of memory 
which is not released until program exits
 Key: ARROW-6910
 URL: https://issues.apache.org/jira/browse/ARROW-6910
 Project: Apache Arrow
  Issue Type: Bug
Affects Versions: 0.15.0
Reporter: V Luong


I notice that when I read a lot of Parquet files using 
pyarrow.parquet.read_table(...), my program's memory usage becomes very 
bloated, even though I don't keep the table objects after converting them to 
Pandas DataFrames.

You can try this in an interactive Python shell to reproduce this problem:

```{python}
from pyarrow.parquet import read_table

for path in paths_of_a_bunch_of_big_parquet_files:
    # note that I'm not assigning the read_table(...) result to anything,
    # so I'm not creating any new objects at all
    read_table(path, use_threads=True, memory_map=False)
```

After the for loop above, if you check the memory usage (e.g. with htop), 
you'll see that the Python program has taken up a lot of memory. That memory 
is only released when you exit() from Python.
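
A small sketch (with hypothetical file paths) that may help distinguish memory
retained by Arrow's allocator from memory held by lingering Python objects;
pyarrow.total_allocated_bytes() reports what the default memory pool still
holds:

```{python}
import gc
import pyarrow as pa
from pyarrow.parquet import read_table

# hypothetical paths, standing in for the big Parquet files above
paths_of_a_bunch_of_big_parquet_files = ["part-0.parquet", "part-1.parquet"]

for path in paths_of_a_bunch_of_big_parquet_files:
    read_table(path, use_threads=True, memory_map=False)

gc.collect()  # make sure no Table objects linger on the Python side

# Bytes still held by Arrow's default memory pool; if this stays large after
# the tables are unreachable, the memory is being retained by the allocator
# rather than by Python objects.
print(pa.total_allocated_bytes())
```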

This problem means that my compute jobs using PyArrow currently need to use 
bigger server instances than I think is necessary, which translates to 
significant extra cost.





--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [Discuss] Streaming: Differentiate between length of RecordBatch and utilized portion-- common use-case?

2019-10-16 Thread John Muehlhausen
"that's where the danger lies"

What danger?  I have no idea what the specific danger is, assuming that all
reference implementations have test cases that hedge around this.

I contend that it can only be useful and will never be harmful.  What are
the counter-examples of concrete harm?


[jira] [Created] (ARROW-6909) [Python] Define PyObjectBuffer with Py_XDECREF logic in destructor for object array memory

2019-10-16 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6909:
---

 Summary: [Python] Define PyObjectBuffer with Py_XDECREF logic in 
destructor for object array memory
 Key: ARROW-6909
 URL: https://issues.apache.org/jira/browse/ARROW-6909
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney


Possible follow up to ARROW-6874



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [Discuss] Streaming: Differentiate between length of RecordBatch and utilized portion-- common use-case?

2019-10-16 Thread Wes McKinney
On Wed, Oct 16, 2019 at 12:32 PM John Muehlhausen  wrote:
>
> I really need to "get into the zone" on some other development today, but I
> want to remind us of something earlier in the thread that gave me the
> impression I wasn't stomping on too many paradigms with this proposal:
>
> Wes: ``So the "length" field in RecordBatch is already the utilized number
> of rows. The body buffers can certainly have excess unused space. So
> your application can mutate Flatbuffer "length" field in-place as
> new records are filled in.''

I'm objecting to RecordBatch.length being inconsistent with the
constituent field lengths, that's where the danger lies. If all of the
lengths are consistent, no code changes are necessary.

> If RecordBatch.length is the utilized number of rows then my PR makes this
> actually true.  Yes, we need it in a handful of implementations.  I'm
> willing to provide all of them.  To me that is the lowest complexity
> solution.
>
> -John
>
> On Wed, Oct 16, 2019 at 10:45 AM Wes McKinney  wrote:
>
> > On Wed, Oct 16, 2019 at 10:17 AM John Muehlhausen  wrote:
> > >
> > > "pyarrow is intended as a developer-facing library, not a user-facing
> > one"
> > >
> > > Is that really the core issue?  I doubt you would want to add this
> > proposed
> > > logic to pandas even though it is user-facing, because then pandas will
> > > either have to re-implement what it means to read a batch (to respect
> > > length when it is smaller than array length) or else rely on the single
> > > blessed custom metadata for doing this, which doesn't make it custom
> > > anymore.
> >
> > What you have proposed in your PR amounts to an alteration of the IPC
> > format to suit this use case. This pushes complexity onto _every_
> > implementation that will need to worry about a "truncated" record
> > batch. I'd rather avoid this unless it is truly the only way.
> >
> > Note that we serialize a significant amount of custom metadata already
> > to address pandas-specific issues, and have not had to make any
> > changes to the columnar format as a result.
> >
> > > I think really your concern is that perhaps nobody wants this but me,
> > > therefore it should not be in arrow or pandas regardless of whether it is
> > > user-facing?  But, if that is your thinking, is it true?  What is our
> > > solution to the locality/latency problem for systems that ingest and
> > > process concurrently, if not this solution?  I do see it as a general
> > > problem that needs at least the beginnings of a general solution... not a
> > > "custom" one.
> >
> > We use the custom_metadata fields to implement a number of built-in
> > things in the project, such as extension types. If enough people find
> > this useful, then it can be promoted to a formalized concept. As far
> > as I can tell, you have developed quite a bit of custom code related
> > to this for your application, including manipulating Flatbuffers
> > metadata in place to maintain the populated length, so the barrier to
> > entry to being able to properly take advantage of this is rather high.
> >
> > > Also, I wonder whether it is true that pyarrow avoids smart/magical
> > > things.  The entire concept of a "Table" seems to be in that category?
> > The
> > > docs specifically mention that it is for convenience.
> > >
> >
> > Table arose out of legitimate developer need. There are a number of
> > areas of the project that would be much more difficult if we had to
> > worry about regularizing column chunking at any call site that returns
> > an in-memory dataset.
> >
> > > I'd like to focus on two questions:
> > > 1- What is the Arrow general solution to the locality/latency tradeoff
> > > problem for systems that ingest and process data concurrently?  This
> > > proposed solution or something else?  Or if we propose not to address the
> > > problem, why?
> > > 2- What will the proposed change negatively impact?  It seems that all we
> > > are talking about is respecting batch length if arrays happen to be
> > longer.
> >
> > I'm suggesting to help you solve the post-read truncation problem
> > without modifying the IPC protocol. If you want to make things work
> > for the users without knowledge, I think this can be achieved through
> > a plug-in API to define a metadata handler-callback to apply the
> > truncation to the record batches.
> >
> > > Thanks,
> > > -John
> > >
> > > On Wed, Oct 16, 2019 at 8:37 AM Wes McKinney 
> > wrote:
> > >
> > > > hi John,
> > > >
> > > > > As a practical matter, the reason metadata is not a good solution
> > for me
> > > > is that it requires awareness on the part of the reader.  I want
> > (e.g.) a
> > > > researcher in Python to be able to map a file of batches in IPC format
> > > > without needing to worry about the fact that the file was built in a
> > > > streaming fashion and therefore has some unused array elements.
> > > >
> > > > I don't find this argument to be persuasive.
> > > >
> > > > pyarrow is intended as a developer-facing 

Re: [Discuss] Streaming: Differentiate between length of RecordBatch and utilized portion-- common use-case?

2019-10-16 Thread Wes McKinney
hi John,

On Wed, Oct 16, 2019 at 11:59 AM John Muehlhausen  wrote:
>
> I'm in Python, I'm a user, and I'm not allowed to import pyarrow because it
> isn't for me.

I think you're misrepresenting what I'm saying.

It's our expectation that users will largely consume pyarrow
indirectly as a dependency rather than using it directly. Not every
piece of software needs to be designed around the needs of end users.

>
> There exist some Arrow record batches in plasma.  I need to get one slice
> of one batch as a pandas dataframe.
>
> What do I do?
>
> There exist some Arrow record batches in a file.  I need to get one slice
> of one batch as a pandas dataframe.
>
> What do I do?
>
> Are you contemplating that all of the above is possible using only
> pandas APIs?
>
> Does "one slice of one batch" go away once pandas (version 2) does not
> require conversion, since it will be zero copy and the user can slice in
> pandas with no performance hit?
>
> I'm really stumbling over this idea that users can't import pyarrow.  I'm
> not sure it makes sense to continue to discuss user-level IPC (plasma,
> files, etc) until I can come to grips with how users use pyarrow without
> importing it.

I'm not saying that. I'm suggesting that you should provide your users
with convenient functions that handle the low-level details for them
as intended automatically.

>
> Once I see how it works without my proposed change, we can go back to how
> the user ignores the empty/undefined array portions without knowing whether
> they exist.
>
> -John
>
> On Wed, Oct 16, 2019 at 10:45 AM Wes McKinney  wrote:
>
> > On Wed, Oct 16, 2019 at 10:17 AM John Muehlhausen  wrote:
> > >
> > > "pyarrow is intended as a developer-facing library, not a user-facing
> > one"
> > >
> > > Is that really the core issue?  I doubt you would want to add this
> > proposed
> > > logic to pandas even though it is user-facing, because then pandas will
> > > either have to re-implement what it means to read a batch (to respect
> > > length when it is smaller than array length) or else rely on the single
> > > blessed custom metadata for doing this, which doesn't make it custom
> > > anymore.
> >
> > What you have proposed in your PR amounts to an alteration of the IPC
> > format to suit this use case. This pushes complexity onto _every_
> > implementation that will need to worry about a "truncated" record
> > batch. I'd rather avoid this unless it is truly the only way.
> >
> > Note that we serialize a significant amount of custom metadata already
> > to address pandas-specific issues, and have not had to make any
> > changes to the columnar format as a result.
> >
> > > I think really your concern is that perhaps nobody wants this but me,
> > > therefore it should not be in arrow or pandas regardless of whether it is
> > > user-facing?  But, if that is your thinking, is it true?  What is our
> > > solution to the locality/latency problem for systems that ingest and
> > > process concurrently, if not this solution?  I do see it as a general
> > > problem that needs at least the beginnings of a general solution... not a
> > > "custom" one.
> >
> > We use the custom_metadata fields to implement a number of built-in
> > things in the project, such as extension types. If enough people find
> > this useful, then it can be promoted to a formalized concept. As far
> > as I can tell, you have developed quite a bit of custom code related
> > to this for your application, including manipulating Flatbuffers
> > metadata in place to maintain the populated length, so the barrier to
> > entry to being able to properly take advantage of this is rather high.
> >
> > > Also, I wonder whether it is true that pyarrow avoids smart/magical
> > > things.  The entire concept of a "Table" seems to be in that category?
> > The
> > > docs specifically mention that it is for convenience.
> > >
> >
> > Table arose out of legitimate developer need. There are a number of
> > areas of the project that would be much more difficult if we had to
> > worry about regularizing column chunking at any call site that returns
> > an in-memory dataset.
> >
> > > I'd like to focus on two questions:
> > > 1- What is the Arrow general solution to the locality/latency tradeoff
> > > problem for systems that ingest and process data concurrently?  This
> > > proposed solution or something else?  Or if we propose not to address the
> > > problem, why?
> > > 2- What will the proposed change negatively impact?  It seems that all we
> > > are talking about is respecting batch length if arrays happen to be
> > longer.
> >
> > I'm suggesting to help you solve the post-read truncation problem
> > without modifying the IPC protocol. If you want to make things work
> > for the users without knowledge, I think this can be achieved through
> > a plug-in API to define a metadata handler-callback to apply the
> > truncation to the record batches.
> >
> > > Thanks,
> > > -John
> > >
> > > On Wed, Oct 16, 2019 at 8:37 AM 

[jira] [Created] (ARROW-6908) Add support for Bazel

2019-10-16 Thread Aryan Naraghi (Jira)
Aryan Naraghi created ARROW-6908:


 Summary: Add support for Bazel
 Key: ARROW-6908
 URL: https://issues.apache.org/jira/browse/ARROW-6908
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Aryan Naraghi


I would like to use Arrow in a C++ project that uses Bazel.

 

Would it be possible to add support for building Arrow using Bazel?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6907) Allow Plasma store to batch notifications to clients

2019-10-16 Thread Danyang (Jira)
Danyang created ARROW-6907:
--

 Summary: Allow Plasma store to batch notifications to clients
 Key: ARROW-6907
 URL: https://issues.apache.org/jira/browse/ARROW-6907
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Plasma
Reporter: Danyang






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6906) Use re2 instead of std::regex in Dataset partition schemes implementation

2019-10-16 Thread Prudhvi Porandla (Jira)
Prudhvi Porandla created ARROW-6906:
---

 Summary: Use re2 instead of std::regex in Dataset partition schemes 
implementation
 Key: ARROW-6906
 URL: https://issues.apache.org/jira/browse/ARROW-6906
 Project: Apache Arrow
  Issue Type: Task
Reporter: Prudhvi Porandla
Assignee: Ben Kietzman


std::regex is not implemented in older versions (< 4.9) of GCC



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [Discuss] Streaming: Differentiate between length of RecordBatch and utilized portion-- common use-case?

2019-10-16 Thread John Muehlhausen
I'm in Python, I'm a user, and I'm not allowed to import pyarrow because it
isn't for me.

There exist some Arrow record batches in plasma.  I need to get one slice
of one batch as a pandas dataframe.

What do I do?

There exist some Arrow record batches in a file.  I need to get one slice
of one batch as a pandas dataframe.

What do I do?

Are you contemplating that all of the above is possible using only
pandas APIs?

Does "one slice of one batch" go away once pandas (version 2) does not
require conversion, since it will be zero copy and the user can slice in
pandas with no performance hit?

I'm really stumbling over this idea that users can't import pyarrow.  I'm
not sure it makes sense to continue to discuss user-level IPC (plasma,
files, etc) until I can come to grips with how users use pyarrow without
importing it.

Once I see how it works without my proposed change, we can go back to how
the user ignores the empty/undefined array portions without knowing whether
they exist.

-John

On Wed, Oct 16, 2019 at 10:45 AM Wes McKinney  wrote:

> On Wed, Oct 16, 2019 at 10:17 AM John Muehlhausen  wrote:
> >
> > "pyarrow is intended as a developer-facing library, not a user-facing
> one"
> >
> > Is that really the core issue?  I doubt you would want to add this
> proposed
> > logic to pandas even though it is user-facing, because then pandas will
> > either have to re-implement what it means to read a batch (to respect
> > length when it is smaller than array length) or else rely on the single
> > blessed custom metadata for doing this, which doesn't make it custom
> > anymore.
>
> What you have proposed in your PR amounts to an alteration of the IPC
> format to suit this use case. This pushes complexity onto _every_
> implementation that will need to worry about a "truncated" record
> batch. I'd rather avoid this unless it is truly the only way.
>
> Note that we serialize a significant amount of custom metadata already
> to address pandas-specific issues, and have not had to make any
> changes to the columnar format as a result.
>
> > I think really your concern is that perhaps nobody wants this but me,
> > therefore it should not be in arrow or pandas regardless of whether it is
> > user-facing?  But, if that is your thinking, is it true?  What is our
> > solution to the locality/latency problem for systems that ingest and
> > process concurrently, if not this solution?  I do see it as a general
> > problem that needs at least the beginnings of a general solution... not a
> > "custom" one.
>
> We use the custom_metadata fields to implement a number of built-in
> things in the project, such as extension types. If enough people find
> this useful, then it can be promoted to a formalized concept. As far
> as I can tell, you have developed quite a bit of custom code related
> to this for your application, including manipulating Flatbuffers
> metadata in place to maintain the populated length, so the barrier to
> entry to being able to properly take advantage of this is rather high.
>
> > Also, I wonder whether it is true that pyarrow avoids smart/magical
> > things.  The entire concept of a "Table" seems to be in that category?
> The
> > docs specifically mention that it is for convenience.
> >
>
> Table arose out of legitimate developer need. There are a number of
> areas of the project that would be much more difficult if we had to
> worry about regularizing column chunking at any call site that returns
> an in-memory dataset.
>
> > I'd like to focus on two questions:
> > 1- What is the Arrow general solution to the locality/latency tradeoff
> > problem for systems that ingest and process data concurrently?  This
> > proposed solution or something else?  Or if we propose not to address the
> > problem, why?
> > 2- What will the proposed change negatively impact?  It seems that all we
> > are talking about is respecting batch length if arrays happen to be
> longer.
>
> I'm suggesting to help you solve the post-read truncation problem
> without modifying the IPC protocol. If you want to make things work
> for the users without knowledge, I think this can be achieved through
> a plug-in API to define a metadata handler-callback to apply the
> truncation to the record batches.
>
> > Thanks,
> > -John
> >
> > On Wed, Oct 16, 2019 at 8:37 AM Wes McKinney 
> wrote:
> >
> > > hi John,
> > >
> > > > As a practical matter, the reason metadata is not a good solution
> for me
> > > is that it requires awareness on the part of the reader.  I want
> (e.g.) a
> > > researcher in Python to be able to map a file of batches in IPC format
> > > without needing to worry about the fact that the file was built in a
> > > streaming fashion and therefore has some unused array elements.
> > >
> > > I don't find this argument to be persuasive.
> > >
> > > pyarrow is intended as a developer-facing library, not a user-facing
> > > one. I don't think you should be having the kinds of users you are
> > > describing using pyarrow 

Re: [NIGHTLY] Arrow Build Report for Job nightly-2019-10-16-0

2019-10-16 Thread Krisztián Szűcs
The OSX builds are failing because Homebrew tries to compile the
dependencies instead of installing the precompiled binaries.
It might be because of the outdated Xcode version we use; perhaps brew has
stopped providing binaries for older Xcode.
I've created a tracking Jira:
https://issues.apache.org/jira/browse/ARROW-6905

On Wed, Oct 16, 2019 at 8:16 AM Crossbow  wrote:

>
> Arrow Build Report for Job nightly-2019-10-16-0
>
> All tasks:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0
>
> Failed Tasks:
> - wheel-manylinux1-cp27mu:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-travis-wheel-manylinux1-cp27mu
> - wheel-win-cp37m:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-appveyor-wheel-win-cp37m
> - docker-clang-format:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-circle-docker-clang-format
> - wheel-manylinux1-cp37m:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-travis-wheel-manylinux1-cp37m
> - gandiva-jar-osx:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-travis-gandiva-jar-osx
> - wheel-osx-cp35m:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-travis-wheel-osx-cp35m
> - ubuntu-disco:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-azure-ubuntu-disco
> - debian-stretch:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-azure-debian-stretch
> - wheel-manylinux2010-cp27mu:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-travis-wheel-manylinux2010-cp27mu
> - wheel-win-cp36m:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-appveyor-wheel-win-cp36m
> - wheel-osx-cp37m:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-travis-wheel-osx-cp37m
> - debian-buster:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-azure-debian-buster
> - wheel-manylinux1-cp35m:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-travis-wheel-manylinux1-cp35m
> - wheel-manylinux2010-cp35m:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-travis-wheel-manylinux2010-cp35m
> - homebrew-cpp:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-travis-homebrew-cpp
> - gandiva-jar-trusty:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-travis-gandiva-jar-trusty
> - wheel-manylinux2010-cp37m:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-travis-wheel-manylinux2010-cp37m
> - wheel-manylinux2010-cp36m:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-travis-wheel-manylinux2010-cp36m
> - wheel-osx-cp27m:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-travis-wheel-osx-cp27m
> - wheel-osx-cp36m:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-travis-wheel-osx-cp36m
> - conda-linux-gcc-py37:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-azure-conda-linux-gcc-py37
> - wheel-manylinux1-cp36m:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-travis-wheel-manylinux1-cp36m
> - conda-linux-gcc-py36:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-azure-conda-linux-gcc-py36
> - conda-linux-gcc-py27:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-azure-conda-linux-gcc-py27
>
> Succeeded Tasks:
> - docker-dask-integration:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-circle-docker-dask-integration
> - centos-7:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-azure-centos-7
> - docker-python-2.7:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-circle-docker-python-2.7
> - docker-spark-integration:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-circle-docker-spark-integration
> - docker-turbodbc-integration:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-circle-docker-turbodbc-integration
> - docker-cpp-release:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-circle-docker-cpp-release
> - wheel-manylinux2010-cp27m:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-travis-wheel-manylinux2010-cp27m
> - docker-cpp-cmake32:
>   URL:
> 

[jira] [Created] (ARROW-6905) [Packaging][OSX] Nightly builds on MacOS are failing because of brew compile timeouts

2019-10-16 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-6905:
--

 Summary: [Packaging][OSX] Nightly builds on MacOS are failing 
because of brew compile timeouts
 Key: ARROW-6905
 URL: https://issues.apache.org/jira/browse/ARROW-6905
 Project: Apache Arrow
  Issue Type: Bug
  Components: Packaging
Reporter: Krisztian Szucs


Homebrew in our packaging builds has recently started to compile the 
dependencies instead of installing precompiled binaries. I'm not sure what 
the issue is; perhaps it is because of the outdated Xcode version.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Arrow sync call October 16 at 12:00 US/Eastern, 16:00 UTC

2019-10-16 Thread Neal Richardson
Attendees:

Micah Kornfield
Uwe Korn
Bryan Cutler
Rok Mihevc
Prudhvi Porandla
Ursa Labs (Antoine, Ben, François, Joris, Krisztián, Neal, Wes, in the
same room!)


Discussion:
* Cython in conda: Uwe to update
* When to do 0.15.1? There are only 2 open issues left tagged with
0.15.1. Only bug fixes. Krisztián will be release manager
* Bryan: add map array to pyarrow for spark compat, want to get it in for 1.0
* Followup on result vs. status from last time. Results API to be
internal, bindings always to have Status return? Or bind to Result for
new APIs? Circle back on mailing list
* Java adapters: a few open discussions
* Use of std::regex in dataset partition detection: not well supported
on older GCC. Ben to replace.
* Issue with failing macOS/Homebrew builds: will make a Jira

On Wed, Oct 16, 2019 at 6:24 AM Neal Richardson
 wrote:
>
> Hi all, our biweekly call is coming up in a couple of hours at
> https://meet.google.com/vtm-teks-phx. All are welcome to join. Notes
> will be sent out to the mailing list afterwards.
>
> Neal


[jira] [Created] (ARROW-6904) [Python] Implement MapArray and MapType

2019-10-16 Thread Bryan Cutler (Jira)
Bryan Cutler created ARROW-6904:
---

 Summary: [Python] Implement MapArray and MapType
 Key: ARROW-6904
 URL: https://issues.apache.org/jira/browse/ARROW-6904
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Bryan Cutler
Assignee: Bryan Cutler
 Fix For: 1.0.0


Map arrays have already been added to C++; they need to be exposed in the Python API as well



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [Discuss] Streaming: Differentiate between length of RecordBatch and utilized portion-- common use-case?

2019-10-16 Thread Wes McKinney
On Wed, Oct 16, 2019 at 10:17 AM John Muehlhausen  wrote:
>
> "pyarrow is intended as a developer-facing library, not a user-facing one"
>
> Is that really the core issue?  I doubt you would want to add this proposed
> logic to pandas even though it is user-facing, because then pandas will
> either have to re-implement what it means to read a batch (to respect
> length when it is smaller than array length) or else rely on the single
> blessed custom metadata for doing this, which doesn't make it custom
> anymore.

What you have proposed in your PR amounts to an alteration of the IPC
format to suit this use case. This pushes complexity onto _every_
implementation that will need to worry about a "truncated" record
batch. I'd rather avoid this unless it is truly the only way.

Note that we serialize a significant amount of custom metadata already
to address pandas-specific issues, and have not had to make any
changes to the columnar format as a result.

> I think really your concern is that perhaps nobody wants this but me,
> therefore it should not be in arrow or pandas regardless of whether it is
> user-facing?  But, if that is your thinking, is it true?  What is our
> solution to the locality/latency problem for systems that ingest and
> process concurrently, if not this solution?  I do see it as a general
> problem that needs at least the beginnings of a general solution... not a
> "custom" one.

We use the custom_metadata fields to implement a number of built-in
things in the project, such as extension types. If enough people find
this useful, then it can be promoted to a formalized concept. As far
as I can tell, you have developed quite a bit of custom code related
to this for your application, including manipulating Flatbuffers
metadata in place to maintain the populated length, so the barrier to
entry to being able to properly take advantage of this is rather high.

> Also, I wonder whether it is true that pyarrow avoids smart/magical
> things.  The entire concept of a "Table" seems to be in that category?  The
> docs specifically mention that it is for convenience.
>

Table arose out of legitimate developer need. There are a number of
areas of the project that would be much more difficult if we had to
worry about regularizing column chunking at any call site that returns
an in-memory dataset.

> I'd like to focus on two questions:
> 1- What is the Arrow general solution to the locality/latency tradeoff
> problem for systems that ingest and process data concurrently?  This
> proposed solution or something else?  Or if we propose not to address the
> problem, why?
> 2- What will the proposed change negatively impact?  It seems that all we
> are talking about is respecting batch length if arrays happen to be longer.

I'm suggesting to help you solve the post-read truncation problem
without modifying the IPC protocol. If you want to make things work
for the users without knowledge, I think this can be achieved through
a plug-in API to define a metadata handler-callback to apply the
truncation to the record batches.
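
To make the idea concrete, a rough sketch of what such a truncation handler
could look like on the pyarrow side; the b'populated_length' metadata key is
hypothetical, and this is not an existing plug-in API:

```{python}
import pyarrow as pa

def truncate_batch(batch):
    """Hypothetical handler: honor a 'populated_length' schema-metadata key."""
    meta = batch.schema.metadata or {}
    populated = meta.get(b"populated_length")
    if populated is None:
        return batch
    # Zero-copy view over only the populated rows
    return batch.slice(0, int(populated))

# Demo: a pre-allocated batch of 4 rows of which only 2 are populated
batch = pa.RecordBatch.from_arrays([pa.array([1, 2, 0, 0])], ["x"])
batch = batch.replace_schema_metadata({b"populated_length": b"2"})
print(truncate_batch(batch).num_rows)  # 2
```

In a reader, the same handler could be applied to each batch coming out of
pa.ipc.open_stream before the data is handed to the user.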

> Thanks,
> -John
>
> On Wed, Oct 16, 2019 at 8:37 AM Wes McKinney  wrote:
>
> > hi John,
> >
> > > As a practical matter, the reason metadata is not a good solution for me
> > is that it requires awareness on the part of the reader.  I want (e.g.) a
> > researcher in Python to be able to map a file of batches in IPC format
> > without needing to worry about the fact that the file was built in a
> > streaming fashion and therefore has some unused array elements.
> >
> > I don't find this argument to be persuasive.
> >
> > pyarrow is intended as a developer-facing library, not a user-facing
> > one. I don't think you should be having the kinds of users you are
> > describing using pyarrow directly, instead consuming the library
> > through a layer above it. Specifically, we are deliberately avoiding
> > doing anything too "smart" or "magical", instead maintaining tight
> > developer control over what is going on.
> >
> > - Wes
> >
> > On Wed, Oct 16, 2019 at 2:18 AM Micah Kornfield 
> > wrote:
> > >
> > > Still thinking through the implications here, but to save others from
> > > having to go search [1] is the PR.
> > >
> > > [1] https://github.com/apache/arrow/pull/5663/files
> > >
> > > On Tue, Oct 15, 2019 at 1:42 PM John Muehlhausen  wrote:
> > >
> > > > A proposal with linked PR now exists in ARROW-5916 and Wes commented
> > that
> > > > we should kick it around some more.
> > > >
> > > > The high-level topic is how Apache Arrow intersects with streaming
> > > > methodologies:
> > > >
> > > > If record batches are strictly immutable, a difficult trade-off is
> > created
> > > > for streaming data collection: either I can have low-latency
> > presentation
> > > > of new data by appending very small batches (often 1 row) to the IPC
> > stream
> > > > and lose columnar layout benefits, or I can have high-latency
> > presentation
> > > > of new data by waiting to append a batch until it is 

Re: [Discuss][FlightRPC] Extensions to Flight: "DoBidirectional"

2019-10-16 Thread Jacques Nadeau
Hey David,

RE: Async: I was trying to match the pattern we use for DoGet/DoPut for
async. Yes, I was thinking more of Java, given Java gRPC's async-always
pattern.

On the comment around FlightData, I think it is overloading the message to
use metadata for this. If I want to send a control message independently
of the data message, I would have to define something like an empty flight
data message that has custom metadata. Why not support a container object
with a oneof{FlightData, Any} in it instead, so users can add more data as
desired? The default impl could be a no-op for the Any messages.

On Tue, Oct 15, 2019 at 6:50 PM David Li  wrote:

> Hi Jacques,
>
> Thanks for the comments.
>
> - I do agree DoExchange is a better name!
> - FlightData already has metadata fields as a result of prior
> proposals, so I don't think we need a new message to carry that kind
> of information.
> - I like the suggestion of an async handler to handle incoming
> messages as the fundamental API; it would actually be quite natural to
> implement in Flight/Java. I will note that it's not possible in
> C++/Python without spawning a thread, though. (In essence, gRPC-Java
> is async-always and gRPC-C++ is sync-always.) There are experimental
> C++ APIs that would let us do something similar to Java, but those are
> only in relatively recent gRPC versions and are still under
> development (contrary to the interceptor APIs which have been around
> for quite a while).
>
> Thanks,
> David
>
> On 10/15/19, Jacques Nadeau  wrote:
> > I like it. Added some comments to the doc. Might be worth discussing
> > here depending on your thoughts.
> >
> > On Tue, Oct 15, 2019 at 7:11 AM David Li  wrote:
> >
> >> Hey Ryan,
> >>
> >> Thanks for the comments.
> >>
> >> Concrete example: I've edited the doc to provide a Python strawman.
> >>
> >> Sync vs async: while I don't touch on it, you could interleave uploads
> >> and downloads if you were so inclined. Right now, synchronous APIs
> >> make this error-prone, e.g. if both client and server wait for each
> >> other due to an application logic bug. (gRPC doesn't give us the
> >> ability to have per-read timeouts, only an overall timeout.) As an
> >> example of this happening with DoPut, see ARROW-6063:
> >> https://issues.apache.org/jira/browse/ARROW-6063
> >>
> >> This is mostly tangential though, eventually we will want to design
> >> asynchronous APIs for Flight as a whole. A bidirectional stream like
> >> this (and like DoPut) just makes these pitfalls easier to run into.
> >>
> >> Using DoPut+DoGet: I discussed this in the proposal, but the main
> >> concern is that depending on how you deploy, two separate calls could
> >> get routed to different instances. Additionally, gRPC has some
> >> reconnection behaviors; if the server goes away in between the two
> >> calls, but it then restarts or there is another instance available,
> >> the client will happily reconnect to the new server without warning.
> >>
> >> Thanks,
> >> David
> >>
> >> On 10/15/19, Ryan Murray  wrote:
> >> > Hey David,
> >> >
> >> > I think this proposal makes a lot of sense. I like it and the
> >> > possibility of remote compute via arrow buffers. One thing that would
> >> > help me would be a concrete example of the API in a real-life use
> >> > case. Also, what would the client experience be in terms of sync vs
> >> > async? Would the client block until the bidirectional call returns,
> >> > i.e. c = flight.vector_mult(a, b), or would the client wait to be
> >> > signaled that computation was done? If the latter, how is that
> >> > different from a DoPut then DoGet? I suppose that this could be
> >> > implemented without extending the RPC interface but rather by a
> >> > function/util?
> >> >
> >> >
> >> > Best,
> >> >
> >> > Ryan
> >> >
> >> > On Sun, Oct 13, 2019 at 9:24 PM David Li 
> wrote:
> >> >
> >> >> Hi all,
> >> >>
> >> >> We've been using Flight quite successfully so far, but we have
> >> >> identified a new use case on the horizon: being able to both send and
> >> >> retrieve Arrow data within a single RPC call. To that end, I've
> >> >> written up a proposal for a new RPC method:
> >> >>
> >> >>
> >>
> https://docs.google.com/document/d/1Hh-3Z0hK5PxyEYFxwVxp77jens3yAgC_cpp0TGW-dcw/edit?usp=sharing
> >> >>
> >> >> Please let me know if you can't view or comment on the document. I'd
> >> >> appreciate any feedback; I think this is a relatively straightforward
> >> >> addition - it is essentially "DoPutThenGet".
> >> >>
> >> >> This is a format change and would require a vote. I've decided to
> >> >> table the other format change I had proposed (on DoPut), as it
> doesn't
> >> >> functionally change Flight, just the interpretation of the semantics.
> >> >>
> >> >> Thanks,
> >> >> David
> >> >>
> >> >
> >> >
> >> > --
> >> >
> >> > Ryan Murray  | Principal Consulting Engineer
> >> >
> >> > +447540852009 | rym...@dremio.com
> >> >
> >> > 
> >> > Check out our GitHub 

[jira] [Created] (ARROW-6903) [Python] Wheels broken after ARROW-6860 changes

2019-10-16 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6903:
---

 Summary: [Python] Wheels broken after ARROW-6860 changes
 Key: ARROW-6903
 URL: https://issues.apache.org/jira/browse/ARROW-6903
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Wes McKinney
 Fix For: 1.0.0, 0.15.1


I forgot to handle the .so bundling issues. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [Discuss] Streaming: Differentiate between length of RecordBatch and utilized portion-- common use-case?

2019-10-16 Thread Wes McKinney
hi John,

> As a practical matter, the reason metadata is not a good solution for me is 
> that it requires awareness on the part of the reader.  I want (e.g.) a 
> researcher in Python to be able to map a file of batches in IPC format 
> without needing to worry about the fact that the file was built in a 
> streaming fashion and therefore has some unused array elements.

I don't find this argument to be persuasive.

pyarrow is intended as a developer-facing library, not a user-facing
one. I don't think you should be having the kinds of users you are
describing using pyarrow directly, instead consuming the library
through a layer above it. Specifically, we are deliberately avoiding
doing anything too "smart" or "magical", instead maintaining tight
developer control over what is going on.

- Wes

On Wed, Oct 16, 2019 at 2:18 AM Micah Kornfield  wrote:
>
> Still thinking through the implications here, but to save others from
> having to go search [1] is the PR.
>
> [1] https://github.com/apache/arrow/pull/5663/files
>
> On Tue, Oct 15, 2019 at 1:42 PM John Muehlhausen  wrote:
>
> > A proposal with linked PR now exists in ARROW-5916 and Wes commented that
> > we should kick it around some more.
> >
> > The high-level topic is how Apache Arrow intersects with streaming
> > methodologies:
> >
> > If record batches are strictly immutable, a difficult trade-off is created
> > for streaming data collection: either I can have low-latency presentation
> > of new data by appending very small batches (often 1 row) to the IPC stream
> > and lose columnar layout benefits, or I can have high-latency presentation
> > of new data by waiting to append a batch until it is large enough to gain
> > significant columnar layout benefits.  During this waiting period the new
> > data is unavailable to processing.
> >
> > If, on the other hand, [0,length) of a batch is immutable but length may
> > increase, the trade-off is eliminated: I can pre-allocate a batch and
> > populate records in it when they occur (without waiting), and also gain
> > columnar benefits as each "closed" batch will be large.  (A batch may be
> > practically "closed" before the arrays are full when the projection of
> > variable-length buffer space is wrong... a space/time tradeoff in favor of
> > time.)
> >
> > Looking ahead to a day when the reference implementation(s) will be able to
> > bump RecordBatch.length while populating pre-allocated records
> > in-place, ARROW-5916 reads such batches by ignoring portions of arrays that
> > are beyond RecordBatch.length.
> >
> > If we are not looking ahead to such a day, the discussion is about the
> > alternative way that Arrow will avoid the latency/locality tradeoff
> > inherent in streaming data collection.  Or, if the answer is "streaming
> > apps are and will always be out of scope", that idea needs to be defended
> > from the observation that practitioners are moving more towards the fusion
> > of batch and streaming, not away from it.
> >
> > As a practical matter, the reason metadata is not a good solution for me is
> > that it requires awareness on the part of the reader.  I want (e.g.) a
> > researcher in Python to be able to map a file of batches in IPC format
> > without needing to worry about the fact that the file was built in a
> > streaming fashion and therefore has some unused array elements.
> >
> > The change itself seems relatively simple.  What negative consequences do
> > we anticipate, if any?
> >
> > Thanks,
> > -John
> >
> > On Fri, Jul 5, 2019 at 10:42 AM John Muehlhausen  wrote:
> >
> > > This seems to help... still testing it though.
> > >
> > >   Status GetFieldMetadata(int field_index, ArrayData* out) {
> > >     auto nodes = metadata_->nodes();
> > >     // pop off a field
> > >     if (field_index >= static_cast<int>(nodes->size())) {
> > >       return Status::Invalid("Ran out of field metadata, likely malformed");
> > >     }
> > >     const flatbuf::FieldNode* node = nodes->Get(field_index);
> > >
> > >     // out->length = node->length();
> > >     out->length = metadata_->length();
> > >     out->null_count = node->null_count();
> > >     out->offset = 0;
> > >     return Status::OK();
> > >   }
> > >
> > > On Fri, Jul 5, 2019 at 10:24 AM John Muehlhausen  wrote:
> > >
> > >> So far it seems as if pyarrow is completely ignoring the
> > >> RecordBatch.length field.  More info to follow...
> > >>
> > >> On Tue, Jul 2, 2019 at 3:02 PM John Muehlhausen  wrote:
> > >>
> > >>> Crikey! I'll do some testing around that and suggest some test cases to
> > >>> ensure it continues to work, assuming that it does.
> > >>>
> > >>> -John
> > >>>
> > >>> On Tue, Jul 2, 2019 at 2:41 PM Wes McKinney 
> > wrote:
> > >>>
> >  Thanks for the attachment, it's helpful.
> > 
> >  On Tue, Jul 2, 2019 at 1:40 PM John Muehlhausen  wrote:
> >  >
> >  > Attachments referred to in previous two messages:
> >  >
> > 
> > 

[jira] [Created] (ARROW-6902) [C++] Add String*/Binary* support for Compare kernels

2019-10-16 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-6902:
-

 Summary: [C++] Add String*/Binary* support for Compare kernels
 Key: ARROW-6902
 URL: https://issues.apache.org/jira/browse/ARROW-6902
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Francois Saint-Jacques






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [C++] The quest for zero-dependency builds

2019-10-16 Thread Antoine Pitrou



Perhaps meson is also worth exploring?


On 15/10/2019 at 23:06, Micah Kornfield wrote:

Hi Wes,
I agree on both counts: it won't be done in the short term, and it makes
sense to tackle it incrementally.  Like I said, I don't have much bandwidth
at the moment but might be able to re-arrange a few things on my plate.
I think some people have asked on the mailing list how they might be able
to help; this might be one area that doesn't require a lot of in-depth
knowledge of C++, at least for a proof of concept.  I'll try to open up
some JIRAs soon.

Thanks,
Micah

On Tue, Oct 15, 2019 at 10:33 AM Wes McKinney  wrote:


hi Micah,

Definitely Bazel is worth exploring, but we must be realistic about
the amount of energy (several hundred hours or more) that's been
invested in the build system we have now. So a new build system will
be a large endeavor, but hopefully can make things simpler.

Aside from the requirements gathering process, if it is felt that
Bazel is a possible path forward in the future, it may be good to try
to break up the work into more tractable pieces. For example, a first
step would be to set up Bazel configurations to build the project's
thirdparty toolchain. Since we're reliant on ExternalProject in CMake
to do a lot of heavy lifting there for us, I imagine this (taking care
of what ThirdpartyToolchain.cmake does now) will take up a lot of the
energy.

- Wes

On Sun, Oct 13, 2019 at 1:06 PM Micah Kornfield  wrote:

This might be taking the thread on more of a tangent, but maybe we should
start collecting requirements for the C++ build system in general and see
if there might be a better solution that can address some of these concerns?

In particular, Bazel at least on the surface seems like it might be a
better fit for some of the use cases discussed here.  I know this is a big
project (and I currently don't have much bandwidth for it), but I think if
CMake is lacking in these areas it might be worth at least exploring
instead of going down the path of building our own meta-build system on
top of CMake.

Requirements that I think we are targeting:
1.  Be able to provide an out-of-the-box build system that requires as close
to zero dependencies beyond a standard C++ toolchain (e.g. "$BUILD minimal"
works on any C++ developer's desktop without additional requirements)
2.  The build system should limit configuration knobs in favor of implied
dependencies (e.g. "$BUILD python" automatically builds "compute",
"filesystem", "ipc")
3.  The build system should be configurable to use (and have the user
specify) one of "System packages", "Conda packages", or source packages for
providing dependencies (and fallback options between the three).
4.  The build system should be able to treat some dependencies as optional
(e.g. different compression libraries or allocators).
5.  Easily allow developers to limit building unnecessary code for their
particular task at hand.
6.  The build system must work across the following toolchains/platforms:
 - Linux:  g++ and clang.  x86 and ARM
 - Mac
 - Windows (msys2 and MSVC)

Thanks,
Micah


On Thu, Oct 10, 2019 at 6:09 AM Antoine Pitrou  wrote:

Yes, we could express dependencies in a Python script and have it
generate a CMake module of if/else chains in cmake_modules (which we
would check in git to avoid having people depend on a Python install,
perhaps).

Still, that is an additional maintenance burden.

Regards

Antoine.


On 10/10/2019 at 14:50, Wes McKinney wrote:

I guess one question we should first discuss is: who is the C++ build
system for?

The users who are most sensitive to benchmark-driven decision making
will generally be consuming the project through pre-built binaries,
like our Python or R packages. If C++ developers build the project
from source and don't do a minimal read of the documentation to see
what a "recommended configuration" looks like, I would say that is
more their fault than ours. In the case of the ARROW_JEMALLOC option,
I think it's important for C++ system integrators to be aware of the
impact of the choice of memory allocator.

The concern I have with the current "out of the box" experience is
that people are getting the impression that "I have to build $X, $Y,
and $Z -- which I don't necessarily need -- to have $CORE_FEATURE_1".
They can, of course, read the documentation and learn that those
things can be toggled off, but I think the user that reaches for a
self-built source install is much different in general than someone
who uses the project through the Linux binary packages, for example.

On the subject of managing intraproject dependencies and
relationships, I think we should develop a better way to express
relationships between components than we have now.

As an example, building the Python library assumes that various
components are enabled

- ARROW_COMPUTE=ON
- ARROW_FILESYSTEM=ON
- ARROW_IPC=ON

Somewhere in the code we might have some code like

if (ARROW_PYTHON)
   

Arrow sync call October 16 at 12:00 US/Eastern, 16:00 UTC

2019-10-16 Thread Neal Richardson
Hi all, our biweekly call is coming up in a couple of hours at
https://meet.google.com/vtm-teks-phx. All are welcome to join. Notes
will be sent out to the mailing list afterwards.

Neal


[NIGHTLY] Arrow Build Report for Job nightly-2019-10-16-0

2019-10-16 Thread Crossbow


Arrow Build Report for Job nightly-2019-10-16-0

All tasks: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0

Failed Tasks:
- wheel-manylinux1-cp27mu:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-travis-wheel-manylinux1-cp27mu
- wheel-win-cp37m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-appveyor-wheel-win-cp37m
- docker-clang-format:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-circle-docker-clang-format
- wheel-manylinux1-cp37m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-travis-wheel-manylinux1-cp37m
- gandiva-jar-osx:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-travis-gandiva-jar-osx
- wheel-osx-cp35m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-travis-wheel-osx-cp35m
- ubuntu-disco:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-azure-ubuntu-disco
- debian-stretch:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-azure-debian-stretch
- wheel-manylinux2010-cp27mu:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-travis-wheel-manylinux2010-cp27mu
- wheel-win-cp36m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-appveyor-wheel-win-cp36m
- wheel-osx-cp37m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-travis-wheel-osx-cp37m
- debian-buster:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-azure-debian-buster
- wheel-manylinux1-cp35m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-travis-wheel-manylinux1-cp35m
- wheel-manylinux2010-cp35m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-travis-wheel-manylinux2010-cp35m
- homebrew-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-travis-homebrew-cpp
- gandiva-jar-trusty:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-travis-gandiva-jar-trusty
- wheel-manylinux2010-cp37m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-travis-wheel-manylinux2010-cp37m
- wheel-manylinux2010-cp36m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-travis-wheel-manylinux2010-cp36m
- wheel-osx-cp27m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-travis-wheel-osx-cp27m
- wheel-osx-cp36m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-travis-wheel-osx-cp36m
- conda-linux-gcc-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-azure-conda-linux-gcc-py37
- wheel-manylinux1-cp36m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-travis-wheel-manylinux1-cp36m
- conda-linux-gcc-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-azure-conda-linux-gcc-py36
- conda-linux-gcc-py27:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-azure-conda-linux-gcc-py27

Succeeded Tasks:
- docker-dask-integration:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-circle-docker-dask-integration
- centos-7:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-azure-centos-7
- docker-python-2.7:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-circle-docker-python-2.7
- docker-spark-integration:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-circle-docker-spark-integration
- docker-turbodbc-integration:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-circle-docker-turbodbc-integration
- docker-cpp-release:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-circle-docker-cpp-release
- wheel-manylinux2010-cp27m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-travis-wheel-manylinux2010-cp27m
- docker-cpp-cmake32:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-circle-docker-cpp-cmake32
- conda-osx-clang-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-azure-conda-osx-clang-py37
- docker-pandas-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-circle-docker-pandas-master
- conda-osx-clang-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-16-0-azure-conda-osx-clang-py36
- docker-docs:
  URL: 

[jira] [Created] (ARROW-6901) [Rust][Parquet] Rust Parquet SerializedFileWriter writes total_num_rows as zero

2019-10-16 Thread Matthew Franglen (Jira)
Matthew Franglen created ARROW-6901:
---

 Summary: [Rust][Parquet] Rust Parquet SerializedFileWriter writes 
total_num_rows as zero
 Key: ARROW-6901
 URL: https://issues.apache.org/jira/browse/ARROW-6901
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Affects Versions: 0.15.0, 0.14.1
Reporter: Matthew Franglen


The SerializedFileWriter does not update total_num_rows at any point. This 
results in consistently writing zero as the number of rows in the file.

 

This code will fail:
{code:java}
let data = vec![vec![1, 2, 3, 4, 5]];
let file = ...; // a file path here
let schema = Rc::new(
    types::Type::group_type_builder("schema")
        .with_fields(&mut vec![Rc::new(
            types::Type::primitive_type_builder("col1", Type::INT32)
                .with_repetition(Repetition::REQUIRED)
                .build()
                .unwrap(),
        )])
        .build()
        .unwrap(),
);
let props = Rc::new(WriterProperties::builder().build());
let mut file_writer =
    SerializedFileWriter::new(file.try_clone().unwrap(), schema, props).unwrap();
let mut rows: i64 = 0;

for subset in &data {
    let mut row_group_writer = file_writer.next_row_group().unwrap();
    let col_writer = row_group_writer.next_column().unwrap();
    if let Some(mut writer) = col_writer {
        match writer {
            ColumnWriter::Int32ColumnWriter(ref mut typed) => {
                rows += typed.write_batch(&subset[..], None, None).unwrap() as i64;
            }
            _ => {
                unimplemented!();
            }
        }
        row_group_writer.close_column(writer).unwrap();
    }
    file_writer.close_row_group(row_group_writer).unwrap();
}

file_writer.close().unwrap();

let reader = SerializedFileReader::new(file).unwrap();
assert_eq!(reader.num_row_groups(), data.len());
assert_eq!(
    reader.metadata().file_metadata().num_rows(),
    rows,
    "row count in metadata not equal to number of rows written"
);
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6900) PyArrow cant serialize pandas IntegerArray

2019-10-16 Thread Sayed Mohammad Hossein Torabi (Jira)
Sayed Mohammad Hossein Torabi created ARROW-6900:


 Summary: PyArrow cant serialize pandas IntegerArray
 Key: ARROW-6900
 URL: https://issues.apache.org/jira/browse/ARROW-6900
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.15.0
 Environment: python 3.7
Reporter: Sayed Mohammad Hossein Torabi


PyArrow can't serialize a pandas `IntegerArray` and returns the error below:
{code:python}
SerializationCallbackError: pyarrow does not know how to serialize objects of
type <class 'pandas.core.arrays.integer.IntegerArray'>.
{code}
To reproduce this bug, run the following code:
{code:python}
import pandas as pd 
import pyarrow as pa
from pandas.core.arrays.integer import IntegerArray

int_array = pd.array([1, None, 3], dtype=pd.Int32Dtype())
pa.default_serialization_context().serialize(int_array).to_buffer().to_pybytes()
{code}
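
As a possible stopgap until this is supported natively (only a sketch, and it
assumes a plain pickle round-trip of the array is acceptable), the type can be
registered on the serialization context with the pickle fallback:
{code:python}
import pandas as pd
import pyarrow as pa
from pandas.core.arrays.integer import IntegerArray

context = pa.default_serialization_context()
# Register IntegerArray with the pickle fallback until native support exists.
context.register_type(IntegerArray, 'pandas.IntegerArray', pickle=True)

int_array = pd.array([1, None, 3], dtype=pd.Int32Dtype())
buf = context.serialize(int_array).to_buffer()
roundtripped = context.deserialize(buf)  # IntegerArray again
{code}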



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6899) to_pandas() not implemented on list

2019-10-16 Thread Razvan Chitu (Jira)
Razvan Chitu created ARROW-6899:
---

 Summary: to_pandas() not implemented on 
list
 Key: ARROW-6899
 URL: https://issues.apache.org/jira/browse/ARROW-6899
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.15.0, 0.13.0
Reporter: Razvan Chitu
 Attachments: encoded.arrow

Hi,

{{pyarrow.Table.to_pandas()}} fails on an Arrow List Vector where the data 
vector is of type "dictionary encoded string". Here is the table schema as 
printed by pyarrow:
{code:java}
pyarrow.Table
encodedList: list<$data$: dictionary 
not null> not null
  child 0, $data$: dictionary not null
metadata

OrderedDict() {code}
and the data (also attached in a file to this ticket)
{code:java}

[
  [

-- dictionary:
  [
"a",
"b",
"c",
"d"
  ]
-- indices:
  [
0,
1,
2
  ],

-- dictionary:
  [
"a",
"b",
"c",
"d"
  ]
-- indices:
  [
0,
3
  ]
  ]
] {code}
and the exception I got
{code:java}
---
ArrowNotImplementedError  Traceback (most recent call last)
 in 
> 1 df.to_pandas()

~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/array.pxi
 in pyarrow.lib._PandasConvertible.to_pandas()

~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/table.pxi
 in pyarrow.lib.Table._to_pandas()

~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/pandas_compat.py
 in table_to_blockmanager(options, table, categories, ignore_metadata)
700 
701 _check_data_column_metadata_consistency(all_columns)
--> 702 blocks = _table_to_blocks(options, table, categories)
703 columns = _deserialize_column_index(table, all_columns, 
column_indexes)
704 

~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/pandas_compat.py
 in _table_to_blocks(options, block_table, categories)
972 
973 # Convert an arrow table to Block from the internal pandas API
--> 974 result = pa.lib.table_to_blocks(options, block_table, categories)
975 
976 # Defined above

~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/table.pxi
 in pyarrow.lib.table_to_blocks()

~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/error.pxi
 in pyarrow.lib.check_status()

ArrowNotImplementedError: Not implemented type for list in DataFrameBlock: 
dictionary {code}
Note that the data vector itself can be loaded successfully by to_pandas.

It'd be great if this could be addressed in the next version of pyarrow. For
now, is there anything I can do on my end to bypass this unimplemented
conversion?

Thanks,

Razvan



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [DISCUSS] Dictionary Encoding Clarifications/Future Proofing

2019-10-16 Thread Micah Kornfield
I'll plan on starting a vote in the next day or two if there are no further
objections/comments.

On Sun, Oct 13, 2019 at 11:06 AM Micah Kornfield 
wrote:

> I think the only point asked on the PR that I think is worth discussing is
> assumptions about dictionaries at the beginning of streams.
>
> There are two options:
> 1.  Based on the current wording, it does not seem that all dictionaries
> need to be at the beginning of the stream if they aren't made use of in the
> first record batch (i.e. a dictionary encoded column is all null in the
> first record batch).
> 2.  We require a dictionary batch for each dictionary at the beginning of
> the stream (and require implementations to send an empty batch if they
> don't have the dictionary available).
>
> The current proposal in the PR is option #1.
>
> Thanks,
> Micah
>
> On Sat, Oct 5, 2019 at 4:01 PM Micah Kornfield 
> wrote:
>
>> I've opened a pull request [1] to clarify some recent conversations about
>> semantics/edge cases for dictionary encoding [2][3] around interleaved
>> batches and when isDelta=False.
>>
>> Specifically, it proposes isDelta=False indicates dictionary
>> replacement.  For the file format, only one isDelta=False batch is allowed
>> per file and isDelta=true batches are applied in the order supplied in the
>> file footer.
>>
>> In addition, I've added a new enum to DictionaryEncoding to preserve
>> future compatibility in case we want to expand dictionary encoding to be an
>> explicit mapping from "ID" to "VALUE" as discussed in [4].
>>
>> Once people have had a chance to review and come to a consensus, I will
>> call a formal vote to approve and commit the change.
>>
>> Thanks,
>> Micah
>>
>> [1] https://github.com/apache/arrow/pull/5585
>> [2]
>> https://lists.apache.org/thread.html/9734b71bc12aca16eb997388e95105bff412fdaefa4e19422f477389@%3Cdev.arrow.apache.org%3E
>> [3]
>> https://lists.apache.org/thread.html/5c3c9346101df8d758e24664638e8ada0211d310ab756a89cde3786a@%3Cdev.arrow.apache.org%3E
>> [4]
>> https://lists.apache.org/thread.html/15a4810589b2eb772bce5b2372970d9d93badbd28999a1bbe2af418a@%3Cdev.arrow.apache.org%3E
>>
>>


Re: [Discuss] Streaming: Differentiate between length of RecordBatch and utilized portion-- common use-case?

2019-10-16 Thread Micah Kornfield
Still thinking through the implications here, but to save others from
having to go search, [1] is the PR.

[1] https://github.com/apache/arrow/pull/5663/files
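
For the read-side use case John describes below (a researcher simply
memory-mapping a file of batches), a minimal pyarrow sketch could look like
the following; the file name is hypothetical, and reporting only the utilized
rows assumes the behavior in the linked PR:

import pyarrow as pa

# Hypothetical file produced by a writer that pre-allocates record batches
# and bumps RecordBatch.length as rows are populated in place.
source = pa.memory_map('/tmp/preallocated.arrow', 'r')
reader = pa.ipc.open_file(source)
for i in range(reader.num_record_batches):
    batch = reader.get_batch(i)
    # With ARROW-5916, num_rows reflects RecordBatch.length, i.e. only the
    # utilized portion of the pre-allocated arrays is exposed to the reader.
    print(batch.num_rows)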

On Tue, Oct 15, 2019 at 1:42 PM John Muehlhausen  wrote:

> A proposal with linked PR now exists in ARROW-5916 and Wes commented that
> we should kick it around some more.
>
> The high-level topic is how Apache Arrow intersects with streaming
> methodologies:
>
> If record batches are strictly immutable, a difficult trade-off is created
> for streaming data collection: either I can have low-latency presentation
> of new data by appending very small batches (often 1 row) to the IPC stream
> and lose columnar layout benefits, or I can have high-latency presentation
> of new data by waiting to append a batch until it is large enough to gain
> significant columnar layout benefits.  During this waiting period the new
> data is unavailable to processing.
>
> If, on the other hand, [0,length) of a batch is immutable but length may
> increase, the trade-off is eliminated: I can pre-allocate a batch and
> populate records in it when they occur (without waiting), and also gain
> columnar benefits as each "closed" batch will be large.  (A batch may be
> practically "closed" before the arrays are full when the projection of
> variable-length buffer space is wrong... a space/time tradeoff in favor of
> time.)
>
> Looking ahead to a day when the reference implementation(s) will be able to
> bump RecordBatch.length while populating pre-allocated records
> in-place, ARROW-5916 reads such batches by ignoring portions of arrays that
> are beyond RecordBatch.length.
>
> If we are not looking ahead to such a day, the discussion is about the
> alternative way that Arrow will avoid the latency/locality tradeoff
> inherent in streaming data collection.  Or, if the answer is "streaming
> apps are and will always be out of scope", that idea needs to be defended
> from the observation that practitioners are moving more towards the fusion
> of batch and streaming, not away from it.
>
> As a practical matter, the reason metadata is not a good solution for me is
> that it requires awareness on the part of the reader.  I want (e.g.) a
> researcher in Python to be able to map a file of batches in IPC format
> without needing to worry about the fact that the file was built in a
> streaming fashion and therefore has some unused array elements.
>
> The change itself seems relatively simple.  What negative consequences do
> we anticipate, if any?
>
> Thanks,
> -John
>
> On Fri, Jul 5, 2019 at 10:42 AM John Muehlhausen  wrote:
>
> > This seems to help... still testing it though.
> >
> >   Status GetFieldMetadata(int field_index, ArrayData* out) {
> > auto nodes = metadata_->nodes();
> > // pop off a field
> > if (field_index >= static_cast(nodes->size())) {
> >   return Status::Invalid("Ran out of field metadata, likely
> > malformed");
> > }
> > const flatbuf::FieldNode* node = nodes->Get(field_index);
> >
> > *//out->length = node->length();*
> > *out->length = metadata_->length();*
> > out->null_count = node->null_count();
> > out->offset = 0;
> > return Status::OK();
> >   }
> >
> > On Fri, Jul 5, 2019 at 10:24 AM John Muehlhausen  wrote:
> >
> >> So far it seems as if pyarrow is completely ignoring the
> >> RecordBatch.length field.  More info to follow...
> >>
> >> On Tue, Jul 2, 2019 at 3:02 PM John Muehlhausen  wrote:
> >>
> >>> Crikey! I'll do some testing around that and suggest some test cases to
> >>> ensure it continues to work, assuming that it does.
> >>>
> >>> -John
> >>>
> >>> On Tue, Jul 2, 2019 at 2:41 PM Wes McKinney 
> wrote:
> >>>
>  Thanks for the attachment, it's helpful.
> 
>  On Tue, Jul 2, 2019 at 1:40 PM John Muehlhausen  wrote:
>  >
>  > Attachments referred to in previous two messages:
>  >
> 
> https://www.dropbox.com/sh/6ycfuivrx70q2jx/AAAt-RDaZWmQ2VqlM-0s6TqWa?dl=0
>  >
>  > On Tue, Jul 2, 2019 at 1:14 PM John Muehlhausen 
> wrote:
>  >
>  > > Thanks, Wes, for the thoughtful reply.  I really appreciate the
>  > > engagement.  In order to clarify things a bit, I am attaching a
>  graphic of
>  > > how our application will take record-wise (row-oriented) data from
>  an event
>  > > source and incrementally populate a pre-allocated Arrow-compatible
>  buffer,
>  > > including for variable-length fields.  (Obviously at this stage I
>  am not
>  > > using the reference implementation Arrow code, although that would
>  be a
>  > > goal to contribute that back to the project.)
>  > >
>  > > For sake of simplicity these are non-nullable fields.  As a
> result a
>  > > reader of "y" that has no knowledge of the "utilized" metadata
>  would get a
>  > > long string (zeros, spaces, uninitialized, or whatever we decide
>  for the
>  > > pre-allocation model) for the record just beyond the last utilized

[jira] [Created] (ARROW-6898) [Java] Fix potential memory leak in ArrowWriter and several test classes

2019-10-16 Thread Ji Liu (Jira)
Ji Liu created ARROW-6898:
-

 Summary: [Java] Fix potential memory leak in ArrowWriter and 
several test classes
 Key: ARROW-6898
 URL: https://issues.apache.org/jira/browse/ARROW-6898
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


ARROW-6040 fixed the problem that dictionary entries were required in IPC
streams even when empty, by only writing dictionaries when there is at least
one batch. As a result, if we write an empty stream and invoke ArrowWriter#close,
the dictionaries are never closed, leading to a memory leak (they are normally
closed after the write operation). This is hard to debug; the problem was found
by {{TestArrowReaderWriter#testEmptyStreamInStreamingIPC}} when I tried to close
the allocator after the test.

 

Besides, several test classes have potential memory leaks because they do not
close allocators/vectors/buffers, etc.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)