change in pyarrow scalar equality?

2020-08-05 Thread Bryan Cutler
Hi all,

I came across a behavior change when comparing array scalar values with
Python objects. This used to work in 0.17.1 and before, but in 1.0.0 the
equality check always returns False. I saw there was a previous discussion on
Python equality semantics, but I'm not sure whether the behavior I'm seeing
matches that conclusion. For example:

In [4]: a = pa.array([1,2,3])


In [5]: a[0] == 1

Out[5]: False

In [6]: a[0].as_py() == 1

Out[6]: True

I know the scalars can be converted with `as_py()`, but it does seem a
little strange to return False when compared with a Python object. Is this
the expected behavior for 1.0.0+?

Thanks,
Bryan
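
A minimal sketch of the workaround mentioned above (assuming the pyarrow
1.0.0 semantics as described in this thread):

import pyarrow as pa

a = pa.array([1, 2, 3])

print(a[0] == 1)          # False in 1.0.0 (True in 0.17.1 and earlier)
print(a[0].as_py() == 1)  # True: convert the scalar to a Python object first

# Element-wise comparison through as_py() as a simple (unoptimized) workaround
print([v.as_py() == 1 for v in a])  # [True, False, False]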


Re: [DISCUSS] Support of higher bit-width Decimal type

2020-08-05 Thread Micah Kornfield
>
> Sounds fine to me. I guess one question is what needs to be formalized
> in the Schema.fbs files or elsewhere in the columnar format
> documentation (and we will need to hold an associated vote for that I
> think)

Yes, I think we will need to hold a vote for it.  Since this is essentially
a "new type", I was planning to develop the Schema/documentation/code on a
branch and then hold a vote to merge it to master.

On Wed, Aug 5, 2020 at 12:00 PM Wes McKinney  wrote:

> Sounds fine to me. I guess one question is what needs to be formalized
> in the Schema.fbs files or elsewhere in the columnar format
> documentation (and we will need to hold an associated vote for that I
> think)
>
> On Mon, Aug 3, 2020 at 11:30 PM Micah Kornfield 
> wrote:
> >
> > Given no objections, we'll go ahead and start implementing support for
> 256-bit decimals.
> >
> > I'm considering setting up another branch to develop all the components
> so they can be merged to master atomically.
> >
> > Thanks,
> > Micah
> >
> > On Tue, Jul 28, 2020 at 6:39 AM Wes McKinney 
> wrote:
> >>
> >> Generally this sounds fine to me. At some point it would be good to
> >> add 32-bit and 64-bit decimal support but this can be done in the
> >> future.
> >>
> >> On Tue, Jul 28, 2020 at 7:28 AM Fan Liya  wrote:
> >> >
> >> > Hi Micah,
> >> >
> >> > Thanks for opening the discussion.
> >> > I am aware of some scenarios where decimal requires more than 16
> bytes, so
> >> > I think it would be beneficial to support this in Arrow.
> >> >
> >> > Best,
> >> > Liya Fan
> >> >
> >> >
> >> > On Tue, Jul 28, 2020 at 11:12 AM Micah Kornfield <
> emkornfi...@gmail.com>
> >> > wrote:
> >> >
> >> > > Hi Arrow Dev,
> >> > > ZetaSQL (Google's open source standard SQL library) recently introduced a
> >> > > BigNumeric [1] type which requires a 256-bit width to properly support it.
> >> > > I'd like (possibly in collaboration with some of my colleagues) to add
> >> > > support for 256-bit-width Decimals in Arrow, to support a type
> >> > > corresponding to BigNumeric.
> >> > >
> >> > > In past discussions on this, I don't think we established a minimum
> bar for
> >> > > supporting additional bit-widths within Arrow.
> >> > >
> >> > > I'd like to propose the following requirements:
> >> > > 1.  A vote agreeing on adding support for a new bitwidth (we can
> discuss
> >> > > any objections here).
> >> > > 2.  Support in Java and C++ for integration tests verifying the
> ability to
> >> > > round-trip the value.
> >> > > 3.  Support in Java for conversion to/from BigDecimal [2]
> >> > > 4.  Support in Python converting to/from Decimal [3]
> >> > >
> >> > > Is there anything else that people feel like is a requirement for
> basic
> >> > > support of an additional bit width for Decimal's?
> >> > >
> >> > > Thanks,
> >> > > Micah
> >> > >
> >> > >
> >> > > [1]
> >> > >
> >> > >
> https://github.com/google/zetasql/blob/1aefaa7c62fc7a50def879bb7c4225ec6974b7ef/zetasql/public/numeric_value.h#L486
> >> > > [2]
> https://docs.oracle.com/javase/7/docs/api/java/math/BigDecimal.html
> >> > > [3] https://docs.python.org/3/library/decimal.html
> >> > >
>
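
As a point of reference for requirement 4 above, the existing 128-bit decimal
type already round-trips Python Decimal values; a minimal sketch (assuming
pyarrow >= 1.0.0), with a future 256-bit type presumably following the same
pattern:

from decimal import Decimal
import pyarrow as pa

# Round-trip through the existing 128-bit decimal type; the proposal would
# add an analogous wider type for values that need up to 256 bits.
ty = pa.decimal128(precision=38, scale=10)
arr = pa.array([Decimal("1234567890.0123456789"), None], type=ty)
print(arr.to_pylist())  # [Decimal('1234567890.0123456789'), None]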


Re: [DISCUSS][C++] Group by operation for RecordBatch and Table

2020-08-05 Thread Wes McKinney
I see there's a bunch of additional aggregation code in Dremio that
might serve as inspiration (some of which is related to distributed
aggregation, so may not be relevant):

https://github.com/dremio/dremio-oss/tree/master/sabot/kernel/src/main/java/com/dremio/sabot/op/aggregate

Maybe Andy or one of the other active Rust DataFusion developers can
comment on the approach taken for hash aggregations there.

On Wed, Aug 5, 2020 at 1:52 PM Wes McKinney  wrote:
>
> hi Kenta,
>
> Yes, I think it only makes sense to implement this in the context of
> the query engine project. Here's a list of assorted thoughts about it:
>
> * I have been mentally planning to follow the Vectorwise-type query
> engine architecture that's discussed in [1] [2] and many other
> academic papers. I believe this is how some other current generation
> open source columnar query engines work, such as Dremio [3] and DuckDB
> [4][5].
> * Hash (aka "group") aggregations need to be able to process arbitrary
> expressions, not only a plain input column. So it's not enough to be
> able to compute "sum(x) group by y" where "x" and "y" are fields in a
> RecordBatch, we need to be able to compute "$AGG_FUNC($EXPR) GROUP BY
> $GROUP_EXPR_1, $GROUP_EXPR_2, ..." where $EXPR / $GROUP_EXPR_1 / ...
> are any column expressions computed from the input relations (keep in
> mind that an aggregation could apply to a stream of record batches
> produced by a join). In any case, expression evaluation is a
> closely-related task and should be implemented ASAP.
> * Hash aggregation functions themselves should probably be introduced
> as a new Function type in arrow::compute. I don't think it would be
> appropriate to use the existing "SCALAR_AGGREGATE" functions, instead
> we should introduce a new HASH_AGGREGATE function type that accepts
> input data to be aggregated along with an array of pre-computed bucket
> ids (which are computed by probing the HT). So rather than
> Update(state, args) like we have for scalar aggregate, the primary
> interface for group aggregation is Update(state, bucket_ids, args)
> * The HashAggregation operator should be able to process an arbitrary
> iterator of record batches
> * We will probably want to adapt an existing or implement a new
> concurrent hash table so that aggregations can be performed in
> parallel without requiring a post-aggregation merge step
> * There's some general support machinery for hashing multiple fields
> and then doing efficient vectorized hash table probes (to assign
> aggregation bucket id's to each row position)
>
> I think it is worth investing the effort to build something that is
> reasonably consistent with the "state of the art" in database systems
> (at least according to what we are able to build with our current
> resources) rather than building something more crude that has to be
> replaced with a new implementation later.
>
> I'd like to help personally with this work (particularly since the
> natural next step with my recent work in arrow/compute is to implement
> expression evaluation) but I won't have significant bandwidth for it
> until later this month or early September. If someone feels that they
> sufficiently understand the state of the art for this type of workload
> and wants to help with laying down the abstract C++ APIs for
> Volcano-style query execution and an implementation of hash
> aggregation, that sounds great.
>
> Thanks,
> Wes
>
> [1]: https://www.vldb.org/pvldb/vol11/p2209-kersten.pdf
> [2]: https://github.com/TimoKersten/db-engine-paradigms
> [3]: 
> https://github.com/dremio/dremio-oss/tree/master/sabot/kernel/src/main/java/com/dremio/sabot/op/aggregate/hash
> [4]: 
> https://github.com/cwida/duckdb/blob/master/src/include/duckdb/execution/aggregate_hashtable.hpp
> [5]: 
> https://github.com/cwida/duckdb/blob/master/src/execution/aggregate_hashtable.cpp
>
> On Wed, Aug 5, 2020 at 10:23 AM Kenta Murata  wrote:
> >
> > Hi folks,
> >
> > Red Arrow, the Ruby binding of Arrow GLib, implements grouped aggregation
> > features for RecordBatch and Table.  Because these features are written in
> > Ruby, they are too slow for large size data.  We need to make them much
> > faster.
> >
> > To improve their calculation speed, they should be written in C++, and
> > should be put in Arrow C++ instead of Red Arrow.
> >
> > Is anyone working on implementing group-by operation for RecordBatch and
> > Table in Arrow C++?  If no one has worked on it, I would like to try it.
> >
> > By the way, I found that the grouped aggregation feature is mentioned in
> > the design document of Arrow C++ Query Engine.  Is Query Engine, not Arrow
> > C++ Core, a suitable location to implement group-by operation?
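
A rough sketch of the shape described above, in plain Python/NumPy pseudocode
(illustrative only, not the arrow::compute API; the function names are
hypothetical): group keys are probed into a hash table to assign per-row
bucket ids, and the aggregate state is then updated as
Update(state, bucket_ids, args).

import numpy as np

def probe_bucket_ids(keys, hash_table):
    # Vectorized hash-table probe in the real design; a loop suffices here.
    return np.array([hash_table.setdefault(k, len(hash_table)) for k in keys])

def grouped_sum_update(state, bucket_ids, values):
    # Update(state, bucket_ids, args) for a "sum" hash aggregate.
    np.add.at(state, bucket_ids, values)  # state[b] += v for each row

hash_table = {}
state = np.zeros(16)  # assume at most 16 groups for this toy example

# "sum(x) GROUP BY y" over a stream of batches (here, plain Python lists).
batches = [(["a", "b", "a"], [1.0, 2.0, 3.0]),
           (["b", "c"], [4.0, 5.0])]
for y, x in batches:
    bucket_ids = probe_bucket_ids(y, hash_table)
    grouped_sum_update(state, bucket_ids, np.asarray(x))

print({k: state[b] for k, b in hash_table.items()})  # {'a': 4.0, 'b': 6.0, 'c': 5.0}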


Re: [DISCUSS] Support of higher bit-width Decimal type

2020-08-05 Thread Wes McKinney
Sounds fine to me. I guess one question is what needs to be formalized
in the Schema.fbs files or elsewhere in the columnar format
documentation (and we will need to hold an associated vote for that I
think)

On Mon, Aug 3, 2020 at 11:30 PM Micah Kornfield  wrote:
>
> Given no objections, we'll go ahead and start implementing support for 
> 256-bit decimals.
>
> I'm considering setting up another branch to develop all the components so 
> they can be merged to master atomically.
>
> Thanks,
> Micah
>
> On Tue, Jul 28, 2020 at 6:39 AM Wes McKinney  wrote:
>>
>> Generally this sounds fine to me. At some point it would be good to
>> add 32-bit and 64-bit decimal support but this can be done in the
>> future.
>>
>> On Tue, Jul 28, 2020 at 7:28 AM Fan Liya  wrote:
>> >
>> > Hi Micah,
>> >
>> > Thanks for opening the discussion.
>> > I am aware of some scenarios where decimal requires more than 16 bytes, so
>> > I think it would be beneficial to support this in Arrow.
>> >
>> > Best,
>> > Liya Fan
>> >
>> >
>> > On Tue, Jul 28, 2020 at 11:12 AM Micah Kornfield 
>> > wrote:
>> >
>> > > Hi Arrow Dev,
>> > > ZetaSQL (Google's open source standard SQL library) recently introduced a
>> > > BigNumeric [1] type which requires a 256-bit width to properly support it.
>> > > I'd like (possibly in collaboration with some of my colleagues) to add
>> > > support for 256-bit-width Decimals in Arrow, to support a type
>> > > corresponding to BigNumeric.
>> > >
>> > > In past discussions on this, I don't think we established a minimum bar 
>> > > for
>> > > supporting additional bit-widths within Arrow.
>> > >
>> > > I'd like to propose the following requirements:
>> > > 1.  A vote agreeing on adding support for a new bitwidth (we can discuss
>> > > any objections here).
>> > > 2.  Support in Java and C++ for integration tests verifying the ability 
>> > > to
>> > > round-trip the value.
>> > > 3.  Support in Java for conversion to/from BigDecimal [2]
>> > > 4.  Support in Python converting to/from Decimal [3]
>> > >
>> > > Is there anything else that people feel like is a requirement for basic
>> > > support of an additional bit width for Decimal's?
>> > >
>> > > Thanks,
>> > > Micah
>> > >
>> > >
>> > > [1]
>> > >
>> > > https://github.com/google/zetasql/blob/1aefaa7c62fc7a50def879bb7c4225ec6974b7ef/zetasql/public/numeric_value.h#L486
>> > > [2] https://docs.oracle.com/javase/7/docs/api/java/math/BigDecimal.html
>> > > [3] https://docs.python.org/3/library/decimal.html
>> > >


Re: [DISCUSS][C++] Group by operation for RecordBatch and Table

2020-08-05 Thread Wes McKinney
hi Kenta,

Yes, I think it only makes sense to implement this in the context of
the query engine project. Here's a list of assorted thoughts about it:

* I have been mentally planning to follow the Vectorwise-type query
engine architecture that's discussed in [1] [2] and many other
academic papers. I believe this is how some other current generation
open source columnar query engines work, such as Dremio [3] and DuckDB
[4][5].
* Hash (aka "group") aggregations need to be able to process arbitrary
expressions, not only a plain input column. So it's not enough to be
able to compute "sum(x) group by y" where "x" and "y" are fields in a
RecordBatch, we need to be able to compute "$AGG_FUNC($EXPR) GROUP BY
$GROUP_EXPR_1, $GROUP_EXPR_2, ..." where $EXPR / $GROUP_EXPR_1 / ...
are any column expressions computed from the input relations (keep in
mind that an aggregation could apply to a stream of record batches
produced by a join). In any case, expression evaluation is a
closely-related task and should be implemented ASAP.
* Hash aggregation functions themselves should probably be introduced
as a new Function type in arrow::compute. I don't think it would be
appropriate to use the existing "SCALAR_AGGREGATE" functions, instead
we should introduce a new HASH_AGGREGATE function type that accepts
input data to be aggregated along with an array of pre-computed bucket
ids (which are computed by probing the HT). So rather than
Update(state, args) like we have for scalar aggregate, the primary
interface for group aggregation is Update(state, bucket_ids, args)
* The HashAggregation operator should be able to process an arbitrary
iterator of record batches
* We will probably want to adapt an existing or implement a new
concurrent hash table so that aggregations can be performed in
parallel without requiring a post-aggregation merge step
* There's some general support machinery for hashing multiple fields
and then doing efficient vectorized hash table probes (to assign
aggregation bucket id's to each row position)

I think it is worth investing the effort to build something that is
reasonably consistent with the "state of the art" in database systems
(at least according to what we are able to build with our current
resources) rather than building something more crude that has to be
replaced with a new implementation later.

I'd like to help personally with this work (particularly since the
natural next step with my recent work in arrow/compute is to implement
expression evaluation) but I won't have significant bandwidth for it
until later this month or early September. If someone feels that they
sufficiently understand the state of the art for this type of workload
and wants to help with laying down the abstract C++ APIs for
Volcano-style query execution and an implementation of hash
aggregation, that sounds great.

Thanks,
Wes

[1]: https://www.vldb.org/pvldb/vol11/p2209-kersten.pdf
[2]: https://github.com/TimoKersten/db-engine-paradigms
[3]: 
https://github.com/dremio/dremio-oss/tree/master/sabot/kernel/src/main/java/com/dremio/sabot/op/aggregate/hash
[4]: 
https://github.com/cwida/duckdb/blob/master/src/include/duckdb/execution/aggregate_hashtable.hpp
[5]: 
https://github.com/cwida/duckdb/blob/master/src/execution/aggregate_hashtable.cpp

On Wed, Aug 5, 2020 at 10:23 AM Kenta Murata  wrote:
>
> Hi folks,
>
> Red Arrow, the Ruby binding of Arrow GLib, implements grouped aggregation
> features for RecordBatch and Table.  Because these features are written in
> Ruby, they are too slow for large size data.  We need to make them much
> faster.
>
> To improve their calculation speed, they should be written in C++, and
> should be put in Arrow C++ instead of Red Arrow.
>
> Is anyone working on implementing group-by operation for RecordBatch and
> Table in Arrow C++?  If no one has worked on it, I would like to try it.
>
> By the way, I found that the grouped aggregation feature is mentioned in
> the design document of Arrow C++ Query Engine.  Is Query Engine, not Arrow
> C++ Core, a suitable location to implement group-by operation?


Re: Arrow sync call August 5 at 12:00 US/Eastern, 16:00 UTC

2020-08-05 Thread Neal Richardson
Attendees:
Projjal Chanda
Fred Gan
Andy Grove
Todd Hendricks
Jörn Horstmann
Ben Kietzman
Rok Mihevc
Neal Richardson
Paul Taylor
Andrew Wieteska

Discussion
* 1.0.1
  * Andy: Rust packaging issue, need to test on published crate
  * Timing: week of August 17
* Bug in dictionary batches in device memory: Paul hit a segfault trying to
update cudf to use Arrow 1.0; he will make a PR
* Rust data frame projects, plans for collaboration (Jörn, Andy)

On Tue, Aug 4, 2020 at 1:55 PM Neal Richardson 
wrote:

> Hi all,
> Reminder that our biweekly call is tomorrow at
> https://meet.google.com/vtm-teks-phx. All are welcome to join. Notes will
> be sent out to the mailing list afterward.
>
> Neal
>


[DISCUSS][C++] Group by operation for RecordBatch and Table

2020-08-05 Thread Kenta Murata
Hi folks,

Red Arrow, the Ruby binding of Arrow GLib, implements grouped aggregation
features for RecordBatch and Table.  Because these features are written in
Ruby, they are too slow for large datasets.  We need to make them much
faster.

To improve their calculation speed, they should be written in C++, and
should be put in Arrow C++ instead of Red Arrow.

Is anyone working on implementing a group-by operation for RecordBatch and
Table in Arrow C++?  If no one has worked on it, I would like to try it.

By the way, I found that the grouped aggregation feature is mentioned in
the design document of the Arrow C++ Query Engine.  Is the Query Engine, rather
than Arrow C++ Core, the suitable location to implement the group-by operation?


Re: Proposal for arrow DataFrame low level structure and primitives (Was: Two proposals for expanding arrow Table API (virtual arrays and random access))

2020-08-05 Thread Radu Teodorescu


> I will have a closer look and comment most likely next week.

Thank you!

> 
> Unfortunately, having code developed in external repositories increases the
> complexity of importing that code back into the Apache project.  Not sure if
> you’re interested in preemptively following the project’s style guide (file
> naming, C++ code style, etc) but that would also help.

I understand that challenge. My intent was to prove to myself and anyone else
that there is a satisfying implementation that provides the semantics and the
performance levels I am referring to in my proposals. It is a reference
implementation, and certainly not something that can be dropped in directly in
its current form (for example, I am leaning quite heavily on C++14/17 and a bit
of C++20), but if the vision makes sense I would love to bring that into Arrow.

> On Wed, Aug 5, 2020 at 7:43 AM Radu Teodorescu 
> wrote:
> 
>> Wes & crew,
>> Congratulations and thank you for the successful 1.0 rollout , it is
>> certainly making a huge difference for my day job!
>> Is it a good time now to revive the conversation below? (and
>> https://github.com/apache/arrow/pull/7548 )
>> I have also gone ahead and released a prototype the covers some of the
>> more hand wavy parts of my interface proposal (aka ways to compose arrays
>> in a dataframe that controls the balance between fragmentation and buffer
>> copying) - it is here: https://github.com/raduteo/framespaces/tree/master
>>  and it lacks in
>> documentation but the basic data structures are robustly implemented and
>> tested so if we find merits in the original PR:
>> https://github.com/apache/arrow/pull/7548 <
>> https://github.com/apache/arrow/pull/7548> , there should be a reasonable
>> path for implementing most of it.
>> 
>> Thank you
>> Radu
>> 
>> 
>>> On Jun 25, 2020, at 3:10 PM, Radu Teodorescu
>>  wrote:
>>> 
>>> Understood and agreed
>>> My proposal really addresses a number of mechanisms on layer 2 (
>> "Virtual" tables) in your taxonomy (I can adjust interface names
>> accordingly as part of the review process).
>>> One additional element I am proposing here is the ability to insert and
>> modify rows in a vectorized fashion - they follow the same mechanics as
>> “filter” which is effectively (i.e. row removal)
>>> and I think they are quite important as an efficiently supported
>> construct (for things like data cleanup, data set updates etc.)
>>> 
>>> I’m really looking forward to hear more of your thoughts (as well as
>> anybody else’s who is interested in this topic)
>>> Radu
>>> 
>>> 
 On Jun 25, 2020, at 2:52 PM, Wes McKinney  wrote:
 
 hi Radu,
 
 It's going to be challenging for me to review in detail until after
 the 1.0.0 release is out, but in general I think there are 3 layers
 that we need to be talking about:
 
 * Materialized in-memory tables
 * "Virtual" tables, whose in-memory/not-in-memory semantics are not
 exposed -- permitting column selection, iteration as for execution of
 query engine operators (e.g. projection, filter, join, aggregate), and
 random access
 * "Data Frame API": a programming interface for expressing analytical
 operations on virtual tables. A data frame could be exported to
 materialized tables / record batches e.g. for writing to Parquet or
 IPC streams
 
 In principle the "Data Frame API" shouldn't need to know much about
 the first two layers, instead working with high level primitives and
 leaving the execution of those primitives to the layers below. Does
 this make sense?
 
 I think we should be pretty strict about separation of concerns
 between these three layers . I'll dig in in more detail sometime after
 July 4.
 
 Thanks
 Wes
 
 
 
 
 On Thu, Jun 25, 2020 at 11:50 AM Radu Teodorescu
  wrote:
> 
> Here it is as a pull request:
> https://github.com/apache/arrow/pull/7548 <
>> https://github.com/apache/arrow/pull/7548>
> 
> I hope this can be a starter for an active conversation diving into
>> specifics, and I look forward to contribute with more design and algorithm
>> ideas as well as concrete code.
> 
>> On Jun 17, 2020, at 6:11 PM, Neal Richardson <
>> neal.p.richard...@gmail.com> wrote:
>> 
>> Maybe a draft pull request? If you put "WIP" in the pull request
>> title, CI
>> won't run builds on it, so it's suitable for rough outlines and
>> collecting
>> feedback.
>> 
>> Neal
>> 
>> On Wed, Jun 17, 2020 at 2:57 PM Radu Teodorescu
>>  wrote:
>> 
>>> Thank you Wes!
>>> Yes, both proposals fit very nicely in your Data Frames vision, I
>> see them
>>> as deep dives on some specifics:
>>> - the virtual array doc is more fluffy an probably if you agree with
>> the
>>> general concept, the next logical move is to put out some interfaces
>> indeed
>>> - 
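
A hypothetical Python sketch of the "virtual table" layer (layer 2) in the
taxonomy quoted above; the names and method set are illustrative only, not a
committed interface:

from abc import ABC, abstractmethod
import pyarrow as pa

class VirtualTable(ABC):
    """A table whose in-memory/not-in-memory semantics are hidden."""

    @abstractmethod
    def schema(self) -> pa.Schema: ...

    @abstractmethod
    def select(self, column_names) -> "VirtualTable": ...   # column selection

    @abstractmethod
    def iter_batches(self):                                  # iteration for query operators
        ...

    @abstractmethod
    def take(self, row_indices) -> pa.Table: ...             # random access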

Re: Proposal for arrow DataFrame low level structure and primitives (Was: Two proposals for expanding arrow Table API (virtual arrays and random access))

2020-08-05 Thread Wes McKinney
I will have a closer look and comment most likely next week.

Unfortunately, having code developed in external repositories increases the
complexity of importing that code back into the Apache project.  Not sure if
you’re interested in preemptively following the project’s style guide (file
naming, C++ code style, etc) but that would also help.

On Wed, Aug 5, 2020 at 7:43 AM Radu Teodorescu 
wrote:

> Wes & crew,
> Congratulations and thank you for the successful 1.0 rollout , it is
> certainly making a huge difference for my day job!
> Is it a good time now to revive the conversation below? (and
> https://github.com/apache/arrow/pull/7548 )
> I have also gone ahead and released a prototype the covers some of the
> more hand wavy parts of my interface proposal (aka ways to compose arrays
> in a dataframe that controls the balance between fragmentation and buffer
> copying) - it is here: https://github.com/raduteo/framespaces/tree/master
>  and it lacks in
> documentation but the basic data structures are robustly implemented and
> tested so if we find merits in the original PR:
> https://github.com/apache/arrow/pull/7548 <
> https://github.com/apache/arrow/pull/7548> , there should be a reasonable
> path for implementing most of it.
>
> Thank you
> Radu
>
>
> > On Jun 25, 2020, at 3:10 PM, Radu Teodorescu
>  wrote:
> >
> > Understood and agreed
> > My proposal really addresses a number of mechanisms on layer 2 (
> "Virtual" tables) in your taxonomy (I can adjust interface names
> accordingly as part of the review process).
> > One additional element I am proposing here is the ability to insert and
> modify rows in a vectorized fashion - they follow the same mechanics as
> “filter” which is effectively (i.e. row removal)
> > and I think they are quite important as an efficiently supported
> construct (for things like data cleanup, data set updates etc.)
> >
> > I’m really looking forward to hear more of your thoughts (as well as
> anybody else’s who is interested in this topic)
> > Radu
> >
> >
> >> On Jun 25, 2020, at 2:52 PM, Wes McKinney  wrote:
> >>
> >> hi Radu,
> >>
> >> It's going to be challenging for me to review in detail until after
> >> the 1.0.0 release is out, but in general I think there are 3 layers
> >> that we need to be talking about:
> >>
> >> * Materialized in-memory tables
> >> * "Virtual" tables, whose in-memory/not-in-memory semantics are not
> >> exposed -- permitting column selection, iteration as for execution of
> >> query engine operators (e.g. projection, filter, join, aggregate), and
> >> random access
> >> * "Data Frame API": a programming interface for expressing analytical
> >> operations on virtual tables. A data frame could be exported to
> >> materialized tables / record batches e.g. for writing to Parquet or
> >> IPC streams
> >>
> >> In principle the "Data Frame API" shouldn't need to know much about
> >> the first two layers, instead working with high level primitives and
> >> leaving the execution of those primitives to the layers below. Does
> >> this make sense?
> >>
> >> I think we should be pretty strict about separation of concerns
> >> between these three layers . I'll dig in in more detail sometime after
> >> July 4.
> >>
> >> Thanks
> >> Wes
> >>
> >>
> >>
> >>
> >> On Thu, Jun 25, 2020 at 11:50 AM Radu Teodorescu
> >>  wrote:
> >>>
> >>> Here it is as a pull request:
> >>> https://github.com/apache/arrow/pull/7548 <
> https://github.com/apache/arrow/pull/7548>
> >>>
> >>> I hope this can be a starter for an active conversation diving into
> specifics, and I look forward to contribute with more design and algorithm
> ideas as well as concrete code.
> >>>
>  On Jun 17, 2020, at 6:11 PM, Neal Richardson <
> neal.p.richard...@gmail.com> wrote:
> 
>  Maybe a draft pull request? If you put "WIP" in the pull request
> title, CI
>  won't run builds on it, so it's suitable for rough outlines and
> collecting
>  feedback.
> 
>  Neal
> 
>  On Wed, Jun 17, 2020 at 2:57 PM Radu Teodorescu
>   wrote:
> 
> > Thank you Wes!
> > Yes, both proposals fit very nicely in your Data Frames vision, I
> see them
> > as deep dives on some specifics:
> > - the virtual array doc is more fluffy an probably if you agree with
> the
> > general concept, the next logical move is to put out some interfaces
> indeed
> > - the random access doc goes into more details and I am curious what
> you
> > think about some of the concepts
> >
> > I will follow up shortly with some interfaces - do you prefer
> references
> > to a repo, inline them in an email or add them as comments to your
> doc?
> >
> >
> >> On Jun 17, 2020, at 4:26 PM, Wes McKinney 
> wrote:
> >>
> >> hi Radu,
> >>
> >> I'll read the proposals in more detail when I can and make comments,
> >> but this has always been something of interest (see, e.g. [1]). The
> >> 

Re: [DISCUSS] How to extended time value range for Timestamp type?

2020-08-05 Thread Wes McKinney
I also am not sure there is a good case for a new built-in type since it
introduces a good deal of complexity, particularly when there is the
extension type option. We’ve been living with 64-bit nanoseconds in pandas
for a decade, for example (and without the option for lower resolutions!!),
and while it does arise as a limitation from time to time, the use cases
are so specialized that it has never made sense to do anything about it.

On Tue, Aug 4, 2020 at 11:26 PM Micah Kornfield 
wrote:

> I think a stronger case needs to be made for adding a new builtin type to
> support this.  Can you provide concrete use-cases?  Why can't dates outside
> of the range representable by int64 be truncated (even at nanosecond
> precision, the 64-bit max value is over 200 years in the future)?  It seems
> like in most cases, values at the nanosecond level that are outside the range
> representable by 64 bits are sentinel values.
>
> FWIW, Parquet had an int96 type that was used for timestamps but it has
> been deprecated [1] in favor of int64 nanos.
>
> -Micah
>
> [1] https://issues.apache.org/jira/browse/PARQUET-323
>
> On Tue, Aug 4, 2020 at 8:52 PM Fan Liya  wrote:
>
> > Hi Ji,
> >
> > This sounds like a universal requirement, as 64 bits are not sufficient to
> > hold nanosecond precision over the full range.
> >
> > For the extension type, we have two choices:
> > 1. Extending struct(int64, int32), which represents the design of SoA
> > (Struct of Arrays).
> > 2. Extending fixed width binary(12), which represents the design of AoS
> > (Array of Structs)
> >
> > Given the universal requirement, I'd prefer a new type.
> >
> > Best,
> > Liya Fan
> >
> >
> > On Wed, Aug 5, 2020 at 11:18 AM Ji Liu  wrote:
> >
> > > Hi all,
> > >
> > > Now the Arrow Timestamp type supports different TimeUnits (seconds,
> > > milliseconds, microseconds, nanoseconds) with an int64 type for storage.  In
> > > most cases this is enough, but if the timestamp value range of an external
> > > system exceeds int64_t::max, then it's impossible to directly convert to an
> > > Arrow Timestamp.  Consider the following use case:
> > >
> > > A timestamp in another system with int64 + int32 (storing milliseconds and
> > > nanoseconds) can represent data from -00-00 to -12-31
> > > 23:59:59.9.  If we want to convert a type like this, how should we
> > > do it?  One could probably create an extension type with struct(int64, int32)
> > > for storage.
> > >
> > > Besides an ExtensionType, are we considering extending our Timestamp for a
> > > wider range, or maybe adding a new type for cases like the above?
> > >
> > >
> > > Thanks,
> > > Ji Liu
> > >
> >
>
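
For the struct(int64, int32) option discussed above, a minimal extension-type
sketch in Python (the type name, field names, and serialization are
hypothetical, not a committed design; assumes pyarrow >= 1.0.0):

import pyarrow as pa

class ExtendedTimestampType(pa.ExtensionType):
    def __init__(self):
        storage = pa.struct([("seconds", pa.int64()), ("nanos", pa.int32())])
        pa.ExtensionType.__init__(self, storage, "example.extended_timestamp")

    def __arrow_ext_serialize__(self):
        return b""  # no parameters to serialize in this sketch

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return ExtendedTimestampType()

ty = ExtendedTimestampType()
storage = pa.array([{"seconds": 1, "nanos": 500}], type=ty.storage_type)
arr = pa.ExtensionArray.from_storage(ty, storage)
print(arr.type)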


Re: Proposal for arrow DataFrame low level structure and primitives (Was: Two proposals for expanding arrow Table API (virtual arrays and random access))

2020-08-05 Thread Radu Teodorescu
Wes & crew,
Congratulations and thank you for the successful 1.0 rollout; it is certainly
making a huge difference for my day job!
Is it a good time now to revive the conversation below (and
https://github.com/apache/arrow/pull/7548)?
I have also gone ahead and released a prototype that covers some of the more
hand-wavy parts of my interface proposal (i.e. ways to compose arrays in a
dataframe that control the balance between fragmentation and buffer copying).
It is here: https://github.com/raduteo/framespaces/tree/master
It lacks documentation, but the basic data structures are robustly implemented
and tested, so if we find merit in the original PR
(https://github.com/apache/arrow/pull/7548), there should be a reasonable path
for implementing most of it.
 
Thank you
Radu
 

> On Jun 25, 2020, at 3:10 PM, Radu Teodorescu  
> wrote:
> 
> Understood and agreed
> My proposal really addresses a number of mechanisms on layer 2 ( "Virtual" 
> tables) in your taxonomy (I can adjust interface names accordingly as part of 
> the review process).
> One additional element I am proposing here is the ability to insert and 
> modify rows in a vectorized fashion - they follow the same mechanics as 
> “filter” which is effectively (i.e. row removal) 
> and I think they are quite important as an efficiently supported construct 
> (for things like data cleanup, data set updates etc.)
> 
> I’m really looking forward to hear more of your thoughts (as well as anybody 
> else’s who is interested in this topic)
> Radu 
> 
> 
>> On Jun 25, 2020, at 2:52 PM, Wes McKinney  wrote:
>> 
>> hi Radu,
>> 
>> It's going to be challenging for me to review in detail until after
>> the 1.0.0 release is out, but in general I think there are 3 layers
>> that we need to be talking about:
>> 
>> * Materialized in-memory tables
>> * "Virtual" tables, whose in-memory/not-in-memory semantics are not
>> exposed -- permitting column selection, iteration as for execution of
>> query engine operators (e.g. projection, filter, join, aggregate), and
>> random access
>> * "Data Frame API": a programming interface for expressing analytical
>> operations on virtual tables. A data frame could be exported to
>> materialized tables / record batches e.g. for writing to Parquet or
>> IPC streams
>> 
>> In principle the "Data Frame API" shouldn't need to know much about
>> the first two layers, instead working with high level primitives and
>> leaving the execution of those primitives to the layers below. Does
>> this make sense?
>> 
>> I think we should be pretty strict about separation of concerns
>> between these three layers . I'll dig in in more detail sometime after
>> July 4.
>> 
>> Thanks
>> Wes
>> 
>> 
>> 
>> 
>> On Thu, Jun 25, 2020 at 11:50 AM Radu Teodorescu
>>  wrote:
>>> 
>>> Here it is as a pull request:
>>> https://github.com/apache/arrow/pull/7548 
>>> 
>>> 
>>> I hope this can be a starter for an active conversation diving into 
>>> specifics, and I look forward to contribute with more design and algorithm 
>>> ideas as well as concrete code.
>>> 
 On Jun 17, 2020, at 6:11 PM, Neal Richardson  
 wrote:
 
 Maybe a draft pull request? If you put "WIP" in the pull request title, CI
 won't run builds on it, so it's suitable for rough outlines and collecting
 feedback.
 
 Neal
 
 On Wed, Jun 17, 2020 at 2:57 PM Radu Teodorescu
  wrote:
 
> Thank you Wes!
> Yes, both proposals fit very nicely in your Data Frames vision, I see them
> as deep dives on some specifics:
> - the virtual array doc is more fluffy an probably if you agree with the
> general concept, the next logical move is to put out some interfaces 
> indeed
> - the random access doc goes into more details and I am curious what you
> think about some of the concepts
> 
> I will follow up shortly with some interfaces - do you prefer references
> to a repo, inline them in an email or add them as comments to your doc?
> 
> 
>> On Jun 17, 2020, at 4:26 PM, Wes McKinney  wrote:
>> 
>> hi Radu,
>> 
>> I'll read the proposals in more detail when I can and make comments,
>> but this has always been something of interest (see, e.g. [1]). The
>> intent with the "C++ data frames" project that we've discussed (and I
>> continue to labor towards, e.g. recent compute engine work is directly
>> in service of this) has always been to be able to express computations
>> on non-RAM-resident datasets [2]
>> 
>> As one initial high level point of discussion, I think what you're
>> describing in these documents should probably be _new_ C++ classes and
>> _new_ virtual interfaces, not an evolution of the current arrow::Table
>> or arrow::Array/ChunkedArray classes. One practical path forward in

[NIGHTLY] Arrow Build Report for Job nightly-2020-08-05-0

2020-08-05 Thread Crossbow


Arrow Build Report for Job nightly-2020-08-05-0

All tasks: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-05-0

Failed Tasks:
- conda-linux-gcc-py36-cpu:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-05-0-azure-conda-linux-gcc-py36-cpu
- conda-linux-gcc-py36-cuda:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-05-0-azure-conda-linux-gcc-py36-cuda
- conda-linux-gcc-py37-cpu:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-05-0-azure-conda-linux-gcc-py37-cpu
- conda-linux-gcc-py37-cuda:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-05-0-azure-conda-linux-gcc-py37-cuda
- test-conda-python-3.7-hdfs-2.9.2:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-05-0-github-test-conda-python-3.7-hdfs-2.9.2
- test-conda-python-3.7-kartothek-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-05-0-github-test-conda-python-3.7-kartothek-latest
- test-conda-python-3.7-kartothek-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-05-0-github-test-conda-python-3.7-kartothek-master
- test-r-rstudio-r-base-3.6-bionic:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-05-0-azure-test-r-rstudio-r-base-3.6-bionic
- test-r-rstudio-r-base-3.6-opensuse15:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-05-0-azure-test-r-rstudio-r-base-3.6-opensuse15
- test-ubuntu-18.04-cpp-release:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-05-0-circle-test-ubuntu-18.04-cpp-release
- ubuntu-focal-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-05-0-github-ubuntu-focal-amd64

Pending Tasks:
- conda-osx-clang-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-05-0-azure-conda-osx-clang-py36
- wheel-manylinux2010-cp35m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-05-0-azure-wheel-manylinux2010-cp35m

Succeeded Tasks:
- centos-6-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-05-0-github-centos-6-amd64
- centos-7-aarch64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-05-0-travis-centos-7-aarch64
- centos-7-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-05-0-github-centos-7-amd64
- centos-8-aarch64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-05-0-travis-centos-8-aarch64
- centos-8-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-05-0-github-centos-8-amd64
- conda-clean:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-05-0-azure-conda-clean
- conda-linux-gcc-py38-cpu:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-05-0-azure-conda-linux-gcc-py38-cpu
- conda-linux-gcc-py38-cuda:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-05-0-azure-conda-linux-gcc-py38-cuda
- conda-osx-clang-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-05-0-azure-conda-osx-clang-py37
- conda-osx-clang-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-05-0-azure-conda-osx-clang-py38
- conda-win-vs2017-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-05-0-azure-conda-win-vs2017-py36
- conda-win-vs2017-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-05-0-azure-conda-win-vs2017-py37
- conda-win-vs2017-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-05-0-azure-conda-win-vs2017-py38
- debian-buster-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-05-0-github-debian-buster-amd64
- debian-buster-arm64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-05-0-travis-debian-buster-arm64
- debian-stretch-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-05-0-github-debian-stretch-amd64
- debian-stretch-arm64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-05-0-travis-debian-stretch-arm64
- gandiva-jar-osx:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-05-0-travis-gandiva-jar-osx
- gandiva-jar-xenial:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-05-0-travis-gandiva-jar-xenial
- homebrew-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-05-0-travis-homebrew-cpp
- homebrew-r-autobrew:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-05-0-travis-homebrew-r-autobrew
- 

Re: [DISSCUSS][JAVA] Avoid set reader/writer indices in FieldVector#getFieldBuffers

2020-08-05 Thread Ji Liu
Hi Liya,
Thanks for your careful review. It is a typo; the order of getBuffers is
wrong.

Fan Liya  于2020年8月5日周三 下午2:14写道:

> Hi Ji,
>
> IMO, for the correct order, the validity buffer should precede the offset
> buffer (e.g. this is the order used by BaseVariableWidthVector &
> BaseLargeVariableWidthVector).
> In ListVector#getBuffers, the offset buffer precedes the validity buffer,
> so I am a little confused why you say the order of ListVector#getBuffers is
> right?
>
> Best,
> Liya Fan
>
> On Wed, Aug 5, 2020 at 12:32 PM Micah Kornfield 
> wrote:
>
> > FWIW, I lack historical context on how these methods evolved, so I'd
> > appreciate insight from anyone who has worked on the Java codebase for a
> > longer period of time.  The current situation seems less than ideal.
> >
> > On Tue, Aug 4, 2020 at 12:55 AM Ji Liu  wrote:
> >
> > > Hi all,
> > >
> > > When I worked on ARROW-7539 [1], I ran into some problems and I'm not sure
> > > what the proper way to solve them is.
> > >
> > > The issue is about avoiding setting reader/writer indices in
> > > FieldVector#getFieldBuffers, for the following reasons:
> > >
> > > i. getBuffers sets reader/writer indices, and that is right for the purpose
> > > of sending the data over the wire
> > >
> > > ii. getFieldBuffers is distinct from getBuffers; it should be for getting
> > > access to the underlying data for higher-performance algorithms
> > >
> > > Currently, VectorUnloader uses getFieldBuffers to create the
> > > ArrowRecordBatch, which is why we keep writer/reader indices in
> > > getFieldBuffers; we should use getBuffers instead. But during the change,
> > > we found another problem:
> > >
> > > The validity and offset buffers are not in the same order in ListVector
> > > (getBuffers's order is right), and changing the API in VectorUnloader
> > > creates problems with serialization/deserialization, resulting in test
> > > failures in Dremio, which would break backward compatibility with existing
> > > serialised files.
> > >
> > > Micah suggested a solution, but it does not seem to have reached consensus
> > > in the PR thread [2]:
> > >
> > >    1. Remove setReaderWriterIndeces in getFieldBuffers
> > >    2. Deprecate getBuffers
> > >    3. Introduce a new getIpcBuffers which is unambiguously used for writing
> > >    record batches (i.e. in VectorUnloader).
> > >    4. Update the documentation where it makes sense based on all this
> > >    conversation.
> > >
> > > More details and discussion can be found in the PR; I hope to get more
> > > feedback.
> > >
> > > Thanks,
> > > Ji Liu
> > >
> > > [1] https://issues.apache.org/jira/browse/ARROW-7539
> > > [2] https://github.com/apache/arrow/pull/6156
> >
>
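
For reference, the columnar format puts the validity bitmap before the offsets
buffer for a List array (the order Liya describes), and the C++/Python
implementation exposes the buffers in that order; a quick check (assuming
pyarrow >= 1.0.0):

import pyarrow as pa

arr = pa.array([[1, 2], None, [3]])  # ListArray with one null slot
# Buffers in format order: validity bitmap, offsets, then the child's buffers.
for i, buf in enumerate(arr.buffers()):
    print(i, None if buf is None else buf.size)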


Re: [DISSCUSS][JAVA] Avoid set reader/writer indices in FieldVector#getFieldBuffers

2020-08-05 Thread Fan Liya
Hi Ji,

IMO, for the correct order, the validity buffer should precede the offset
buffer (e.g. this is the order used by BaseVariableWidthVector &
BaseLargeVariableWidthVector).
In ListVector#getBuffers, the offset buffer precedes the validity buffer,
so I am a little confused why you say the order of ListVector#getBuffers is
right?

Best,
Liya Fan

On Wed, Aug 5, 2020 at 12:32 PM Micah Kornfield 
wrote:

> FWIW, I lack historical context on how these methods evolved, so I'd
> appreciate insight from anyone who has worked on the Java codebase for a
> longer period of time.  The current situation seems less than ideal.
>
> On Tue, Aug 4, 2020 at 12:55 AM Ji Liu  wrote:
>
> > Hi all,
> >
> > When I worked on ARROW-7539 [1], I ran into some problems and I'm not sure
> > what the proper way to solve them is.
> >
> > The issue is about avoiding setting reader/writer indices in
> > FieldVector#getFieldBuffers, for the following reasons:
> >
> > i. getBuffers sets reader/writer indices, and that is right for the purpose
> > of sending the data over the wire
> >
> > ii. getFieldBuffers is distinct from getBuffers; it should be for getting
> > access to the underlying data for higher-performance algorithms
> >
> > Currently, VectorUnloader uses getFieldBuffers to create the
> > ArrowRecordBatch, which is why we keep writer/reader indices in
> > getFieldBuffers; we should use getBuffers instead. But during the change,
> > we found another problem:
> >
> > The validity and offset buffers are not in the same order in ListVector
> > (getBuffers's order is right), and changing the API in VectorUnloader
> > creates problems with serialization/deserialization, resulting in test
> > failures in Dremio, which would break backward compatibility with existing
> > serialised files.
> >
> > Micah suggested a solution, but it does not seem to have reached consensus
> > in the PR thread [2]:
> >
> >    1. Remove setReaderWriterIndeces in getFieldBuffers
> >    2. Deprecate getBuffers
> >    3. Introduce a new getIpcBuffers which is unambiguously used for writing
> >    record batches (i.e. in VectorUnloader).
> >    4. Update the documentation where it makes sense based on all this
> >    conversation.
> >
> > More details and discussion can be found in the PR; I hope to get more
> > feedback.
> >
> > Thanks,
> > Ji Liu
> >
> > [1] https://issues.apache.org/jira/browse/ARROW-7539
> > [2] https://github.com/apache/arrow/pull/6156
> >
>