RE: How about inet4/inet6/macaddr data types?

2019-04-29 Thread Melik-Adamyan, Areg
If you want to store and manipulate them, the best format is integers (or binary):
it allows all the fast operations such as masking and subnet querying, but the
text representation will require conversion.
It depends heavily on the use case, but converting an integer to PostgreSQL's inet
or cidr is very straightforward. Alternatively, you can store the addresses as text,
in which case the conversion is done automatically on the pgsql side, but operations
in Arrow, e.g. comparison, hashing or sorting, will be costly.
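For illustration, here is a minimal sketch of the integer approach (assuming only
pyarrow and the Python standard library; none of this is an Arrow-specified type):
IPv4 addresses stored as uint32 values, with a subnet query expressed as integer
masking.

import socket
import struct

import pyarrow as pa

def ipv4_to_int(addr):
    """Convert a dotted-quad IPv4 string to a 32-bit integer."""
    return struct.unpack("!I", socket.inet_aton(addr))[0]

addrs = ["192.168.1.17", "192.168.2.5", "192.168.1.200"]
arr = pa.array([ipv4_to_int(a) for a in addrs], type=pa.uint32())

# WHERE addr <<= inet '192.168.1.0/24' expressed as a mask comparison.
network = ipv4_to_int("192.168.1.0")
mask = 0xFFFFFF00  # /24
in_subnet = [(v & mask) == network for v in arr.to_pylist()]
print(in_subnet)  # [True, False, True]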

> -Original Message-
> From: Kohei KaiGai [mailto:kai...@heterodb.com]
> Sent: Monday, April 29, 2019 8:20 PM
> To: dev@arrow.apache.org
> Subject: How about inet4/inet6/macaddr data types?
> 
> Hello folks,
> 
> What are your opinions on network address type support in the Apache Arrow
> data format?
> Network addresses appear massively in the logs generated by all kinds of
> network facilities, and they are significant information when people analyze
> their past logs.
> 
> I'm working on Apache Arrow format mapping on PostgreSQL.
> http://heterodb.github.io/pg-strom/arrow_fdw/
> 
> This extension allows reading Arrow files as if they were PostgreSQL tables,
> using foreign tables.
> Arrow data types are mapped to the relevant PostgreSQL data types
> according to the above documentation.
> 
> https://www.postgresql.org/docs/current/datatype-net-types.html
> PostgreSQL supports some network address types and operators.
> For example, we can put a qualifier like:   WHERE addr <<= inet
> '192.168.1.0/24' , to find out all
> the records in the subnet of '192.168.1.0/24'.
> 
> Probably, these three data types are now sufficient for most network
> logs: inet4, inet6 and macaddr.
> * inet4 is 32bit + optional 8bit (for netmask) fixed length array
> * inet6 is 128bit + optional 8bit (for netmask) fixed length array
> * macaddr is 48bit fixed length array.
> 
> I don't favor mapping the inetX types onto the variable-length Binary data type,
> because it takes a 32bit offset to indicate a 32 or 40bit value, which is quite
> inefficient, even though PostgreSQL allows mixing inet4/inet6 values in the
> same column.
> 
> Thanks,
> --
> HeteroDB, Inc / The PG-Strom Project
> KaiGai Kohei 


[jira] [Created] (ARROW-5242) Arrow doesn't compile cleanly with Visual Studio 2017 Update 9 or later due to narrowing

2019-04-29 Thread Billy Robert O'Neal III (JIRA)
Billy Robert O'Neal III created ARROW-5242:
--

 Summary: Arrow doesn't compile cleanly with Visual Studio 2017 
Update 9 or later due to narrowing
 Key: ARROW-5242
 URL: https://issues.apache.org/jira/browse/ARROW-5242
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Billy Robert O'Neal III


The std::string constructor call here is narrowing wchar_t to char, which emits 
warning C4244 on current releases of Visual Studio: 
[https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/datetime/tz.cpp#L205]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


How about inet4/inet6/macaddr data types?

2019-04-29 Thread Kohei KaiGai
Hello folks,

What are your opinions on network address type support in the Apache
Arrow data format?
Network addresses appear massively in the logs generated by all kinds of
network facilities, and they are significant information when people
analyze their past logs.

I'm working on Apache Arrow format mapping on PostgreSQL.
http://heterodb.github.io/pg-strom/arrow_fdw/

This extension allows reading Arrow files as if they were PostgreSQL tables,
using foreign tables.
Arrow data types are mapped to the relevant PostgreSQL data types
according to the above documentation.

https://www.postgresql.org/docs/current/datatype-net-types.html
PostgreSQL supports some network address types and operators.
For example, we can put a qualifier like:   WHERE addr <<= inet
'192.168.1.0/24' , to find out all
the records in the subnet of '192.168.1.0/24'.

Probably, these three data types are now sufficient for most network
logs: inet4, inet6 and macaddr.
* inet4 is 32bit + optional 8bit (for netmask) fixed length array
* inet6 is 128bit + optional 8bit (for netmask) fixed length array
* macaddr is 48bit fixed length array.

I don't favor mapping the inetX types onto the variable-length Binary data
type, because it takes a 32bit offset to indicate a 32 or 40bit value, which
is quite inefficient, even though PostgreSQL allows mixing inet4/inet6 values
in the same column.
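As an illustration of the fixed-width layout preferred here, a minimal sketch
(assuming a recent pyarrow build where fixed-size binary arrays can be built from
Python bytes; the 5-byte "address + netmask" encoding is only this proposal, not
an existing Arrow type):

import socket

import pyarrow as pa

def inet4_to_bytes(addr, prefix=32):
    """Pack '192.168.1.0' plus a prefix length into 5 bytes (4 address + 1 mask)."""
    return socket.inet_aton(addr) + bytes([prefix])

inet4_type = pa.binary(5)  # fixed-size binary: no per-value offsets needed
arr = pa.array(
    [inet4_to_bytes("192.168.1.0", 24), inet4_to_bytes("10.0.0.1")],
    type=inet4_type,
)
print(arr.type)  # fixed_size_binary[5]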

Thanks,
-- 
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei 


Re: [Contribution][Proposal] Use Contributors file and Signed-Off-By Process for Arrow

2019-04-29 Thread Wes McKinney
AFAIK no one has been employing systematic IP scanning tools;
generally when there is code reuse in a pull request it is fairly
obvious. It would be interesting to know how large, mature Apache
projects (Apache Hadoop, Apache Spark, etc.) have approached this
problem.

On Mon, Apr 29, 2019 at 5:13 PM Melik-Adamyan, Areg
 wrote:
>
> Hi Wes, thanks for the reply. How do the committers and PMC check the IP
> currently? Is there a standard tool for it that you use?
>
> > -Original Message-
> > From: Wes McKinney [mailto:wesmck...@gmail.com]
> > Sent: Monday, April 29, 2019 4:39 PM
> > To: dev@arrow.apache.org
> > Subject: Re: [Contribution][Proposal] Use Contributors file and 
> > Signed-Off-By
> > Process for Arrow
> >
> > hi Areg,
> >
> > I think this is a question for ASF Legal and not Apache Arrow directly. Some
> > contributors submit a ICLA or CCLA to the project, but broadly it is the
> > responsibility of the Committers and PMC members to steward IP in the
> > project, and one of the parts of the release process is to verify that the
> > software has complied with the ASF's licensing policies [1]
> >
> > Thanks
> > Wes
> >
> > [1]: https://apache.org/legal/resolved.html
> >
> > On Mon, Apr 29, 2019 at 4:27 PM Melik-Adamyan, Areg  > adam...@intel.com> wrote:
> > >
> > > To avoid contaminating the Arrow code with wrongly licensed code (including
> > > GPL code) that can be accidentally included into Arrow, and to track
> > > contributions, maintainers need to check whether the committer has signed
> > > the ICLA or CCLA and is listed in a contributors file - which we do not
> > > have. This is needed to establish a clean chain of contribution that
> > > safeguards 3rd parties. So let's either add a CONTRIBUTORS file, or add a
> > > "signed-off-by process" [1] as used in the Linux kernel. The latter would
> > > allow single-patch contributions without CLA submission. I do not know what
> > > the Apache Foundation's requirements are, but the web page does not state a
> > > sole requirement of a CLA.
> > >
> > > [1] https://ltsi.linuxfoundation.org/software/signed-off-process/
> > >


RE: [Contribution][Proposal] Use Contributors file and Signed-Off-By Process for Arrow

2019-04-29 Thread Melik-Adamyan, Areg
Hi Wes, thanks for the reply. How do the committers and PMC check the IP
currently? Is there a standard tool for it that you use?

> -Original Message-
> From: Wes McKinney [mailto:wesmck...@gmail.com]
> Sent: Monday, April 29, 2019 4:39 PM
> To: dev@arrow.apache.org
> Subject: Re: [Contribution][Proposal] Use Contributors file and Signed-Off-By
> Process for Arrow
> 
> hi Areg,
> 
> I think this is a question for ASF Legal and not Apache Arrow directly. Some
> contributors submit a ICLA or CCLA to the project, but broadly it is the
> responsibility of the Committers and PMC members to steward IP in the
> project, and one of the parts of the release process is to verify that the
> software has complied with the ASF's licensing policies [1]
> 
> Thanks
> Wes
> 
> [1]: https://apache.org/legal/resolved.html
> 
> On Mon, Apr 29, 2019 at 4:27 PM Melik-Adamyan, Areg  adam...@intel.com> wrote:
> >
> > To avoid contaminating the Arrow code with wrongly licensed code (including
> > GPL code) that can be accidentally included into Arrow, and to track
> > contributions, maintainers need to check whether the committer has signed
> > the ICLA or CCLA and is listed in a contributors file - which we do not
> > have. This is needed to establish a clean chain of contribution that
> > safeguards 3rd parties. So let's either add a CONTRIBUTORS file, or add a
> > "signed-off-by process" [1] as used in the Linux kernel. The latter would
> > allow single-patch contributions without CLA submission. I do not know what
> > the Apache Foundation's requirements are, but the web page does not state a
> > sole requirement of a CLA.
> >
> > [1] https://ltsi.linuxfoundation.org/software/signed-off-process/
> >


[jira] [Created] (ARROW-5241) [Python] Add option to disable writing statistics

2019-04-29 Thread Deepak Majeti (JIRA)
Deepak Majeti created ARROW-5241:


 Summary: [Python] Add option to disable writing statistics
 Key: ARROW-5241
 URL: https://issues.apache.org/jira/browse/ARROW-5241
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Deepak Majeti
 Fix For: 0.14.0


The C++ Parquet API exposes an option to disable writing statistics when writing a
Parquet file.
It would be useful to expose this option in the Python Arrow API as well.
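A sketch of how the requested option might look once exposed in Python; the
{{write_statistics}} keyword below is hypothetical at the time of this issue
(only the C++ writer property is assumed to exist).

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.Table.from_arrays([pa.array([1, 2, 3])], ['x'])
# Hypothetical keyword mirroring the existing C++ writer option.
pq.write_table(table, 'no_stats.parquet', write_statistics=False)
{code}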



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [Contribution][Proposal] Use Contributors file and Signed-Off-By Process for Arrow

2019-04-29 Thread Wes McKinney
hi Areg,

I think this is a question for ASF Legal and not Apache Arrow
directly. Some contributors submit an ICLA or CCLA to the project, but
broadly it is the responsibility of the Committers and PMC members to
steward IP in the project, and one of the parts of the release process
is to verify that the software has complied with the ASF's licensing
policies [1]

Thanks
Wes

[1]: https://apache.org/legal/resolved.html

On Mon, Apr 29, 2019 at 4:27 PM Melik-Adamyan, Areg
 wrote:
>
> To avoid contaminating the Arrow code with wrongly licensed code (including
> GPL code) that can be accidentally included into Arrow, and to track
> contributions, maintainers need to check whether the committer has signed
> the ICLA or CCLA and is listed in a contributors file - which we do not
> have. This is needed to establish a clean chain of contribution that
> safeguards 3rd parties. So let's either add a CONTRIBUTORS file, or add a
> "signed-off-by process" [1] as used in the Linux kernel. The latter would
> allow single-patch contributions without CLA submission. I do not know what
> the Apache Foundation's requirements are, but the web page does not state a
> sole requirement of a CLA.
>
> [1] https://ltsi.linuxfoundation.org/software/signed-off-process/
>


[jira] [Created] (ARROW-5240) [C++][CI] cmake_format 0.5.0 appears to fail the build

2019-04-29 Thread Micah Kornfield (JIRA)
Micah Kornfield created ARROW-5240:
--

 Summary: [C++][CI] cmake_format 0.5.0 appears to fail the build
 Key: ARROW-5240
 URL: https://issues.apache.org/jira/browse/ARROW-5240
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Continuous Integration
Reporter: Micah Kornfield
Assignee: Micah Kornfield






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[Contribution][Proposal] Use Contributors file and Signed-Off-By Process for Arrow

2019-04-29 Thread Melik-Adamyan, Areg
To avoid contaminating the Arrow code with wrongly licensed code (including
GPL code) that can be accidentally included into Arrow, and to track
contributions, maintainers need to check whether the committer has signed the
ICLA or CCLA and is listed in a contributors file - which we do not have. This
is needed to establish a clean chain of contribution that safeguards 3rd
parties. So let's either add a CONTRIBUTORS file, or add a "signed-off-by
process" [1] as used in the Linux kernel. The latter would allow single-patch
contributions without CLA submission. I do not know what the Apache
Foundation's requirements are, but the web page does not state a sole
requirement of a CLA.

[1] https://ltsi.linuxfoundation.org/software/signed-off-process/



Re: [DISCUSS][C++] Static versus variable Arrow dictionary encoding

2019-04-29 Thread Wes McKinney
On Mon, Apr 29, 2019 at 2:59 PM Micah Kornfield  wrote:
>
> >
> > > * The _actual_ dictionary values for a particular Array must be stored
> > > somewhere and lifetime managed. I propose to put these as a single
> > > entry in ArrayData::child_data [4]. An alternative to this would be to
> > > modify ArrayData to have a dictionary field that would be unused
> > > except for encoded datasets
> > `child_data` is supposed to mirror more or less the order of buffers in
> > an IPC stream, right?  Therefore I would favour a dedicated dictionary
> > field (also makes fetching the dictionary trivial)
>
>
> Without seeing the code, I'd also be in favor of a separate field.  Was there
> a reason you were in favor of adding another element to child_data?
>

I guess I should write some microbenchmarks to gauge the impact (if
any) of adding another non-trivial struct member. That was the main
reason (keeping the cost to non-dictionary arrays at zero).

> On Mon, Apr 29, 2019 at 11:53 AM Antoine Pitrou  wrote:
>
> >
> > Hi Wes,
> >
> > On 29/04/2019 at 20:10, Wes McKinney wrote:
> > >
> > > * Receiving a record batch schema without the dictionaries attached
> > > (e.g. in Arrow Flight), see also experimental patch [2]
> >
> > Note that this was finally done in a separate PR, and only required
> > changes in the IPC implementation.
> >
> > > Here is my proposal to reconcile these issues in C++
> > >
> > > * Add a new "synthetic" data type called "variable dictionary" to be
> > > used alongside the existing "static dictionary" type. An instance of
> > > VariableDictionaryType (name TBD) will not know what the dictionary
> > > is, only the data type of the dictionary (e.g. utf8()) and the index
> > > type (e.g. int32())
> >
> > Interesting idea.  I'm curious to see a PR.
> >
> > > * Define common abstract API for instances of static vs variable
> > > dictionary arrays. Mainly this means making
> > > DictionaryArray::dictionary [3] virtual
> >
> > I'm not sure this is required, especially if the following is implemented:
> >
> > > * The _actual_ dictionary values for a particular Array must be stored
> > > somewhere and lifetime managed. I propose to put these as a single
> > > entry in ArrayData::child_data [4]. An alternative to this would be to
> > > modify ArrayData to have a dictionary field that would be unused
> > > except for encoded datasets
> >
> > `child_data` is supposed to mirror more or less the order of buffers in
> > an IPC stream, right?  Therefore I would favour a dedicated dictionary
> > field (also makes fetching the dictionary trivial).
> >
> > > This proposal does create some ongoing implementation and maintenance
> > > burden, but to that I would make these points:
> > >
> > > * Many algorithms will dispatch from one type to the other (probably
> > > static dispatching to the variable path), so there will not be a need
> > > to implement multiple times in most cases
> >
> > Sounds believable indeed.
> >
> > Regards
> >
> > Antoine.
> >


Re: [DISCUSS][C++] Static versus variable Arrow dictionary encoding

2019-04-29 Thread Micah Kornfield
>
> > * The _actual_ dictionary values for a particular Array must be stored
> > somewhere and lifetime managed. I propose to put these as a single
> > entry in ArrayData::child_data [4]. An alternative to this would be to
> > modify ArrayData to have a dictionary field that would be unused
> > except for encoded datasets
> `child_data` is supposed to mirror more or less the order of buffers in
> an IPC stream, right?  Therefore I would favour a dedicated dictionary
> field (also makes fetching the dictionary trivial)


Without seeing the code, I'd also be in favor of a separate field.  Was there
a reason you were in favor of adding another element to child_data?

On Mon, Apr 29, 2019 at 11:53 AM Antoine Pitrou  wrote:

>
> Hi Wes,
>
> On 29/04/2019 at 20:10, Wes McKinney wrote:
> >
> > * Receiving a record batch schema without the dictionaries attached
> > (e.g. in Arrow Flight), see also experimental patch [2]
>
> Note that this was finally done in a separate PR, and only required
> changes in the IPC implementation.
>
> > Here is my proposal to reconcile these issues in C++
> >
> > * Add a new "synthetic" data type called "variable dictionary" to be
> > used alongside the existing "static dictionary" type. An instance of
> > VariableDictionaryType (name TBD) will not know what the dictionary
> > is, only the data type of the dictionary (e.g. utf8()) and the index
> > type (e.g. int32())
>
> Interesting idea.  I'm curious to see a PR.
>
> > * Define common abstract API for instances of static vs variable
> > dictionary arrays. Mainly this means making
> > DictionaryArray::dictionary [3] virtual
>
> I'm not sure this is required, especially if the following is implemented:
>
> > * The _actual_ dictionary values for a particular Array must be stored
> > somewhere and lifetime managed. I propose to put these as a single
> > entry in ArrayData::child_data [4]. An alternative to this would be to
> > modify ArrayData to have a dictionary field that would be unused
> > except for encoded datasets
>
> `child_data` is supposed to mirror more or less the order of buffers in
> an IPC stream, right?  Therefore I would favour a dedicated dictionary
> field (also makes fetching the dictionary trivial).
>
> > This proposal does create some ongoing implementation and maintenance
> > burden, but to that I would make these points:
> >
> > * Many algorithms will dispatch from one type to the other (probably
> > static dispatching to the variable path), so there will not be a need
> > to implement multiple times in most cases
>
> Sounds believable indeed.
>
> Regards
>
> Antoine.
>


Re: [DISCUSS][C++] Static versus variable Arrow dictionary encoding

2019-04-29 Thread Antoine Pitrou


Hi Wes,

On 29/04/2019 at 20:10, Wes McKinney wrote:
> 
> * Receiving a record batch schema without the dictionaries attached
> (e.g. in Arrow Flight), see also experimental patch [2]

Note that this was finally done in a separate PR, and only required
changes in the IPC implementation.

> Here is my proposal to reconcile these issues in C++
> 
> * Add a new "synthetic" data type called "variable dictionary" to be
> used alongside the existing "static dictionary" type. An instance of
> VariableDictionaryType (name TBD) will not know what the dictionary
> is, only the data type of the dictionary (e.g. utf8()) and the index
> type (e.g. int32())

Interesting idea.  I'm curious to see a PR.

> * Define common abstract API for instances of static vs variable
> dictionary arrays. Mainly this means making
> DictionaryArray::dictionary [3] virtual

I'm not sure this is required, especially if the following is implemented:

> * The _actual_ dictionary values for a particular Array must be stored
> somewhere and lifetime managed. I propose to put these as a single
> entry in ArrayData::child_data [4]. An alternative to this would be to
> modify ArrayData to have a dictionary field that would be unused
> except for encoded datasets

`child_data` is supposed to mirror more or less the order of buffers in
an IPC stream, right?  Therefore I would favour a dedicated dictionary
field (also makes fetching the dictionary trivial).

> This proposal does create some ongoing implementation and maintenance
> burden, but to that I would make these points:
> 
> * Many algorithms will dispatch from one type to the other (probably
> static dispatching to the variable path), so there will not be a need
> to implement multiple times in most cases

Sounds believable indeed.

Regards

Antoine.


[jira] [Created] (ARROW-5239) Add support for interval types in javascript

2019-04-29 Thread Micah Kornfield (JIRA)
Micah Kornfield created ARROW-5239:
--

 Summary: Add support for interval types in javascript
 Key: ARROW-5239
 URL: https://issues.apache.org/jira/browse/ARROW-5239
 Project: Apache Arrow
  Issue Type: New Feature
  Components: JavaScript
Reporter: Micah Kornfield


Update integration_test.py to include interval tests for JSTest once this is 
done.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [DISCUSS][JAVA]Support Fast/Unsafe Vector APIs for Arrow

2019-04-29 Thread Wes McKinney
I'm also curious which APIs are particularly problematic for
performance. In ARROW-1833 [1] and some related discussions there was
the suggestion of adding methods like getUnsafe, so this would be like
get(i) [2] but without checking the validity bitmap

[1] : https://issues.apache.org/jira/browse/ARROW-1833
[2]: 
https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/Float8Vector.java#L99

On Mon, Apr 29, 2019 at 1:05 PM Micah Kornfield  wrote:
>
> Thanks for the design.   Personally, I'm not a huge fan of creating
> parallel classes for every vector type; this ends up being confusing for
> developers and adds a lot of boilerplate.  I wonder if you could use an
> approach similar to the one the memory module uses for turning bounds
> checking on/off [1].
>
> Also, I think there was a comment on the JIRA, but are there any benchmarks
> to show the expected improvements?  My limited understanding is that for
> small methods the JVM's JIT should inline them anyways [2] , so it is not
> clear how much this will improve performance.
>
>
> Thanks,
> Micah
>
>
> [1]
> https://github.com/apache/arrow/blob/master/java/memory/src/main/java/org/apache/arrow/memory/BoundsChecking.java
> [2]
> https://stackoverflow.com/questions/24923040/do-modern-java-compilers-jvm-inline-functions-methods-which-are-called-exactly-f
>
> On Sun, Apr 28, 2019 at 2:50 AM Fan Liya  wrote:
>
> > Hi all,
> >
> > We are proposing a new set of APIs in Arrow - unsafe vector APIs. The
> > general idea is attached below, and is also accessible from our online
> > document.
> > Please give your valuable comments by commenting directly in our online
> > document, or by replying to this email thread.
> >
> > Thank you so much in advance.
> >
> > Best,
> > Liya Fan
> >
> > Support Fast/Unsafe Vector APIs for Arrow Background
> >
> > In our effort to support columnar data format in Apache Flink, we chose
> > Apache Arrow as the basic data structure. Arrow greatly simplifies the
> > support of the columnar data format. However, for many scenarios, we find
> > the performance unacceptable. Our investigation shows the reason is that
> > there are too many redundant checks and computations in the current Arrow APIs.
> >
> >
> >
> > For example, the following figure shows that a single call to the
> > Float8Vector.get(int) method (one of the most frequently used APIs
> > in Flink computation) involves 20+ method invocations.
> >
> >
> > [image: image.png]
> >
> >
> >
> >
> >
> > There are many other APIs with similar problems. The redundant checks and
> > computations impact performance severely. According to our evaluation, the
> > performance may degrade by two or three orders of magnitude.
> > Our Proposal
> >
> > For many scenarios, the checks can be avoided, if the application
> > developers can guarantee that all checks will pass. So our proposal is to
> > provide some light-weight APIs. The APIs are also named *unsafe APIs*, in
> > the sense that they skip most of the checks (hence not safe) to improve
> > performance.
> >
> >
> >
> > In the light-weight APIs, we only provide minimum checks, or avoid checks
> > at all. The application owner can still develop and debug their code using
> > the original safe APIs. Once all bugs have been fixed, they can switch to
> > unsafe APIs in the final version of their products and enjoy the high
> > performance.
> > Our Design
> >
> > Our goal is to include unsafe vector APIs in the Arrow code base, and to allow
> > our customers to switch to the new unsafe APIs without being aware of it,
> > except for the higher performance. To achieve this goal, we make the
> > following design choices:
> > Vector Class Hierarchy
> >
> > Each unsafe vector is a subclass of the corresponding safe vector. For example, the
> > unsafe Float8Vector is a subclass of org.apache.arrow.vector.Float8Vector:
> >
> >
> >
> > package org.apache.arrow.vector.unsafe;
> >
> >
> >
> > public class Float8Vector extends org.apache.arrow.vector.Float8Vector
> >
> >
> >
> > So the safe vector acts as a façade of the unsafe vector, and through
> > polymorphism, the users may not be aware of which type of vector he/she is
> > working with. In addition, the common logics can be reused in the unsafe
> > vectors, and we only need to override get/set related methods.
> > Vector Creation
> >
> > We use factory methods to create each type of vectors. Compared with
> > vector constructors, the factory methods take one more parameter, the
> > vectorType:
> >
> >
> >
> > public class VectorFactory {
> >
> >   public static Float8Vector createFloat8Vector(VectorType vectorType,
> > String name, BufferAllocator allocator);
> >
> > }
> >
> >
> >
> > VectorType is an enum to separate safe vectors from unsafe ones:
> >
> >
> >
> > public 

[DISCUSS][C++] Static versus variable Arrow dictionary encoding

2019-04-29 Thread Wes McKinney
hi all,

There have been many discussions in passing on various issues and JIRA
tickets over the last months and years about how to manage
dictionary-encoded columnar arrays in-memory in C++. Here's a list of
some problems we have encountered:

* Dictionaries that may differ from one record batch to another, but
represent semantically parts of the same dataset. For example, each
row group in Parquet format may have different dictionaries, and
decoding from encoded to "dense" representation may be undesirable

* Support for dictionary "deltas" in the IPC protocol, using the
isDelta flag on DictionaryBatch [1]. It's conceivable we might want to
allow dictionaries to change altogether in the IPC protocol (where a
dictionary id appears again -- a replacement -- but isDelta = false)

* Receiving a record batch schema without the dictionaries attached
(e.g. in Arrow Flight), see also experimental patch [2]

* Encoded datasets may be produced in a distributed system where
"reconciling" the dictionary is not computationally feasible or
desirable (particularly if the encoded data is just going to be
aggregated)

The reason that these are "problems" has to do with the way that we
represent dictionary encoded data in C++. We have created a
"synthetic" DictionaryType object that is used for Array/Column types
or for an entry in arrow::Schema. The DictionaryType wraps the index
type (signed integer type) and the dictionary itself. So from Python
we have

>>> t = pa.dictionary(pa.int8(), pa.array(['a', 'b', 'c', 'd']))
>>> t
DictionaryType(dictionary)
>>> t.dictionary

[
  "a",
  "b",
  "c",
  "d"
]
>>> t.index_type
DataType(int8)

This is useful in languages like Python because we have the notion of
Categorical types which are a combination of a static dictionary and a
contiguous array of indices.

It creates problems when we have changing dictionaries, because the
"schema" under this in-memory construction may change from record
batch to record batch. This means that types are deemed "unequal"
according to many code paths we've written.
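For concreteness, here is a minimal illustration (using the pyarrow API as it
stands at the time of this discussion, where the dictionary values are part of
DictionaryType) of two batches of the "same" logical column whose dictionaries
differ:

import pyarrow as pa

a1 = pa.DictionaryArray.from_arrays(pa.array([0, 1, 0], type=pa.int8()),
                                    pa.array(["low", "high"]))
a2 = pa.DictionaryArray.from_arrays(pa.array([0, 1, 1], type=pa.int8()),
                                    pa.array(["low", "medium"]))

# Because the dictionary is embedded in the type, these two types are deemed
# unequal even though both are logically dictionary<indices=int8, values=string>.
print(a1.type)
print(a2.type)
print(a1.type.equals(a2.type))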

To consider solutions to this problem, I want to first point out that
the way we are dealing with dictionary-encoded data in memory is a
purely semantic construct for C++ and the binding languages.
"Dictionary" is not a data type at all according to the Arrow IPC
protocol -- it is a method of transferring encoded / compressed data,
and the handling thereof is left to the implementations. There are
benefits to the method we are using now, in particular it makes
dynamic dispatch (including the visitor pattern, and virtual
functions) based on whether something is encoded or not simple. It
also leads to simple round trips of Categorical types from libraries
like pandas.

Here is my proposal to reconcile these issues in C++

* Add a new "synthetic" data type called "variable dictionary" to be
used alongside the existing "static dictionary" type. An instance of
VariableDictionaryType (name TBD) will not know what the dictionary
is, only the data type of the dictionary (e.g. utf8()) and the index
type (e.g. int32())

* Define common abstract API for instances of static vs variable
dictionary arrays. Mainly this means making
DictionaryArray::dictionary [3] virtual

* The _actual_ dictionary values for a particular Array must be stored
somewhere and lifetime managed. I propose to put these as a single
entry in ArrayData::child_data [4]. An alternative to this would be to
modify ArrayData to have a dictionary field that would be unused
except for encoded datasets

This proposal does create some ongoing implementation and maintenance
burden, but to that I would make these points:

* Many algorithms will dispatch from one type to the other (probably
static dispatching to the variable path), so there will not be a need
to implement multiple times in most cases

* In some algorithms, we may observe a stream of dictionary encoded
arrays, and we need only obtain the current dictionary as well as the
knowledge of whether it is the same as previous dictionaries. In hash
aggregations and other analytics I think we need to implement by
default under the assumption of dynamic/variable dictionaries

I haven't conceived of any other ideas (after much contemplation) how
to algebraically accommodate these use cases in our object model so
interested in the opinions of others. As a first use case for this I
would be personally looking to address reads of encoded data from
Parquet format without an intermediate pass through dense format
(which can be slow and wasteful for heavily compressed string data)

Thanks,
Wes

[1]: 
https://github.com/apache/arrow/blob/master/docs/source/format/IPC.rst#dictionary-batches
[2]: https://github.com/apache/arrow/pull/4067
[3]: https://github.com/apache/arrow/blob/master/cpp/src/arrow/array.h#L968
[4]: https://github.com/apache/arrow/blob/master/cpp/src/arrow/array.h#L208


Re: [DISCUSS][JAVA]Support Fast/Unsafe Vector APIs for Arrow

2019-04-29 Thread Micah Kornfield
Thanks for the design.   Personally, I'm not a huge fan of creating
parallel classes for every vector type; this ends up being confusing for
developers and adds a lot of boilerplate.  I wonder if you could use an
approach similar to the one the memory module uses for turning bounds
checking on/off [1].

Also, I think there was a comment on the JIRA, but are there any benchmarks
to show the expected improvements?  My limited understanding is that for
small methods the JVM's JIT should inline them anyways [2] , so it is not
clear how much this will improve performance.
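For illustration, here is a minimal, language-agnostic sketch (written in
Python purely for brevity; the real mechanism referred to is the Java
BoundsChecking flag linked below as [1]) of the single-switch alternative to
parallel safe/unsafe classes:

BOUNDS_CHECKING_ENABLED = True  # analogous to a flag read once at startup


class Float8Vector:
    def __init__(self, values):
        self._values = list(values)

    def get(self, index):
        # The check is guarded by one global switch instead of living in a
        # separate "unsafe" subclass.
        if BOUNDS_CHECKING_ENABLED and not 0 <= index < len(self._values):
            raise IndexError("index %d out of bounds" % index)
        return self._values[index]


v = Float8Vector([1.0, 2.0, 3.0])
print(v.get(1))  # 2.0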


Thanks,
Micah


[1]
https://github.com/apache/arrow/blob/master/java/memory/src/main/java/org/apache/arrow/memory/BoundsChecking.java
[2]
https://stackoverflow.com/questions/24923040/do-modern-java-compilers-jvm-inline-functions-methods-which-are-called-exactly-f

On Sun, Apr 28, 2019 at 2:50 AM Fan Liya  wrote:

> Hi all,
>
> We are proposing a new set of APIs in Arrow - unsafe vector APIs. The
> general idea is attached below, and is also accessible from our online
> document.
> Please give your valuable comments by commenting directly in our online
> document, or by replying to this email thread.
>
> Thank you so much in advance.
>
> Best,
> Liya Fan
>
> Support Fast/Unsafe Vector APIs for Arrow Background
>
> In our effort to support columnar data format in Apache Flink, we chose
> Apache Arrow as the basic data structure. Arrow greatly simplifies the
> support of the columnar data format. However, for many scenarios, we find
> the performance unacceptable. Our investigation shows the reason is that
> there are too many redundant checks and computations in the current Arrow APIs.
>
>
>
> For example, the following figure shows that a single call to the
> Float8Vector.get(int) method (one of the most frequently used APIs
> in Flink computation) involves 20+ method invocations.
>
>
> [image: image.png]
>
>
>
>
>
> There are many other APIs with similar problems. The redundant checks and
> computations impact performance severely. According to our evaluation, the
> performance may degrade by two or three orders of magnitude.
> Our Proposal
>
> For many scenarios, the checks can be avoided, if the application
> developers can guarantee that all checks will pass. So our proposal is to
> provide some light-weight APIs. The APIs are also named *unsafe APIs*, in
> the sense that they skip most of the checks (hence not safe) to improve
> performance.
>
>
>
> In the light-weight APIs, we only provide minimal checks, or avoid checks
> altogether. The application owner can still develop and debug their code using
> the original safe APIs. Once all bugs have been fixed, they can switch to
> unsafe APIs in the final version of their products and enjoy the high
> performance.
> Our Design
>
> Our goal is to include unsafe vector APIs in the Arrow code base, and to allow
> our customers to switch to the new unsafe APIs without being aware of it,
> except for the higher performance. To achieve this goal, we make the
> following design choices:
> Vector Class Hierarchy
>
> Each unsafe vector is a subclass of the corresponding safe vector. For example, the
> unsafe Float8Vector is a subclass of org.apache.arrow.vector.Float8Vector:
>
>
>
> package org.apache.arrow.vector.unsafe;
>
>
>
> public class Float8Vector extends org.apache.arrow.vector.Float8Vector
>
>
>
> So the safe vector acts as a façade of the unsafe vector, and through
> polymorphism, users may not be aware of which type of vector they are
> working with. In addition, the common logic can be reused in the unsafe
> vectors, and we only need to override the get/set related methods.
> Vector Creation
>
> We use factory methods to create each type of vectors. Compared with
> vector constructors, the factory methods take one more parameter, the
> vectorType:
>
>
>
> public class VectorFactory {
>
>   public static Float8Vector createFloat8Vector(VectorType vectorType,
> String name, BufferAllocator allocator);
>
> }
>
>
>
> VectorType is an enum to separate safe vectors from unsafe ones:
>
>
>
> public enum VectorType {
>
>   SAFE,
>
>   UNSAFE
>
> }
>
>
>
> With the factory methods, the old way of creating vectors by constructors
> can be gradually deprecated.
> Vector Implementation
>
> As discussed above, unsafe vectors mainly override get/set methods. For
> get methods, we directly operate on the off-heap memory, without any check:
>
>
>
> public double get(int index) {
>
> return
> Double.longBitsToDouble(PlatformDependent.getLong(valueBuffer.memoryAddress()
> + (index << TYPE_LOG2_WIDTH)));
>
> }
>
>
>
> Note that the PlatformDependent API is only 2 stack layers above the
> underlying UNSAFE method call.
>
>
>
> For set methods, we still need to set the validity bit. However, this is
> through an unsafe method 

[jira] [Created] (ARROW-5238) [Python] Improve usability of pyarrow.dictionary function

2019-04-29 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-5238:
---

 Summary: [Python] Improve usability of pyarrow.dictionary function
 Key: ARROW-5238
 URL: https://issues.apache.org/jira/browse/ARROW-5238
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney
 Fix For: 0.14.0


{code}
>>> pa.dictionary('int8', ['a', 'b', 'c', 'd'])
Traceback (most recent call last):
  File "", line 1, in 
TypeError: Argument 'index_type' has incorrect type (expected 
pyarrow.lib.DataType, got str)
>>> pa.dictionary(pa.int8(), ['a', 'b', 'c', 'd'])
Traceback (most recent call last):
  File "", line 1, in 
TypeError: Argument 'dict_values' has incorrect type (expected 
pyarrow.lib.Array, got list)
{code}
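A sketch of the desired, more permissive behavior (the coercion shown in the
comments is hypothetical; only the explicit form is assumed to work today):

{code:python}
import pyarrow as pa

# Explicit form that works today.
t = pa.dictionary(pa.int8(), pa.array(['a', 'b', 'c', 'd']))
print(t)

# Desired: accept a type name and a plain Python list and coerce them, e.g.
#   pa.dictionary('int8', ['a', 'b', 'c', 'd'])
# producing the same DictionaryType as above.
{code}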



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [DISCUSS] C++ Filesystem abstraction

2019-04-29 Thread Wes McKinney
hi Antoine,

Thank you for starting this discussion.

I left some comments on the PR. I had been looking previously at
TensorFlow's file system APIs ([1], and various implementations) for
some possible guidance around this, though since Arrow is intended as
development platform / reusable set of libraries our use cases are a
bit more general purpose than TF.

To Romain and R folks and Kou and the Ruby folks, it would be great to
get your feedback on this as well since you can make use of this
functionality in R, C GLib, and Ruby.

- Wes

[1] 
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/platform/file_system.h

On Mon, Apr 29, 2019 at 11:26 AM Antoine Pitrou  wrote:
>
>
> Hello,
>
> For the datasets project (*), one requirement is for Arrow to grow a
> filesystem abstraction.  The aim is to access various kinds of storage
> systems (local filesystem, S3, HadoopFS...) with a single API.
> Hopefully, the API can be made good enough to avoid inefficiencies.
>
> I've pushed a draft PR with a simple API proposal in:
> https://github.com/apache/arrow/pull/4225
>
> This PR is meant as a starting point for discussion.  If you have any
> insight or experience on the subject, please review and give
> suggestions / comments.
>
> (*)https://docs.google.com/document/d/1DCPwA6gF-Uy-rlHoVL60j-I-b1L7n1aqKLie2L3U50k/edit
>
> Regards
>
> Antoine.
>
>


[DISCUSS] C++ Filesystem abstraction

2019-04-29 Thread Antoine Pitrou


Hello,

For the datasets project (*), one requirement is for Arrow to grow a
filesystem abstraction.  The aim is to access various kinds of storage
systems (local filesystem, S3, HadoopFS...) with a single API.
Hopefully, the API can be made good enough to avoid inefficiencies.
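As a rough illustration of the kind of abstraction meant here, a minimal,
hypothetical sketch (in Python; the names are illustrative and are not the API
proposed in the draft PR below):

import os
from abc import ABC, abstractmethod
from typing import BinaryIO, List


class FileSystem(ABC):
    """A unified view over different storage backends."""

    @abstractmethod
    def list_dir(self, path: str) -> List[str]:
        """Return the entries under a directory-like path."""

    @abstractmethod
    def open_input_stream(self, path: str) -> BinaryIO:
        """Open a file for sequential binary reading."""


class LocalFileSystem(FileSystem):
    def list_dir(self, path: str) -> List[str]:
        return sorted(os.listdir(path))

    def open_input_stream(self, path: str) -> BinaryIO:
        return open(path, "rb")


# An S3 or HDFS implementation would provide the same two methods, so dataset
# readers can be written against FileSystem alone.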

I've pushed a draft PR with a simple API proposal in:
https://github.com/apache/arrow/pull/4225

This PR is meant as a starting point for discussion.  If you have any
insight or experience on the subject, please review and give
suggestions / comments.

(*)https://docs.google.com/document/d/1DCPwA6gF-Uy-rlHoVL60j-I-b1L7n1aqKLie2L3U50k/edit

Regards

Antoine.




[jira] [Created] (ARROW-5237) [Python] pandas_version key in pandas metadata no longer populated

2019-04-29 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-5237:


 Summary: [Python] pandas_version key in pandas metadata no longer 
populated
 Key: ARROW-5237
 URL: https://issues.apache.org/jira/browse/ARROW-5237
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.13.0
Reporter: Joris Van den Bossche
Assignee: Joris Van den Bossche
 Fix For: 0.14.0


While looking at the pandas metadata, I noticed that the {{pandas_version}}
field is now None. I suppose this is due to the recent refactoring of the
pandas API compat (https://github.com/apache/arrow/pull/3893). PR coming.
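A quick way to observe the field in question (using public pyarrow/pandas APIs):

{code:python}
import json

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"a": [1, 2, 3]})
table = pa.Table.from_pandas(df)
meta = json.loads(table.schema.metadata[b"pandas"].decode("utf-8"))
# Expected: the installed pandas version string; with this bug it is None.
print(meta["pandas_version"])
{code}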



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5236) hdfs.connect() is trying to load libjvm in windows

2019-04-29 Thread Kamaraju (JIRA)
Kamaraju created ARROW-5236:
---

 Summary: hdfs.connect() is trying to load libjvm in windows
 Key: ARROW-5236
 URL: https://issues.apache.org/jira/browse/ARROW-5236
 Project: Apache Arrow
  Issue Type: Bug
 Environment: Windows 7 Enterprise, pyarrow 0.13.0
Reporter: Kamaraju


This issue was originally reported at 
[https://github.com/apache/arrow/issues/4215] . Raising a Jira as per Wes 
McKinney's request.

Summary:
 The following script
{code}
$ cat expt2.py
import pyarrow as pa
fs = pa.hdfs.connect()
{code}
tries to load libjvm in windows 7 which is not expected.
{noformat}
$ python ./expt2.py
Traceback (most recent call last):
  File "./expt2.py", line 3, in 
fs = pa.hdfs.connect()
  File 
"C:\ProgramData\Continuum\Anaconda\envs\scratch_py36_pyarrow\lib\site-packages\pyarrow\hdfs.py",
 line 183, in connect
extra_conf=extra_conf)
  File 
"C:\ProgramData\Continuum\Anaconda\envs\scratch_py36_pyarrow\lib\site-packages\pyarrow\hdfs.py",
 line 37, in __init__
self._connect(host, port, user, kerb_ticket, driver, extra_conf)
  File "pyarrow\io-hdfs.pxi", line 89, in pyarrow.lib.HadoopFileSystem._connect
  File "pyarrow\error.pxi", line 83, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Unable to load libjvm
{noformat}
There is no libjvm file in Windows Java installation.
{noformat}
$ echo $JAVA_HOME
C:\Progra~1\Java\jdk1.8.0_141

$ find $JAVA_HOME -iname '*libjvm*'

{noformat}
I see the libjvm error with both 0.11.1 and 0.13.0 versions of pyarrow.

Steps to reproduce the issue (with more details):

Create the environment
{noformat}
$ cat scratch_py36_pyarrow.yml
name: scratch_py36_pyarrow
channels:
  - defaults
dependencies:
  - python=3.6.8
  - pyarrow
{noformat}
{noformat}
$ conda env create -f scratch_py36_pyarrow.yml
{noformat}
Apply the following patch to lib/site-packages/pyarrow/hdfs.py . I had to do 
this since the Hadoop installation that comes with MapR <[https://mapr.com/]> 
windows client only has $HADOOP_HOME/bin/hadoop.cmd . There is no file named 
$HADOOP_HOME/bin/hadoop and so the subsequent subprocess.check_output call 
fails with FileNotFoundError if this patch is not applied.
{noformat}
$ cat ~/x/patch.txt
131c131
< hadoop_bin = '{0}/bin/hadoop'.format(os.environ['HADOOP_HOME'])
---
> hadoop_bin = '{0}/bin/hadoop.cmd'.format(os.environ['HADOOP_HOME'])

$ patch 
/c/ProgramData/Continuum/Anaconda/envs/scratch_py36_pyarrow/lib/site-packages/pyarrow/hdfs.py
 ~/x/patch.txt
patching file 
/c/ProgramData/Continuum/Anaconda/envs/scratch_py36_pyarrow/lib/site-packages/pyarrow/hdfs.py
{noformat}
Activate the environment
{noformat}
$ source activate scratch_py36_pyarrow
{noformat}
Sample script
{noformat}
$ cat expt2.py
import pyarrow as pa
fs = pa.hdfs.connect()
{noformat}
Execute the script
{noformat}
$ python ./expt2.py
Traceback (most recent call last):
  File "./expt2.py", line 3, in 
fs = pa.hdfs.connect()
  File 
"C:\ProgramData\Continuum\Anaconda\envs\scratch_py36_pyarrow\lib\site-packages\pyarrow\hdfs.py",
 line 183, in connect
extra_conf=extra_conf)
  File 
"C:\ProgramData\Continuum\Anaconda\envs\scratch_py36_pyarrow\lib\site-packages\pyarrow\hdfs.py",
 line 37, in __init__
self._connect(host, port, user, kerb_ticket, driver, extra_conf)
  File "pyarrow\io-hdfs.pxi", line 89, in pyarrow.lib.HadoopFileSystem._connect
  File "pyarrow\error.pxi", line 83, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Unable to load libjvm
{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5235) [C++] RAPIDJSON_INCLUDE_DIR not set (Windows + Anaconda)

2019-04-29 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-5235:
-

 Summary: [C++] RAPIDJSON_INCLUDE_DIR not set (Windows + Anaconda)
 Key: ARROW-5235
 URL: https://issues.apache.org/jira/browse/ARROW-5235
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Antoine Pitrou


I'm trying to build Arrow in debug mode on Windows with some dependencies 
installed via conda. Unfortunately I ultimately get the following error:
{code}
[...]
-- RapidJSON found. Headers: C:/Miniconda3/envs/arrow/Library/include
[...]
-- Could NOT find Backtrace (missing: Backtrace_LIBRARY Backtrace_INCLUDE_DIR)
CMake Error: The following variables are used in this project, but they are set
to NOTFOUND.
Please set them or make sure they are set and tested correctly in the CMake file
s:
RAPIDJSON_INCLUDE_DIR
   used as include directory in directory C:/t/arrow/cpp
[ etc. ]
{code}

RapidJSON 1.1.0 is installed from Anaconda.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5234) [Rust] [DataFusion] Create Python bindings for DataFusion

2019-04-29 Thread Andy Grove (JIRA)
Andy Grove created ARROW-5234:
-

 Summary: [Rust] [DataFusion] Create Python bindings for DataFusion
 Key: ARROW-5234
 URL: https://issues.apache.org/jira/browse/ARROW-5234
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust, Rust - DataFusion
Reporter: Andy Grove
 Fix For: 0.14.0


As a Python developer, I would like to be able to run SQL queries using 
DataFusion as a native Python module.

This request actually came in from a Reddit user, and according to
[https://github.com/PyO3/pyo3] it should be simple to achieve.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5233) [Go] migrate to new flatbuffers-v0.11.0

2019-04-29 Thread Sebastien Binet (JIRA)
Sebastien Binet created ARROW-5233:
--

 Summary: [Go] migrate to new flatbuffers-v0.11.0
 Key: ARROW-5233
 URL: https://issues.apache.org/jira/browse/ARROW-5233
 Project: Apache Arrow
  Issue Type: Bug
  Components: Go
Reporter: Sebastien Binet
Assignee: Sebastien Binet


migrating to v0.11.0 improves the generated Go code (better handling of 
booleans and enums)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5232) [Java] value vector size increases rapidly in case of clear/setSafe loop

2019-04-29 Thread Pindikura Ravindra (JIRA)
Pindikura Ravindra created ARROW-5232:
-

 Summary: [Java] value vector size increases rapidly in case of 
clear/setSafe loop
 Key: ARROW-5232
 URL: https://issues.apache.org/jira/browse/ARROW-5232
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Pindikura Ravindra
Assignee: Pindikura Ravindra






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5231) [Java] Arrow Java can't read union vector from ArrowStreamReader written by its own bugs

2019-04-29 Thread Chaokun Yang (JIRA)
Chaokun Yang created ARROW-5231:
---

 Summary: [Java] Arrow Java can't  read union vector from 
ArrowStreamReader written by its own bugs
 Key: ARROW-5231
 URL: https://issues.apache.org/jira/browse/ARROW-5231
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
 Environment: Mac OS 10.13.6, Arrow 0.13.0, JDK8
Reporter: Chaokun Yang


Similar to https://issues.apache.org/jira/browse/ARROW-5230, when I write
union data using ArrowStreamWriter in Java, I can't read it back using
ArrowStreamReader in Java. The exception is:
{quote}Exception in thread "main" java.lang.IllegalArgumentException: not all 
nodes and buffers were consumed. nodes: [ArrowFieldNode [length=100, 
nullCount=0]] buffers: [ArrowBuf[14], udle: [7 104..117], ArrowBuf[15], udle: 
[7 120..520]]
 at org.apache.arrow.vector.VectorLoader.load(VectorLoader.java:64)
 at 
org.apache.arrow.vector.ipc.ArrowReader.loadRecordBatch(ArrowReader.java:219)
 at 
org.apache.arrow.vector.ipc.ArrowStreamReader.loadNextBatch(ArrowStreamReader.java:121)
{quote}
The code to reproduce this exception is:

 
{code:java}
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.FieldVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.complex.UnionVector;
import org.apache.arrow.vector.dictionary.DictionaryProvider;
import org.apache.arrow.vector.holders.NullableIntHolder;
import org.apache.arrow.vector.ipc.ArrowStreamReader;
import org.apache.arrow.vector.ipc.ArrowStreamWriter;
import org.apache.arrow.vector.types.UnionMode;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;
import org.apache.arrow.vector.types.pojo.Schema;

import java.io.ByteArrayInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.List;

public class UnionTest {

public static void writeUnionBatch(OutputStream os) throws IOException {
int[] typeIds = new int[]{
ArrowType.ArrowTypeID.Int.ordinal()};
ArrowType.Union union = new ArrowType.Union(UnionMode.Sparse, typeIds);
Field field = new Field("f1", FieldType.nullable(union), null);
List<Field> fields = Collections.singletonList(field);
Schema schema = new Schema(fields);
VectorSchemaRoot root = VectorSchemaRoot.create(schema, new 
RootAllocator(Integer.MAX_VALUE));
DictionaryProvider.MapDictionaryProvider provider = new 
DictionaryProvider.MapDictionaryProvider();
ArrowStreamWriter writer = new ArrowStreamWriter(root, provider, os);
writer.start();
for (int i = 0; i < 2; i++) {
root.setRowCount(100);
List<FieldVector> vectors = root.getFieldVectors();
UnionVector vector = (UnionVector) vectors.get(0);
fillVector(vector, 100);
for (int j = 0; j < 100; j++) {
if (!vector.isNull(j)) {
System.out.println(vector.getObject(j));
}
}
writer.writeBatch();
}
writer.end();
writer.close();
}

private static void fillVector(UnionVector vector, int batchSize) {
vector.setInitialCapacity(batchSize);
vector.allocateNew();
for (int i = 0; i < batchSize; i++) {
NullableIntHolder intHolder = new NullableIntHolder();
intHolder.isSet = 1;
intHolder.value = i;
vector.setSafe(i, intHolder);
}
vector.setValueCount(batchSize);
}


public static void main(String[] args) throws IOException {
try(FileOutputStream fos = new FileOutputStream("result/union.arrow")) {
writeUnionBatch(fos);
System.out.println("write succeed");
fos.flush();
}

RootAllocator allocator = new RootAllocator(10);
ByteArrayInputStream in = new 
ByteArrayInputStream(Files.readAllBytes(Paths.get("result/union.arrow")));
ArrowStreamReader reader = new ArrowStreamReader(in, allocator);
reader.loadNextBatch();
}
}

{code}
It also can't read union data generated by Python, as reported in
https://issues.apache.org/jira/browse/ARROW-1692.

It seems strange that Arrow Java can't read union data generated by itself. Is
there any format gap between the Arrow Java UnionVector write and read paths?

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5230) [Java] Read Struct Vector from ArrowStreamReader bugs

2019-04-29 Thread Chaokun Yang (JIRA)
Chaokun Yang created ARROW-5230:
---

 Summary: [Java] Read Struct Vector from ArrowStreamReader bugs
 Key: ARROW-5230
 URL: https://issues.apache.org/jira/browse/ARROW-5230
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
 Environment: Mac OS 10.13.6, Arrow 0.13.0, JDK8
Reporter: Chaokun Yang


After writing a struct vector to a file using ArrowStreamWriter, reading it back
using ArrowStreamReader throws an exception:
{quote}Exception in thread "main" java.lang.IllegalArgumentException: not all 
nodes and buffers were consumed. nodes: [ArrowFieldNode [length=100, 
nullCount=0], ArrowFieldNode [length=100, nullCount=0]] buffers: [ArrowBuf[26], 
udle: [11 16..29], ArrowBuf[27], udle: [11 32..432], ArrowBuf[28], udle: [11 
432..445], ArrowBuf[29], udle: [11 448..848]]
 at org.apache.arrow.vector.VectorLoader.load(VectorLoader.java:64)
 at 
org.apache.arrow.vector.ipc.ArrowReader.loadRecordBatch(ArrowReader.java:219)
 at 
org.apache.arrow.vector.ipc.ArrowStreamReader.loadNextBatch(ArrowStreamReader.java:121)
{quote}
Here's the code to reproduce this exception:
{code:java}

import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.FieldVector;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.complex.StructVector;
import org.apache.arrow.vector.dictionary.DictionaryProvider;
import org.apache.arrow.vector.ipc.ArrowStreamReader;
import org.apache.arrow.vector.ipc.ArrowStreamWriter;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;
import org.apache.arrow.vector.types.pojo.Schema;

import java.io.ByteArrayInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

public class StructTest {

public static void writeBatch(OutputStream os) throws IOException {
List<Field> fields = Collections.singletonList(new Field("f-Struct(Int, 
Int)", FieldType.nullable(ArrowType.Struct.INSTANCE), null));
Schema schema = new Schema(fields);
VectorSchemaRoot root = VectorSchemaRoot.create(schema, new 
RootAllocator(Integer.MAX_VALUE));
DictionaryProvider.MapDictionaryProvider provider = new 
DictionaryProvider.MapDictionaryProvider();
ArrowStreamWriter writer = new ArrowStreamWriter(root, provider, os);
writer.start();
for (int i = 0; i < 2; i++) {
root.setRowCount(100);
List<FieldVector> vectors = root.getFieldVectors();
StructVector vector = (StructVector) vectors.get(0);
fillVector(vector, 100);
for (int j = 0; j < 100; j++) {
if (!vector.isNull(j)) {
System.out.println(vector.getObject(j));
}
}
writer.writeBatch();
}
writer.end();
writer.close();
}

public static void fillVector(StructVector vector, int batchSize) {
vector.setInitialCapacity(batchSize);
vector.allocateNew();
vector.addOrGet("s1", FieldType.nullable(new ArrowType.Int(32, true)), 
IntVector.class);
vector.addOrGet("s2", FieldType.nullable(new ArrowType.Int(32, true)), 
IntVector.class);
fillVector((IntVector)(vector.getChild("s1")), batchSize);
fillVector((IntVector) (vector.getChild("s2")), batchSize);
for (int i = 0; i < batchSize; i++) {
vector.setIndexDefined(i);
}
vector.setValueCount(batchSize);
}

public static void fillVector(IntVector vector, int batchSize) {
vector.setInitialCapacity(batchSize);
vector.allocateNew();
for (int i = 0; i < batchSize; i++) {
vector.setSafe(i, 1, ThreadLocalRandom.current().nextInt());
}
vector.setValueCount(batchSize);
}

public static void main(String[] args) throws IOException {
try (FileOutputStream fos = new 
FileOutputStream("result/struct.arrow")) {
writeBatch(fos);
System.out.println("write succeed");
fos.flush();
}

RootAllocator allocator = new RootAllocator(10);

ByteArrayInputStream in = new 
ByteArrayInputStream(Files.readAllBytes(Paths.get("result/struct.arrow")));

ArrowStreamReader reader = new ArrowStreamReader(in, allocator);

reader.loadNextBatch();

}
}
{code}
If I create struct record batches in Python, Java can read them back:

Write data:

{code:python}
def make_struct(path, batch_size=200, num_batch=2):
obj = get_struct_obj(batch_size)
batch = pa.RecordBatch.from_arrays([obj], ['fo'])
writer = pa.RecordBatchStreamWriter(path,