[jira] [Created] (ARROW-5550) [C++] Refactor Buffers method on concatenate to consolidate code.

2019-06-10 Thread Micah Kornfield (JIRA)
Micah Kornfield created ARROW-5550:
--

 Summary: [C++] Refactor Buffers method on concatenate to 
consolidate code.
 Key: ARROW-5550
 URL: https://issues.apache.org/jira/browse/ARROW-5550
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Micah Kornfield


See https://github.com/apache/arrow/pull/4498/files for reference.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5549) [C++][Docs] Summarize function argument type guidelines in developers/cpp.rst

2019-06-10 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-5549:
---

 Summary: [C++][Docs] Summarize function argument type guidelines 
in developers/cpp.rst
 Key: ARROW-5549
 URL: https://issues.apache.org/jira/browse/ARROW-5549
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
Assignee: Wes McKinney
 Fix For: 0.14.0


We have a number of spoken and unspoken guidelines around argument passing -- 
some of them are codified in the Google style guide while others (e.g. use of 
smart pointers as function arguments) are applied via convention and enforced 
in code reviews. 

I propose to add a section to make each case explicit so that our code can 
become more hygienic and our code reviews less intense.





[jira] [Created] (ARROW-5548) [Documentation] http://arrow.apache.org/docs/latest/ is not latest

2019-06-10 Thread Neal Richardson (JIRA)
Neal Richardson created ARROW-5548:
--

 Summary: [Documentation] http://arrow.apache.org/docs/latest/ is 
not latest
 Key: ARROW-5548
 URL: https://issues.apache.org/jira/browse/ARROW-5548
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation, Website
Reporter: Neal Richardson
Assignee: Neal Richardson
 Fix For: 0.14.0


In testing out the Dockerfile for building the docs, I noticed it created an 
asf-site/docs/latest directory at the end. Out of curiosity, I went to 
[http://arrow.apache.org/docs/latest/], and it reports a version of 
{{0.11.1.dev473+g6ed02454}}, which is not close to "latest".

I'd like to see this "latest" site get updated automatically. I'm working on 
getting this Docker setup complete (cf. 
https://issues.apache.org/jira/browse/ARROW-5497), and once that's working, it 
should be feasible to add a Travis-CI job to update /docs/latest on every 
commit to master to apache/arrow. 

cc [~wesmckinn]





Re: [DISCUSS] 32- and 64-bit decimal types

2019-06-10 Thread Wes McKinney
On Mon, Jun 10, 2019 at 4:18 PM Wes McKinney  wrote:
>
> On the 1.0.0 protocol discussion, one item that we've skirted for some
> time is other decimal sizes:
>
> https://issues.apache.org/jira/browse/ARROW-2009
>
> I understand this is a loaded subject since a deliberate decision was
> made to remove types from the initial Java implementation of Arrow
> that was forked from Apache Drill. However, it's a friction point that
> has come up in a number of scenarios as many database and storage
> systems have 32- and 64-bit variants for low precision decimal data.
> As an example Apache Kudu [1] has all three types, and the Parquet
> columnar format allows not only 32/64 bit storage but fixed size
> binary (size a function of precision) and variable-length binary
> encoding [2].
>
> One of the arguments against using these types in a computational
> setting is that many mathematical operations will necessarily trigger
> an up-promotion to a larger type. It's hard for us to predict how
> people will use the Arrow format, though, and the current situation is
> forcing an up-promotion regardless of how the format is being used,
> even for simple data transport
>
> In anticipation of long-term needs, I would suggest a possible solution of:
>
> * Adding bitWidth field to Decimal table in Schema.fbs [3] with
> default value of 128
> * Constraining bit widths to 32, 64, and 128 bits for the time being
> * Permit storage of smaller precision decimals in larger storage like
> we have now

BTW, even if we do not allow 32/64 bit decimals in the format, we
should consider adding a bitWidth field with static value 128 as a
matter of future-proofing the metadata. This change would make it so
that old readers are unable to see the bitWidth field, so the addition
would not be possible without bumping the protocol version.

>
> If this isn't deemed desirable by the community, decimal extension
> types could be employed for serialization-free transport for smaller
> decimals, but I view this as suboptimal.
>
> Interested in the thoughts of others.
>
> thanks
> Wes
>
> [1]: 
> https://github.com/apache/kudu/blob/master/src/kudu/common/common.proto#L55
> [2]: 
> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal
> [3]: https://github.com/apache/arrow/blob/master/format/Schema.fbs#L121


Re: [DISCUSS] Timing of release and making a 1.0.0 release marking Arrow protocol stability

2019-06-10 Thread Jacques Nadeau
Sounds good.

On Mon, Jun 10, 2019 at 11:06 AM Wes McKinney  wrote:

> Hi all,
>
> OK, it sounds like there is reasonable consensus behind the plan:
>
> * Make a 0.14.0 release in the near future (later this month?)
> * Publicize that the next release will be 1.0.0, in a "speak now or
> hold your peace" fashion
> * Release 1.0.0 as following release. I would suggest not waiting too
> long, so late August / early September time frame
>
> I'm going to continue grooming the 0.14.0 backlog to help refine the
> scope of what still needs to be done for C++/Python to get the next
> release out. If the stakeholders in various project subcomponents
> could also groom the backlog and mark any blockers, that would be very
> helpful.
>
> I suggest shooting for a release candidate for 0.14.0 either the week
> of June 24 or July 1 (depending on where things stand)
>
> Thanks
> Wes
>

[DISCUSS] 32- and 64-bit decimal types

2019-06-10 Thread Wes McKinney
On the 1.0.0 protocol discussion, one item that we've skirted for some
time is other decimal sizes:

https://issues.apache.org/jira/browse/ARROW-2009

I understand this is a loaded subject since a deliberate decision was
made to remove types from the initial Java implementation of Arrow
that was forked from Apache Drill. However, it's a friction point that
has come up in a number of scenarios as many database and storage
systems have 32- and 64-bit variants for low precision decimal data.
As an example Apache Kudu [1] has all three types, and the Parquet
columnar format allows not only 32/64 bit storage but fixed size
binary (size a function of precision) and variable-length binary
encoding [2].

One of the arguments against using these types in a computational
setting is that many mathematical operations will necessarily trigger
an up-promotion to a larger type. It's hard for us to predict how
people will use the Arrow format, though, and the current situation is
forcing an up-promotion regardless of how the format is being used,
even for simple data transport

In anticipation of long-term needs, I would suggest a possible solution of:

* Adding bitWidth field to Decimal table in Schema.fbs [3] with
default value of 128
* Constraining bit widths to 32, 64, and 128 bits for the time being
* Permit storage of smaller precision decimals in larger storage like
we have now
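
The three bullets above can be sketched roughly as follows. This is only an
illustration of the arithmetic, not code from Schema.fbs or any Arrow
library: the names ALLOWED_BIT_WIDTHS, max_precision and smallest_bit_width
are hypothetical, and the sketch assumes decimals are stored as signed
two's-complement integers.

```python
# Hypothetical helper names; nothing here is part of an Arrow API.
ALLOWED_BIT_WIDTHS = (32, 64, 128)


def max_precision(bit_width: int) -> int:
    """Largest digit count p such that every p-digit value fits in the width."""
    if bit_width not in ALLOWED_BIT_WIDTHS:
        raise ValueError(f"unsupported decimal bit width: {bit_width}")
    largest = 2 ** (bit_width - 1) - 1  # largest signed value
    # The largest value is never of the form 99...9, so only values with
    # one digit fewer than it are guaranteed to fit.
    return len(str(largest)) - 1


def smallest_bit_width(precision: int) -> int:
    """Narrowest allowed width that can hold a decimal of this precision."""
    if precision < 1:
        raise ValueError("precision must be at least 1")
    for width in ALLOWED_BIT_WIDTHS:
        if precision <= max_precision(width):
            return width
    raise ValueError(f"precision {precision} does not fit in 128 bits")
```

Under these assumptions a writer could still choose 128-bit storage for a
low-precision column (the third bullet); smallest_bit_width only reports the
narrowest width that would be permitted.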

If this isn't deemed desirable by the community, decimal extension
types could be employed for serialization-free transport for smaller
decimals, but I view this as suboptimal.

Interested in the thoughts of others.

thanks
Wes

[1]: https://github.com/apache/kudu/blob/master/src/kudu/common/common.proto#L55
[2]: 
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal
[3]: https://github.com/apache/arrow/blob/master/format/Schema.fbs#L121


[jira] [Created] (ARROW-5547) [C++][Flight] arrow-flight.pc isn't provided

2019-06-10 Thread Sutou Kouhei (JIRA)
Sutou Kouhei created ARROW-5547:
---

 Summary: [C++][Flight] arrow-flight.pc isn't provided
 Key: ARROW-5547
 URL: https://issues.apache.org/jira/browse/ARROW-5547
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, FlightRPC
Reporter: Sutou Kouhei








[jira] [Created] (ARROW-5546) [C#] Remove IArrowArray and use Array base class.

2019-06-10 Thread Eric Erhardt (JIRA)
Eric Erhardt created ARROW-5546:
---

 Summary: [C#] Remove IArrowArray and use Array base class.
 Key: ARROW-5546
 URL: https://issues.apache.org/jira/browse/ARROW-5546
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C#
Affects Versions: 0.13.0
Reporter: Eric Erhardt


In .NET libraries, we have historically favored classes (abstract or otherwise) 
over interfaces. See [Choosing Between Classes and 
Interfaces|https://docs.microsoft.com/en-us/previous-versions/dotnet/netframework-4.0/ms229013(v%3dvs.100)].
 The main reasoning is that you can add members to a class over time, but once 
you ship an interface, it can never be changed. You can only add new interfaces.

 In light of this, we should remove the IArrowArray interface, and instead just 
use the base `Array` class as the abstraction for all Arrow Arrays.

As part of this, we should also consider renaming `Array` because it conflicts 
with the System.Array type. Instead we should consider naming it `ArrowArray` 
to make it unique from the very common System.Array type in .NET.





[VOTE] Formalizing "Extension Type" metadata in Arrow binary protocol

2019-06-10 Thread Wes McKinney
hi folks,

In two mailing list threads [1] [2] we have discussed adding an
"extension type" mechanism to the Arrow binary/IPC protocol. The idea
is to be able to "annotate" built-in Arrow data types with a type name
and serialized type data/metadata so that users can implement their
own custom columnar data containers that contain application-defined
business logic not built into the Arrow libraries. This is designed
to be non-obtrusive: readers who are not aware of an extension type
can interact with the built-in Arrow type opaquely, and propagate the
extension metadata unmodified.

As two examples:

* "uuid" may annotate "fixed size binary of value width 16 bytes"
* "latitude-longitude" may annotate "struct"
or similar

An implementation may provide specialized columnar containers with
additional business logic around manipulating such data in-memory as
required for application development

We also have prototype implementations of this mechanism ready to go
in C++ and Java. I have proposed language additions to the
specification [3] and the C++ implementation with the following
tenets:

- The custom_metadata Flatbuffers field shall use the colon character
":" as a namespace separator
- "ARROW" is designated as a reserved namespace in custom_metadata,
for example "ARROW:property"
- There may be multiple levels of namespacing, for example:
"ARROW:myorg:property_name"
- Extension type fields "ARROW:extension:name" and
"ARROW:extension:metadata" are reserved in custom_metadata to enable
serialization of extension type information
- The details of implementation and how extension types are exposed to
library users is implementation dependent
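
As a rough illustration of the key conventions listed above, here is a
sketch that models custom_metadata as a plain Python dict. Only the key
strings and the ":" separator come from the proposal; the helper names
(split_key, is_reserved, annotate, extension_name) are hypothetical and do
not correspond to any Arrow library API.

```python
# Key strings and separator are taken from the proposal text.
NAMESPACE_SEPARATOR = ":"
RESERVED_NAMESPACE = "ARROW"
EXTENSION_NAME_KEY = "ARROW:extension:name"
EXTENSION_METADATA_KEY = "ARROW:extension:metadata"


def split_key(key: str) -> list:
    """Split a custom_metadata key into its namespace levels."""
    return key.split(NAMESPACE_SEPARATOR)


def is_reserved(key: str) -> bool:
    """True when the key lives in the reserved "ARROW" namespace."""
    return split_key(key)[0] == RESERVED_NAMESPACE


def annotate(custom_metadata: dict, name: str, serialized: str) -> dict:
    """Return a copy of the metadata carrying the extension type entries."""
    annotated = dict(custom_metadata)
    annotated[EXTENSION_NAME_KEY] = name
    annotated[EXTENSION_METADATA_KEY] = serialized
    return annotated


def extension_name(custom_metadata: dict):
    """Extension type name, or None when the field is a plain Arrow type."""
    return custom_metadata.get(EXTENSION_NAME_KEY)
```

A reader that does not know the "uuid" extension could ignore both reserved
keys, treat the field as fixed size binary, and pass the dict along
unchanged.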

Please vote to accept these changes (see [3] for the actual changes).
The vote will be open for at least 72 hours

[ ] +1: Adopt these changes into the Arrow columnar format specification
[ ] +0: . . .
[ ] -1: I disagree because . . .

Here is my vote: +1

[1]: 
https://lists.apache.org/thread.html/96c3f5fe64f45a4c5ccac0562dbfd356b76cd722aa521100b5988d40@%3Cdev.arrow.apache.org%3E
[2]: 
https://lists.apache.org/thread.html/f1fc039471a8a9c06f2f9600296a20d4eb3fda379b23685f809118ee@%3Cdev.arrow.apache.org%3E
[3]: https://github.com/apache/arrow/pull/4332


Re: [DISCUSS] Timing of release and making a 1.0.0 release marking Arrow protocol stability

2019-06-10 Thread Wes McKinney
Hi all,

OK, it sounds like there is reasonable consensus behind the plan:

* Make a 0.14.0 release in the near future (later this month?)
* Publicize that the next release will be 1.0.0, in a "speak now or
hold your peace" fashion
* Release 1.0.0 as following release. I would suggest not waiting too
long, so late August / early September time frame

I'm going to continue grooming the 0.14.0 backlog to help refine the
scope of what still needs to be done for C++/Python to get the next
release out. If the stakeholders in various project subcomponents
could also groom the backlog and mark any blockers, that would be very
helpful.

I suggest shooting for a release candidate for 0.14.0 either the week
of June 24 or July 1 (depending on where things stand)

Thanks
Wes

On Mon, Jun 10, 2019 at 2:39 AM Sutou Kouhei  wrote:
>
> Hi,
>
> I think that 0.14.0 is better for the next version.
>
> People who are waiting for 1.0.0 before trying Apache Arrow will
> start using it when we release 1.0.0. If 1.0.0 satisfies them,
> we will gain more users and contributors with 1.0.0. They may
> not care about protocol stability. They may just care about "1.0.0".
>
> We'll be able to release a 1.0.0 with fewer problems by treating
> 0.14.0 as an RC for 1.0.0. 0.14.0 will be used by more people than
> a 1.0.0-RCX would be, so 0.14.0 users will find the critical problems.
>
>
> Thanks,
> --
> kou
>
> In 
>   "Re: [DISCUSS] Timing of release and making a 1.0.0 release marking Arrow 
> protocol stability" on Fri, 7 Jun 2019 22:28:22 -0700,
>   Micah Kornfield  wrote:
>
> > A few thoughts:
> > - I think we should iron out the remaining incompatibilities between java
> > and C++ before going to 1.0.0 (at least Union and NullType), and I'm not
> > sure I will have time to get to them before the next release, so I would prefer to
> > try to aim for the subsequent release to make it 1.0.0
> > - For 1.0.0 should we change the metadata format version to a new naming
> > scheme [1] (seems like more of a hassle than it is worth)?
> > - I'm a little concerned about the implications for forward-compatibility
> > restrictions for format changes.  For instance the large list types would
> > not be forward compatible (at least by some definitions), similarly if we
> > deal with compression [2] it would also seem to not be forward compatible.
> > Would this mean we bump the format version number for each change even
> > though they would be backwards compatible?
> >
> > Thanks,
> > Micah
> >
> > [1] https://github.com/apache/arrow/blob/master/format/Schema.fbs#L22
> > [2] https://issues.apache.org/jira/browse/ARROW-300
> >
> > On Fri, Jun 7, 2019 at 12:42 PM Wes McKinney  wrote:
> >
> >> I agree re: marketing value of a 1.0.0 release.
> >>
> >> For the record, I think we should continue to allow the API of each
> >> respective library component to evolve freely and allow the
> >> individuals developing each to decide how to handle deprecations, API
> >> changes, etc., as we have up until this point. The project is still
> >> very much in "innovation mode" across the board, but some parts may
> >> grow more conservative than others. Having roughly time-based releases
> >> encourages everyone to be ready-to-release at any given time, and we
> >> develop a steady cadence of getting new functionality and
> >> improvements/fixes out the door.
> >>
> >> On Fri, Jun 7, 2019 at 1:25 PM Antoine Pitrou  wrote:
> >> >
> >> >
> >> > I think there's a marketing merit to issuing a 1.0.0 release.
> >> >
> >> > Regards
> >> >
> >> > Antoine.
> >> >
> >> >
> >> > Le 07/06/2019 à 20:05, Wes McKinney a écrit :
> >> > > So one idea is that we could call the next release 1.14.0. So the
> >> > > second number is the API version number. This encodes a sequencing of
> >> > > the evolution of the API. The library APIs are already decoupled from
> >> > > the binary serialization protocol, so I think we merely have to state
> >> > > that API changes and protocol changes are not related to each other.
> >> > >
> >> > > On Fri, Jun 7, 2019 at 12:58 PM Jacques Nadeau 
> >> wrote:
> >> > >>
> >> > >> It brings up an interesting point... do we couple the stability of
> >> the apis
> >> > >> with the stability of the protocol. If the protocol is stable, we
> >> should
> >> > >> start providing guarantees for it. How do we want to express these
> >> > >> different velocities?
> >> > >>
> >> > >> On Fri, Jun 7, 2019 at 10:48 AM Antoine Pitrou 
> >> wrote:
> >> > >>
> >> > >>>
> >> > >>> Le 07/06/2019 à 19:44, Jacques Nadeau a écrit :
> >> >  On Fri, Jun 7, 2019 at 10:25 AM Antoine Pitrou 
> >> > >>> wrote:
> >> > 
> >> > > Hi Wes,
> >> > >
> >> > > Le 07/06/2019 à 17:42, Wes McKinney a écrit :
> >> > >>
> >> > >> I think
> >> > >> this would have a lot of benefits for project onlookers to remove
> >> > >> various warnings around the codebase around stability and cautions
> >> > >> against persistence of protocol data. It's fair to say that if we
> >> _do_
> >> > >> make changes in the future, 

[jira] [Created] (ARROW-5545) Clarify expectation of UTC values for timestamps with time zones in C++ API docs

2019-06-10 Thread TP Boudreau (JIRA)
TP Boudreau created ARROW-5545:
--

 Summary: Clarify expectation of UTC values for timestamps with 
time zones in C++ API docs
 Key: ARROW-5545
 URL: https://issues.apache.org/jira/browse/ARROW-5545
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: TP Boudreau
Assignee: TP Boudreau
 Fix For: 0.14.0


For timestamp datatypes, if the timezone parameter is non-empty, the int64 
array values in the associated column are assumed to be normalized to UTC.  
This requirement should be made clear to the C++ API user.  (It can be inferred 
from the flatbuffers schema, but that internal implementation document probably 
wouldn't ordinarily be consulted by a C++ API consumer.)
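
A minimal sketch of what "normalized to UTC" means for the stored values,
using only the Python standard library rather than the C++ API. The helper
name to_utc_micros is hypothetical, and microseconds are an assumed unit
chosen for the example.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_utc_micros(value: datetime) -> int:
    """Microseconds since the UNIX epoch in UTC for a timezone-aware datetime."""
    if value.tzinfo is None or value.utcoffset() is None:
        raise ValueError("expected a timezone-aware datetime")
    # Subtracting two aware datetimes accounts for their UTC offsets, so the
    # result does not depend on the zone the wall-clock time was given in.
    return (value - EPOCH) // timedelta(microseconds=1)


# The same instant expressed in two zones maps to the same stored value.
local_noon = datetime(2019, 6, 10, 12, 0, tzinfo=timezone(timedelta(hours=-4)))
same_instant_utc = datetime(2019, 6, 10, 16, 0, tzinfo=timezone.utc)
```

In this sketch the time zone only affects how a value is displayed; the
integer that would be written into the int64 column is the UTC one.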





[jira] [Created] (ARROW-5544) [Archery] should not return non-zero in `benchmark diff` sub command on regression

2019-06-10 Thread Francois Saint-Jacques (JIRA)
Francois Saint-Jacques created ARROW-5544:
-

 Summary: [Archery] should not return non-zero in `benchmark diff` 
sub command on regression
 Key: ARROW-5544
 URL: https://issues.apache.org/jira/browse/ARROW-5544
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Francois Saint-Jacques


When a regression is detected but the command ran successfully, it should 
return zero. Currently it returns the number of regressions. This change is to 
play better with ursabot. It should be left to the user to decide what to do 
with the JSON data.
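
The difference between the two behaviours can be sketched as below. This is
not Archery's code: the function names, the exit status of 1 for a failed
run, and the shape of the JSON report are all assumptions made for
illustration.

```python
import json


def current_exit_code(regressions: list) -> int:
    """Behaviour described above: exit status equals the regression count."""
    return len(regressions)


def proposed_exit_code(ran_successfully: bool) -> int:
    """Proposed behaviour: zero whenever the command itself succeeded."""
    return 0 if ran_successfully else 1  # 1 for failure is an assumption


def report(regressions: list) -> str:
    """Regressions are still reported, but only in the JSON output."""
    return json.dumps({"regressions": regressions})
```

With this split, a caller such as a CI bot can treat a non-zero status as
"the tool broke" and inspect the JSON to decide how to react to regressions.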





[jira] [Created] (ARROW-5543) [Documentation] Migrate FAQ page to Sphinx / rst around release time

2019-06-10 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-5543:
---

 Summary: [Documentation] Migrate FAQ page to Sphinx / rst around 
release time
 Key: ARROW-5543
 URL: https://issues.apache.org/jira/browse/ARROW-5543
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Wes McKinney
 Fix For: 0.14.0


In ARROW-973, a Markdown page with the FAQ was added. When we are close to 
publishing a new version of the Sphinx site, it would make sense to move the 
FAQ to the main docs project and link to it from the project front page.





[jira] [Created] (ARROW-5542) [Java] Bootstrap initial developer documentation in docs/source/developers/java.rst

2019-06-10 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-5542:
---

 Summary: [Java] Bootstrap initial developer documentation in 
docs/source/developers/java.rst
 Key: ARROW-5542
 URL: https://issues.apache.org/jira/browse/ARROW-5542
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Wes McKinney
 Fix For: 0.14.0


The project lacks prose documentation about Java development. I propose to 
begin a section about it in the Sphinx project





[jira] [Created] (ARROW-5541) [R] cast from negative int32 to uint32 and uint64 are now safe

2019-06-10 Thread JIRA
Romain François created ARROW-5541:
--

 Summary: [R] cast from negative int32 to uint32 and uint64 are now 
safe
 Key: ARROW-5541
 URL: https://issues.apache.org/jira/browse/ARROW-5541
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Romain François
Assignee: Romain François
 Fix For: 0.14.0


The tests just need some updates. 





[jira] [Created] (ARROW-5540) pa.lib.tzinfo_to_string(tz) throws ValueError: Unable to convert timezone `tzoffset(None, -14400)` to string

2019-06-10 Thread JIRA
Michał Kujawski created ARROW-5540:
--

 Summary: pa.lib.tzinfo_to_string(tz) throws ValueError: Unable to 
convert timezone `tzoffset(None, -14400)` to string
 Key: ARROW-5540
 URL: https://issues.apache.org/jira/browse/ARROW-5540
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Michał Kujawski


*Overview:*

When trying to save a DataFrame to Parquet, an error is thrown while parsing a 
column with the following properties:

 
{code:java}
dtype: datetime64[ns, tzoffset(None, -14400)]
dtype.tz: tzoffset(None, -14400)
{code}
 

 

*Error:* 
{code:java}
ValueError: Unable to convert timezone `tzoffset(None, -14400)` to string{code}
 

*Error stack:*
{code:java}
File "pyarrow/table.pxi", line 1139, in pyarrow.lib.Table.from_pandas
File 
"/home/koojav/projects/toptal/teftel/.venv/lib/python3.6/site-packages/pyarrow/pandas_compat.py",
 line 480, in dataframe_to_arrays
types)
File 
"/home/koojav/projects/toptal/teftel/.venv/lib/python3.6/site-packages/pyarrow/pandas_compat.py",
 line 209, in construct_metadata
field_name=sanitized_name)
File 
"/home/koojav/projects/toptal/teftel/.venv/lib/python3.6/site-packages/pyarrow/pandas_compat.py",
 line 153, in get_column_metadata
string_dtype, extra_metadata = get_extension_dtype_info(column)
File 
"/home/koojav/projects/toptal/teftel/.venv/lib/python3.6/site-packages/pyarrow/pandas_compat.py",
 line 126, in get_extension_dtype_info
metadata = {'timezone': pa.lib.tzinfo_to_string(dtype.tz)}
File "pyarrow/types.pxi", line 1149, in pyarrow.lib.tzinfo_to_string
ValueError: Unable to convert timezone `tzoffset(None, -14400)` to string
{code}
*Libraries:*
 * pandas 0.24.2
 * pyarrow 0.13.0
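
One possible direction, sketched with the standard library only: render a
fixed-offset tzinfo as a "+HH:MM" / "-HH:MM" string. This is not the pyarrow
implementation. The helper name fixed_offset_to_string is hypothetical,
datetime.timezone stands in for dateutil's tzoffset, and the sketch assumes
that an offset string of this form is acceptable as an Arrow timezone value.

```python
from datetime import timedelta, timezone, tzinfo


def fixed_offset_to_string(tz: tzinfo) -> str:
    """Format a fixed UTC offset as "+HH:MM" or "-HH:MM"."""
    offset = tz.utcoffset(None)
    if offset is None:
        raise ValueError(f"{tz!r} has no fixed UTC offset")
    total_seconds = int(offset.total_seconds())
    if total_seconds % 60 != 0:
        raise ValueError("offsets with a seconds component are not handled")
    sign = "-" if total_seconds < 0 else "+"
    hours, remainder = divmod(abs(total_seconds), 3600)
    return f"{sign}{hours:02d}:{remainder // 60:02d}"


# Equivalent of tzoffset(None, -14400) from the report, using the stdlib type.
reported_zone = timezone(timedelta(seconds=-14400))
```

For the offset in this report the helper yields "-04:00", which could then be
stored in the 'timezone' metadata entry instead of raising.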





[jira] [Created] (ARROW-5539) [Java] Test failure

2019-06-10 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-5539:
-

 Summary: [Java] Test failure
 Key: ARROW-5539
 URL: https://issues.apache.org/jira/browse/ARROW-5539
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Antoine Pitrou


I know next to nothing about Java ecosystems. I'm trying to build and test 
locally, and get the following failures:
{code}
[ERROR] Tests run: 6, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 0.011 s 
<<< FAILURE! - in io.netty.buffer.TestArrowBuf
[ERROR] testSetBytesSliced(io.netty.buffer.TestArrowBuf)  Time elapsed: 0.004 s 
 <<< ERROR!
java.lang.NoSuchMethodError: 
io.netty.buffer.ArrowBuf.setBytes(ILjava/nio/ByteBuffer;II)Lio/netty/buffer/ArrowBuf;
at 
io.netty.buffer.TestArrowBuf.testSetBytesSliced(TestArrowBuf.java:100)

[ERROR] testSetBytesUnsliced(io.netty.buffer.TestArrowBuf)  Time elapsed: 0 s  
<<< ERROR!
java.lang.NoSuchMethodError: 
io.netty.buffer.ArrowBuf.setBytes(ILjava/nio/ByteBuffer;II)Lio/netty/buffer/ArrowBuf;
at 
io.netty.buffer.TestArrowBuf.testSetBytesUnsliced(TestArrowBuf.java:121)

12:27:49.541 [main] WARN  o.apache.arrow.memory.BoundsChecking - 
"drill.enable_unsafe_memory_access" has been renamed to 
"arrow.enable_unsafe_memory_access"
12:27:49.543 [main] WARN  o.apache.arrow.memory.BoundsChecking - 
"arrow.enable_unsafe_memory_access" can be set to:  true (to not check) or 
false (to check, default)
12:27:49.617 [main] WARN  o.apache.arrow.memory.BoundsChecking - 
"drill.enable_unsafe_memory_access" has been renamed to 
"arrow.enable_unsafe_memory_access"
12:27:49.619 [main] WARN  o.apache.arrow.memory.BoundsChecking - 
"arrow.enable_unsafe_memory_access" can be set to:  true (to not check) or 
false (to check, default)
{code}

Java version is the following:
{code}
$ java -version
java version "1.8.0_201"
Java(TM) SE Runtime Environment (build 1.8.0_201-b09)
Java HotSpot(TM) 64-Bit Server VM (build 25.201-b09, mixed mode)
{code}

I'm on Ubuntu 18.04. Perhaps I need another JVM?





[Discuss][Java][Typical use cases for dictionary encoding string vectors]

2019-06-10 Thread Fan Liya
Hi all,

This is concerning issue ARROW-3396.

I have summarized the problem (please see if my understanding is correct),
and proposed some solutions to it. Please give your valuable feedback.
For details, please see:

https://docs.google.com/document/d/1Y2E6RbZkUj3SwuEJrlEjaeIPmCA1SIsi9wmbJmVlB2I/edit?usp=sharing

Thank you in advance.

Best,
Liya Fan


Re: [DISCUSS] Timing of release and making a 1.0.0 release marking Arrow protocol stability

2019-06-10 Thread Sutou Kouhei
Hi,

I think that 0.14.0 is better for the next version.

People who are waiting for 1.0.0 before trying Apache Arrow will
start using it when we release 1.0.0. If 1.0.0 satisfies them,
we will gain more users and contributors with 1.0.0. They may
not care about protocol stability. They may just care about "1.0.0".

We'll be able to release a 1.0.0 with fewer problems by treating
0.14.0 as an RC for 1.0.0. 0.14.0 will be used by more people than
a 1.0.0-RCX would be, so 0.14.0 users will find the critical problems.


Thanks,
--
kou

In 
  "Re: [DISCUSS] Timing of release and making a 1.0.0 release marking Arrow 
protocol stability" on Fri, 7 Jun 2019 22:28:22 -0700,
  Micah Kornfield  wrote:

> A few thoughts:
> - I think we should iron out the remaining incompatibilities between java
> and C++ before going to 1.0.0 (at least Union and NullType), and I'm not
> sure I will have time to get to them before the next release, so I would prefer to
> try to aim for the subsequent release to make it 1.0.0
> - For 1.0.0 should we change the metadata format version to a new naming
> scheme [1] (seems like more of a hassle than it is worth)?
> - I'm a little concerned about the implications for forward-compatibility
> restrictions for format changes.  For instance the large list types would
> not be forward compatible (at least by some definitions), similarly if we
> deal with compression [2] it would also seem to not be forward compatible.
> Would this mean we bump the format version number for each change even
> though they would be backwards compatible?
> 
> Thanks,
> Micah
> 
> [1] https://github.com/apache/arrow/blob/master/format/Schema.fbs#L22
> [2] https://issues.apache.org/jira/browse/ARROW-300
> 
> On Fri, Jun 7, 2019 at 12:42 PM Wes McKinney  wrote:
> 
>> I agree re: marketing value of a 1.0.0 release.
>>
>> For the record, I think we should continue to allow the API of each
>> respective library component to evolve freely and allow the
>> individuals developing each to decide how to handle deprecations, API
>> changes, etc., as we have up until this point. The project is still
>> very much in "innovation mode" across the board, but some parts may
>> grow more conservative than others. Having roughly time-based releases
>> encourages everyone to be ready-to-release at any given time, and we
>> develop a steady cadence of getting new functionality and
>> improvements/fixes out the door.
>>
>> On Fri, Jun 7, 2019 at 1:25 PM Antoine Pitrou  wrote:
>> >
>> >
>> > I think there's a marketing merit to issuing a 1.0.0 release.
>> >
>> > Regards
>> >
>> > Antoine.
>> >
>> >
>> > Le 07/06/2019 à 20:05, Wes McKinney a écrit :
>> > > So one idea is that we could call the next release 1.14.0. So the
>> > > second number is the API version number. This encodes a sequencing of
>> > > the evolution of the API. The library APIs are already decoupled from
>> > > the binary serialization protocol, so I think we merely have to state
>> > > that API changes and protocol changes are not related to each other.
>> > >
>> > > On Fri, Jun 7, 2019 at 12:58 PM Jacques Nadeau 
>> wrote:
>> > >>
>> > >> It brings up an interesting point... do we couple the stability of
>> the apis
>> > >> with the stability of the protocol. If the protocol is stable, we
>> should
>> > >> start providing guarantees for it. How do we want to express these
>> > >> different velocities?
>> > >>
>> > >> On Fri, Jun 7, 2019 at 10:48 AM Antoine Pitrou 
>> wrote:
>> > >>
>> > >>>
>> > >>> Le 07/06/2019 à 19:44, Jacques Nadeau a écrit :
>> >  On Fri, Jun 7, 2019 at 10:25 AM Antoine Pitrou 
>> > >>> wrote:
>> > 
>> > > Hi Wes,
>> > >
>> > > Le 07/06/2019 à 17:42, Wes McKinney a écrit :
>> > >>
>> > >> I think
>> > >> this would have a lot of benefits for project onlookers to remove
>> > >> various warnings around the codebase around stability and cautions
>> > >> against persistence of protocol data. It's fair to say that if we
>> _do_
>> > >> make changes in the future, that there will be a transition path
>> for
>> > >> migrate persisted data, should it ever come to that.
>> > >
>> > > I think that's a good idea, but perhaps the stability promise
>> shouldn't
>> > > cover the Flight protocol yet?
>> > 
>> >  Agreed.
>> > 
>> > >> I would suggest a "1.0.0" release either as our next release
>> (instead
>> > >> of 0.14.0) or the release right after that (if we need more time
>> to
>> > >> get affairs in order), with the guidance for users of:
>> > >
>> > > I think we should first do a regular 0.14.0 with all that's on our
>> plate
>> > > right now, then work towards a 1.0.0 as the release following that.
>> > 
>> >  What is different from your perspective? If the protocol hasn't
>> changed
>> > >>> in
>> >  over a year, why not call it 1.0?
>> > >>>
>> > >>> I would say that perhaps some API cleanup is in order.  Remove
>> > >>> deprecated ones, review experimental APIs, perhaps mark experimental
>> > >>> certain 

[jira] [Created] (ARROW-5538) [C++] Restrict minimum OpenSSL version to 1.0.2

2019-06-10 Thread Deepak Majeti (JIRA)
Deepak Majeti created ARROW-5538:


 Summary: [C++] Restrict minimum OpenSSL version to 1.0.2
 Key: ARROW-5538
 URL: https://issues.apache.org/jira/browse/ARROW-5538
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Deepak Majeti
Assignee: Deepak Majeti


We must enable encryption support in Arrow only if the OpenSSL version is at 
least 1.0.2. The official documentation prohibits using older versions.

[https://www.openssl.org/source/]


