[jira] [Created] (ARROW-6053) [Python] RecordBatchStreamReader::Open2 cdef type signature doesn't match C++

2019-07-26 Thread Paul Taylor (JIRA)
Paul Taylor created ARROW-6053:
--

 Summary: [Python] RecordBatchStreamReader::Open2 cdef type 
signature doesn't match C++
 Key: ARROW-6053
 URL: https://issues.apache.org/jira/browse/ARROW-6053
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Python
Affects Versions: 0.14.1
Reporter: Paul Taylor
Assignee: Paul Taylor


The Cython method signature for RecordBatchStreamReader::Open2 doesn't match 
the C++ type signature, causing a compiler type error when calling Open2 
from Cython.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


BigQuery Storage API now supports Arrow

2019-07-26 Thread Micah Kornfield
Hi Arrow Dev,
As a follow-up to an old thread [1] on working with BigQuery and Arrow, I
just wanted to share some work that Brian Hulette and I helped out with.

I'm happy to announce there is now preliminary support for reading Arrow
data in the BigQuery Storage API [2].  Python library support is available
in the latest release of google-cloud-bigquery-storage [3][4].

Caveats:
- Small cached tables are not supported (same with Avro)
- Row filters aren't supported yet.

Cheers,
Micah

[1]
https://lists.apache.org/thread.html/6d374dc6c948d3e84b1f0feda1d48eddf905a99c0ef569d46af7f7af@%3Cdev.arrow.apache.org%3E
[2] https://cloud.google.com/bigquery/docs/reference/storage/
[3] https://pypi.org/project/google-cloud-bigquery-storage/
[4]
https://googleapis.github.io/google-cloud-python/latest/bigquery_storage/gapic/v1beta1/reader.html#google.cloud.bigquery_storage_v1beta1.reader.ReadRowsIterable.to_arrow


Re: [VOTE] Adopt FORMAT and LIBRARY SemVer-based version schemes for Arrow 1.0.0 and beyond

2019-07-26 Thread Micah Kornfield
+1 (non-binding) Adopt these version conventions and compatibility
guarantees as of Apache Arrow 1.0.0

On Fri, Jul 26, 2019 at 12:34 PM Wes McKinney  wrote:

> hello,
>
> As discussed on the mailing list thread [1], Micah Kornfield has
> proposed a version scheme for the project to take effect starting with
> the 1.0.0 release. See document [2] containing a discussion of the
> issues involved.
>
> To summarize my understanding of the plan:
>
> 1. TWO VERSIONS: As of 1.0.0, we establish separate FORMAT and LIBRARY
> versions. Currently there is only a single version number.
>
> 2. SEMANTIC VERSIONING: We follow https://semver.org/ with regards to
> communicating library API changes. Given the project's pace of
> evolution, most releases are likely to be MAJOR releases according to
> SemVer principles.
>
> 3. RELEASES: Releases of the project will be named according to the
> LIBRARY version. A major release may or may not change the FORMAT
> version. When a LIBRARY version has been released for a new FORMAT
> version, the latter is considered to be released and official.
>
> 4. Each LIBRARY version will have a corresponding FORMAT version. For
> example, LIBRARY versions 2.0.0 and 3.0.0 may track FORMAT version
> 1.0.0. The idea is that FORMAT version will change less often than
> LIBRARY version.
>
> 5. BACKWARD COMPATIBILITY GUARANTEE: A newer versioned client library
> will be able to read any data and metadata produced by an older client
> library.
>
> 6. FORWARD COMPATIBILITY GUARANTEE: An older client library must be
> able to either read data generated from a new client library or detect
> that it cannot properly read the data.
>
> 7. FORMAT MINOR VERSIONS: An increase in the minor version of the
> FORMAT version, such as 1.0.0 to 1.1.0, indicates that 1.1.0 contains
> new features not available in 1.0.0. So long as these features are not
> used (such as a new logical data type), forward compatibility is
> preserved.
>
> 8. FORMAT MAJOR VERSIONS: A change in the FORMAT major version
> indicates a disruption to these compatibility guarantees in some way.
> Hopefully we don't have to do this many times in our respective
> lifetimes.
>
> If I've misrepresented some aspect of the proposal it's fine to
> discuss more and we can start a new vote.
>
> Please vote to approve this proposal. I'd like to keep this vote open
> for 7 days (until Friday August 2) to allow for ample opportunities
> for the community to have a look.
>
> [ ] +1 Adopt these version conventions and compatibility guarantees as
> of Apache Arrow 1.0.0
> [ ] +0
> [ ] -1 I disagree because...
>
> Here is my vote: +1
>
> Thanks
> Wes
>
> [1]:
> https://lists.apache.org/thread.html/5715a4d402c835d22d929a8069c5c0cf232077a660ee98639d544af8@%3Cdev.arrow.apache.org%3E
> [2]:
> https://docs.google.com/document/d/1uBitWu57rDu85tNHn0NwstAbrlYqor9dPFg_7QaE-nc/edit#
>


[jira] [Created] (ARROW-6052) [C++] Divide up arrow/array.h,cc into files in arrow/array/ similar to builder files

2019-07-26 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-6052:
---

 Summary: [C++] Divide up arrow/array.h,cc into files in 
arrow/array/ similar to builder files
 Key: ARROW-6052
 URL: https://issues.apache.org/jira/browse/ARROW-6052
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


Since these files are getting larger, this would improve codebase navigability. 
We should probably use the same naming scheme as builder_*, e.g. 
{{arrow/array/array_dict.h}}.

I recommend also putting the unit test files related to these in there for 
better semantic organization. 





[jira] [Created] (ARROW-6050) [Java] Update out-of-date java/flight/README.md

2019-07-26 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-6050:
---

 Summary: [Java] Update out-of-date java/flight/README.md
 Key: ARROW-6050
 URL: https://issues.apache.org/jira/browse/ARROW-6050
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Wes McKinney
 Fix For: 1.0.0


See example bug report

https://github.com/apache/arrow/issues/4955





[jira] [Created] (ARROW-6048) [C++] Add ChunkedArray::View which calls to Array::View

2019-07-26 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-6048:
---

 Summary: [C++] Add ChunkedArray::View which calls to Array::View
 Key: ARROW-6048
 URL: https://issues.apache.org/jira/browse/ARROW-6048
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


This convenience will help with zero-copy casting from one compatible type to 
another.

I implemented a workaround for this in ARROW-3772.





[jira] [Created] (ARROW-6049) [C++] Support using Array::View from compatible dictionary type to another

2019-07-26 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-6049:
---

 Summary: [C++] Support using Array::View from compatible 
dictionary type to another
 Key: ARROW-6049
 URL: https://issues.apache.org/jira/browse/ARROW-6049
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


For example, zero-copy conversion from {{dictionary(int32(), binary())}} to 
{{dictionary(int32(), utf8())}}. The implementation must remember to call 
{{View}} on {{ArrayData::dictionary}}.





[VOTE] Adopt FORMAT and LIBRARY SemVer-based version schemes for Arrow 1.0.0 and beyond

2019-07-26 Thread Wes McKinney
hello,

As discussed on the mailing list thread [1], Micah Kornfield has
proposed a version scheme for the project to take effect starting with
the 1.0.0 release. See document [2] containing a discussion of the
issues involved.

To summarize my understanding of the plan:

1. TWO VERSIONS: As of 1.0.0, we establish separate FORMAT and LIBRARY
versions. Currently there is only a single version number.

2. SEMANTIC VERSIONING: We follow https://semver.org/ with regards to
communicating library API changes. Given the project's pace of
evolution, most releases are likely to be MAJOR releases according to
SemVer principles.

3. RELEASES: Releases of the project will be named according to the
LIBRARY version. A major release may or may not change the FORMAT
version. When a LIBRARY version has been released for a new FORMAT
version, the latter is considered to be released and official.

4. Each LIBRARY version will have a corresponding FORMAT version. For
example, LIBRARY versions 2.0.0 and 3.0.0 may track FORMAT version
1.0.0. The idea is that FORMAT version will change less often than
LIBRARY version.

5. BACKWARD COMPATIBILITY GUARANTEE: A newer versioned client library
will be able to read any data and metadata produced by an older client
library.

6. FORWARD COMPATIBILITY GUARANTEE: An older client library must be
able to either read data generated from a new client library or detect
that it cannot properly read the data.

7. FORMAT MINOR VERSIONS: An increase in the minor version of the
FORMAT version, such as 1.0.0 to 1.1.0, indicates that 1.1.0 contains
new features not available in 1.0.0. So long as these features are not
used (such as a new logical data type), forward compatibility is
preserved.

8. FORMAT MAJOR VERSIONS: A change in the FORMAT major version
indicates a disruption to these compatibility guarantees in some way.
Hopefully we don't have to do this many times in our respective
lifetimes.
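
The compatibility rules in points 5-8 can be sketched as a simple version
check (illustrative Python only; the actual guarantee also depends on which
format features the data uses, which a bare version compare can't see):

```python
def reader_can_read(reader_fmt, data_fmt):
    """Decide whether a library speaking FORMAT version `reader_fmt`
    should attempt to read data written with FORMAT version `data_fmt`.
    Versions are (major, minor) tuples.
    """
    r_major, _r_minor = reader_fmt
    d_major, _d_minor = data_fmt
    if r_major != d_major:
        # A FORMAT major bump disrupts the guarantees: an older reader
        # must detect the mismatch and refuse rather than misread, while
        # a newer reader can still read older-major data (point 5).
        return r_major > d_major
    # Same major version: a newer reader reads older data (backward
    # compatibility), and a newer minor only adds optional features, so
    # an older reader may still try -- it fails only if the new features
    # (e.g. a new logical type) are actually used (point 7).
    return True
```

For example, a reader at FORMAT 1.0 can attempt 1.1 data, while a 1.x
reader must reject 2.0 data outright.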

If I've misrepresented some aspect of the proposal it's fine to
discuss more and we can start a new vote.

Please vote to approve this proposal. I'd like to keep this vote open
for 7 days (until Friday August 2) to allow for ample opportunities
for the community to have a look.

[ ] +1 Adopt these version conventions and compatibility guarantees as
of Apache Arrow 1.0.0
[ ] +0
[ ] -1 I disagree because...

Here is my vote: +1

Thanks
Wes

[1]: 
https://lists.apache.org/thread.html/5715a4d402c835d22d929a8069c5c0cf232077a660ee98639d544af8@%3Cdev.arrow.apache.org%3E
[2]: 
https://docs.google.com/document/d/1uBitWu57rDu85tNHn0NwstAbrlYqor9dPFg_7QaE-nc/edit#


Re: [jira] [Created] (ARROW-6047) [Rust] Rust nightly 1.38.0 builds failing

2019-07-26 Thread Andy Grove
The issue seems to be due to a new release of the parquet-format crate
yesterday rather than an issue with Rust nightly.

On Fri, Jul 26, 2019 at 1:06 PM Wes McKinney (JIRA)  wrote:

> Wes McKinney created ARROW-6047:
> ---
>
>  Summary: [Rust] Rust nightly 1.38.0 builds failing
>  Key: ARROW-6047
>  URL: https://issues.apache.org/jira/browse/ARROW-6047
>  Project: Apache Arrow
>   Issue Type: Bug
>   Components: Rust
> Reporter: Wes McKinney
>  Fix For: 1.0.0
>
>
> see
>
> * https://travis-ci.org/apache/arrow/jobs/563893205
> * https://ci.ursalabs.org/#/builders/93/builds/669/steps/2/logs/stdio
>
>
>
>


[jira] [Created] (ARROW-6047) [Rust] Rust nightly 1.38.0 builds failing

2019-07-26 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-6047:
---

 Summary: [Rust] Rust nightly 1.38.0 builds failing
 Key: ARROW-6047
 URL: https://issues.apache.org/jira/browse/ARROW-6047
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Reporter: Wes McKinney
 Fix For: 1.0.0


see

* https://travis-ci.org/apache/arrow/jobs/563893205
* https://ci.ursalabs.org/#/builders/93/builds/669/steps/2/logs/stdio





Procedure for seeing through implementation of new Arrow format features

2019-07-26 Thread Wes McKinney
hi folks,

In https://github.com/apache/arrow/pull/4921 Antoine has implemented
binary and unicode type variants with 64-bit offsets (to permit
payloads exceeding 2GB). I believe Micah plans to follow on with a
Java implementation in a separate PR.

We have previously discussed that we should ensure that we have
multiple tested implementations of new Arrow columnar format features
-- but we have not discussed how such changes are to be sequenced.

For the avoidance of doubt, my view on the procedure is that we should
not _release_ the project containing additions to the protocol files
without at least 2 independent implementations of those features (with
integration tests, where relevant). I think it's OK for one
implementation to land in master and then the other in a follow up PR.

We'd like to get https://github.com/apache/arrow/pull/4921 merged soon
(today or tomorrow if possible), so if anyone has a different opinion
about how to stage new format implementations in a non-awkward way,
could you please chime in?

Thank you,
Wes


[jira] [Created] (ARROW-6046) Slice RecordBatch of String array with offset 0

2019-07-26 Thread Sascha Hofmann (JIRA)
Sascha Hofmann created ARROW-6046:
-

 Summary: Slice RecordBatch of String array with offset 0
 Key: ARROW-6046
 URL: https://issues.apache.org/jira/browse/ARROW-6046
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
Affects Versions: 0.14.1
Reporter: Sascha Hofmann


We are seeing a bug very similar to ARROW-809, just for a RecordBatch of 
strings. A slice of a RecordBatch with a string column and offset 0 returns 
the whole batch instead.

 
{code:java}
import pandas as pd
import pyarrow as pa
df = pd.DataFrame({ 'b': ['test' for x in range(1000_000)]})
tbl = pa.Table.from_pandas(df)
batch = tbl.to_batches()[0]

batch.slice(0,2).serialize().size
# 4000232

batch.slice(1,2).serialize().size
# 240
{code}
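
For reference, the expected row count of a slice should not depend on whether
the offset is zero; a plain-Python statement of the intended semantics (not
Arrow's implementation):

```python
def expected_slice_length(num_rows, offset, length):
    """Rows a slice(offset, length) should contain: min(length,
    num_rows - offset), for offset 0 exactly as for any other offset."""
    return max(0, min(length, num_rows - offset))
```

So both `batch.slice(0, 2)` and `batch.slice(1, 2)` above should serialize
two rows, not a million.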
 





[jira] [Created] (ARROW-6045) Benchmark for Parquet float and NaN encoding/decoding

2019-07-26 Thread Itamar Turner-Trauring (JIRA)
Itamar Turner-Trauring created ARROW-6045:
-

 Summary: Benchmark for Parquet float and NaN encoding/decoding
 Key: ARROW-6045
 URL: https://issues.apache.org/jira/browse/ARROW-6045
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Itamar Turner-Trauring


It is possible that at one point Parquet NaN encoding was slower, so it's worth 
extending the Parquet benchmarks to cover floats in general and NaNs in 
particular.
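
A minimal shape such a benchmark could take, sketched in Python with a
stand-in encode step (the real benchmark would drive the C++ Parquet writer;
the function names here are hypothetical):

```python
import math
import random
import timeit

def make_floats(n, nan_fraction):
    """Generate n doubles, the first n * nan_fraction of them NaN."""
    values = [random.random() for _ in range(n)]
    for i in range(int(n * nan_fraction)):
        values[i] = math.nan
    return values

def bench_encode(encode, n=100_000, repeats=3):
    """Time `encode` on an all-finite input vs. a 50%-NaN input,
    returning (seconds_finite, seconds_nan) for comparison."""
    finite = make_floats(n, 0.0)
    noisy = make_floats(n, 0.5)
    t_finite = timeit.timeit(lambda: encode(finite), number=repeats)
    t_noisy = timeit.timeit(lambda: encode(noisy), number=repeats)
    return t_finite, t_noisy
```

Plugging in the Parquet float encoder as `encode` would show directly whether
NaN-heavy columns pay a penalty.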





[jira] [Created] (ARROW-6044) Pyarrow HDFS client gets hung after a while

2019-07-26 Thread Fred Tzeng (JIRA)
Fred Tzeng created ARROW-6044:
-

 Summary: Pyarrow HDFS client gets hung after a while
 Key: ARROW-6044
 URL: https://issues.apache.org/jira/browse/ARROW-6044
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.13.0
 Environment: hadoop-3.0.3
driver='libhdfs'
python 3.6
Centos7
Reporter: Fred Tzeng


I'm using the pyarrow HDFS client in a long-running (forever) app that makes 
connections to HDFS as external requests come in and destroys the connection as 
soon as each request is handled. This happens a large number of times on 
separate threads and everything works great.

The problem is, after the app idles for a while (perhaps hours) and no HDFS 
connections are made during this time, when the next connection is attempted, 
the API hdfs.connect(...) just hangs. No exceptions are thrown.

Code snippet of what I'm doing to instantiate each connection:

...

hdfs = pyarrow.hdfs.connect(self.hdfs_authority, self.hdfs_port,
                            user=self.hdfs_user)
try:
    ...  # do something with the connection
finally:
    hdfs.close()

 

Any help on what might be causing these hangs is appreciated.
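
Not a fix, but one way to make the hang observable is to put a timeout around
the connect call (a sketch using only the standard library; `connect_fn`
stands in for the real `pyarrow.hdfs.connect` call):

```python
import concurrent.futures

def connect_with_timeout(connect_fn, timeout_s=30.0):
    """Run a possibly-hanging connect call in a worker thread and raise
    concurrent.futures.TimeoutError instead of blocking forever.
    Caveat: on timeout the worker thread is leaked, because the
    underlying blocked call can't be interrupted from Python -- but the
    caller at least gets an error it can retry or alert on.
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(connect_fn)
    try:
        return future.result(timeout=timeout_s)
    finally:
        # Don't wait for a stuck worker when shutting the executor down.
        executor.shutdown(wait=False)
```

Usage would be e.g. `hdfs = connect_with_timeout(lambda:
pyarrow.hdfs.connect(host, port, user=user))`, turning a silent hang into a
TimeoutError.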

 





Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)

2019-07-26 Thread Micah Kornfield
>
> It's not just computation libraries, it's any library peeking inside
> Arrow data.  Currently, the Arrow data types are simple, which makes it
> easy and non-intimidating to build data processing utilities around
> them.  If we start adding sophisticated encodings, we also raise the
> cost of supporting Arrow for third-party libraries.


This is another legitimate concern about complexity.

To try to limit complexity, I simplified the proposal PR [1] to have only one
buffer encoding scheme (FrameOfReferenceIntEncoding) and one array encoding
scheme (RLE), which I think will have the most benefit if exploited
properly.  Compression is removed.
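
For readers who haven't followed the PR, both retained schemes are simple in
principle; an illustrative plain-Python version (the proposal's actual buffer
layouts and integer widths differ):

```python
def rle_encode(values):
    """Run-length encode: collapse consecutive equal values into
    (value, run_length) pairs -- effective on sorted or repetitive data."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, n) for v, n in runs]

def rle_decode(runs):
    out = []
    for v, n in runs:
        out.extend([v] * n)
    return out

def frame_of_reference_encode(ints):
    """Frame-of-reference: store a base plus small non-negative deltas,
    which can then be packed into narrower integers than the originals."""
    base = min(ints)
    return base, [x - base for x in ints]

def frame_of_reference_decode(base, deltas):
    return [base + d for d in deltas]
```

The performance argument is that kernels can operate on the compact runs or
deltas directly, not just that the buffers shrink.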

I'd like to get closure on the proposal one way or another.  I think the
question to be answered now is whether we are willing to introduce the
additional complexity for the performance improvements these encodings can
yield.  Is there more data that people would like to see that would influence
their decision?

Thanks,
Micah

[1] https://github.com/apache/arrow/pull/4815

On Mon, Jul 22, 2019 at 8:59 AM Antoine Pitrou  wrote:

> On Mon, 22 Jul 2019 08:40:08 -0700
> Brian Hulette  wrote:
> > To me, the most important aspect of this proposal is the addition of
> sparse
> > encodings, and I'm curious if there are any more objections to that
> > specifically. So far I believe the only one is that it will make
> > computation libraries more complicated. This is absolutely true, but I
> > think it's worth that cost.
>
> It's not just computation libraries, it's any library peeking inside
> Arrow data.  Currently, the Arrow data types are simple, which makes it
> easy and non-intimidating to build data processing utilities around
> them.  If we start adding sophisticated encodings, we also raise the
> cost of supporting Arrow for third-party libraries.
>
> Regards
>
> Antoine.
>
>
>