[jira] [Created] (ARROW-6053) [Python] RecordBatchStreamReader::Open2 cdef type signature doesn't match C++
Paul Taylor created ARROW-6053: -- Summary: [Python] RecordBatchStreamReader::Open2 cdef type signature doesn't match C++ Key: ARROW-6053 URL: https://issues.apache.org/jira/browse/ARROW-6053 Project: Apache Arrow Issue Type: New Feature Components: Python Affects Versions: 0.14.1 Reporter: Paul Taylor Assignee: Paul Taylor The Cython method signature for RecordBatchStreamReader::Open2 doesn't match the C++ type signature, causing a compiler type error when calling Open2 from Cython. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
BigQuery Storage API now supports Arrow
Hi Arrow Dev,

As a follow-up to an old thread [1] on working with BigQuery and Arrow, I wanted to share some work that Brian Hulette and I helped out with. I'm happy to announce there is now preliminary support for reading Arrow data in the BigQuery Storage API [2]. Python library support is available in the latest release of google-cloud-bigquery-storage [3][4].

Caveats:
- Small cached tables are not supported (same with Avro).
- Row filters aren't supported yet.

Cheers,
Micah

[1] https://lists.apache.org/thread.html/6d374dc6c948d3e84b1f0feda1d48eddf905a99c0ef569d46af7f7af@%3Cdev.arrow.apache.org%3E
[2] https://cloud.google.com/bigquery/docs/reference/storage/
[3] https://pypi.org/project/google-cloud-bigquery-storage/
[4] https://googleapis.github.io/google-cloud-python/latest/bigquery_storage/gapic/v1beta1/reader.html#google.cloud.bigquery_storage_v1beta1.reader.ReadRowsIterable.to_arrow
Re: [VOTE] Adopt FORMAT and LIBRARY SemVer-based version schemes for Arrow 1.0.0 and beyond
+1 (non-binding)

Adopt these version conventions and compatibility guarantees as of Apache Arrow 1.0.0

On Fri, Jul 26, 2019 at 12:34 PM Wes McKinney wrote:
> Hello,
>
> As discussed on the mailing list thread [1], Micah Kornfield has proposed a version scheme for the project to take effect starting with the 1.0.0 release. See document [2] containing a discussion of the issues involved.
>
> To summarize my understanding of the plan:
>
> 1. TWO VERSIONS: As of 1.0.0, we establish separate FORMAT and LIBRARY versions. Currently there is only a single version number.
>
> 2. SEMANTIC VERSIONING: We follow https://semver.org/ with regard to communicating library API changes. Given the project's pace of evolution, most releases are likely to be MAJOR releases according to SemVer principles.
>
> 3. RELEASES: Releases of the project will be named according to the LIBRARY version. A major release may or may not change the FORMAT version. When a LIBRARY version has been released for a new FORMAT version, the latter is considered to be released and official.
>
> 4. Each LIBRARY version will have a corresponding FORMAT version. For example, LIBRARY versions 2.0.0 and 3.0.0 may track FORMAT version 1.0.0. The idea is that the FORMAT version will change less often than the LIBRARY version.
>
> 5. BACKWARD COMPATIBILITY GUARANTEE: A newer versioned client library will be able to read any data and metadata produced by an older client library.
>
> 6. FORWARD COMPATIBILITY GUARANTEE: An older client library must be able to either read data generated from a new client library or detect that it cannot properly read the data.
>
> 7. FORMAT MINOR VERSIONS: An increase in the minor version of the FORMAT version, such as 1.0.0 to 1.1.0, indicates that 1.1.0 contains new features not available in 1.0.0. So long as these features are not used (such as a new logical data type), forward compatibility is preserved.
>
> 8. FORMAT MAJOR VERSIONS: A change in the FORMAT major version indicates a disruption to these compatibility guarantees in some way. Hopefully we don't have to do this many times in our respective lifetimes.
>
> If I've misrepresented some aspect of the proposal it's fine to discuss more and we can start a new vote.
>
> Please vote to approve this proposal. I'd like to keep this vote open for 7 days (until Friday August 2) to allow ample opportunity for the community to have a look.
>
> [ ] +1 Adopt these version conventions and compatibility guarantees as of Apache Arrow 1.0.0
> [ ] +0
> [ ] -1 I disagree because...
>
> Here is my vote: +1
>
> Thanks
> Wes
>
> [1]: https://lists.apache.org/thread.html/5715a4d402c835d22d929a8069c5c0cf232077a660ee98639d544af8@%3Cdev.arrow.apache.org%3E
> [2]: https://docs.google.com/document/d/1uBitWu57rDu85tNHn0NwstAbrlYqor9dPFg_7QaE-nc/edit#
[jira] [Created] (ARROW-6052) [C++] Divide up arrow/array.h,cc into files in arrow/array/ similar to builder files
Wes McKinney created ARROW-6052: --- Summary: [C++] Divide up arrow/array.h,cc into files in arrow/array/ similar to builder files Key: ARROW-6052 URL: https://issues.apache.org/jira/browse/ARROW-6052 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 Since these files are getting larger, this would improve codebase navigability. Probably should use the same naming scheme as builder_* e.g. {{arrow/array/array_dict.h}} I recommend also putting the unit test files related to these in there for better semantic organization. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6050) [Java] Update out-of-date java/flight/README.md
Wes McKinney created ARROW-6050: --- Summary: [Java] Update out-of-date java/flight/README.md Key: ARROW-6050 URL: https://issues.apache.org/jira/browse/ARROW-6050 Project: Apache Arrow Issue Type: Bug Components: Java Reporter: Wes McKinney Fix For: 1.0.0 See example bug report https://github.com/apache/arrow/issues/4955 -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6048) [C++] Add ChunkedArray::View which calls to Array::View
Wes McKinney created ARROW-6048: --- Summary: [C++] Add ChunkedArray::View which calls to Array::View Key: ARROW-6048 URL: https://issues.apache.org/jira/browse/ARROW-6048 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 This convenience will help with zero-copy casting from one compatible type to another. I implemented a workaround for this in ARROW-3772 -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6049) [C++] Support using Array::View from compatible dictionary type to another
Wes McKinney created ARROW-6049: --- Summary: [C++] Support using Array::View from compatible dictionary type to another Key: ARROW-6049 URL: https://issues.apache.org/jira/browse/ARROW-6049 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 For example, zero-copy conversion from {{dictionary(int32(), binary())}} to {{dictionary(int32(), utf8())}}. The implementation must remember to call {{View}} on {{ArrayData::dictionary}} -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[VOTE] Adopt FORMAT and LIBRARY SemVer-based version schemes for Arrow 1.0.0 and beyond
Hello,

As discussed on the mailing list thread [1], Micah Kornfield has proposed a version scheme for the project to take effect starting with the 1.0.0 release. See document [2] containing a discussion of the issues involved.

To summarize my understanding of the plan:

1. TWO VERSIONS: As of 1.0.0, we establish separate FORMAT and LIBRARY versions. Currently there is only a single version number.

2. SEMANTIC VERSIONING: We follow https://semver.org/ with regard to communicating library API changes. Given the project's pace of evolution, most releases are likely to be MAJOR releases according to SemVer principles.

3. RELEASES: Releases of the project will be named according to the LIBRARY version. A major release may or may not change the FORMAT version. When a LIBRARY version has been released for a new FORMAT version, the latter is considered to be released and official.

4. Each LIBRARY version will have a corresponding FORMAT version. For example, LIBRARY versions 2.0.0 and 3.0.0 may track FORMAT version 1.0.0. The idea is that the FORMAT version will change less often than the LIBRARY version.

5. BACKWARD COMPATIBILITY GUARANTEE: A newer versioned client library will be able to read any data and metadata produced by an older client library.

6. FORWARD COMPATIBILITY GUARANTEE: An older client library must be able to either read data generated from a new client library or detect that it cannot properly read the data.

7. FORMAT MINOR VERSIONS: An increase in the minor version of the FORMAT version, such as 1.0.0 to 1.1.0, indicates that 1.1.0 contains new features not available in 1.0.0. So long as these features are not used (such as a new logical data type), forward compatibility is preserved.

8. FORMAT MAJOR VERSIONS: A change in the FORMAT major version indicates a disruption to these compatibility guarantees in some way. Hopefully we don't have to do this many times in our respective lifetimes.

If I've misrepresented some aspect of the proposal it's fine to discuss more and we can start a new vote.

Please vote to approve this proposal. I'd like to keep this vote open for 7 days (until Friday August 2) to allow ample opportunity for the community to have a look.

[ ] +1 Adopt these version conventions and compatibility guarantees as of Apache Arrow 1.0.0
[ ] +0
[ ] -1 I disagree because...

Here is my vote: +1

Thanks
Wes

[1]: https://lists.apache.org/thread.html/5715a4d402c835d22d929a8069c5c0cf232077a660ee98639d544af8@%3Cdev.arrow.apache.org%3E
[2]: https://docs.google.com/document/d/1uBitWu57rDu85tNHn0NwstAbrlYqor9dPFg_7QaE-nc/edit#
Re: [jira] [Created] (ARROW-6047) [Rust] Rust nightly 1.38.0 builds failing
The issue seems to be due to a new release of the parquet-format crate yesterday rather than an issue with Rust nightly. On Fri, Jul 26, 2019 at 1:06 PM Wes McKinney (JIRA) wrote: > Wes McKinney created ARROW-6047: > --- > > Summary: [Rust] Rust nightly 1.38.0 builds failing > Key: ARROW-6047 > URL: https://issues.apache.org/jira/browse/ARROW-6047 > Project: Apache Arrow > Issue Type: Bug > Components: Rust > Reporter: Wes McKinney > Fix For: 1.0.0 > > > see > > * https://travis-ci.org/apache/arrow/jobs/563893205 > * https://ci.ursalabs.org/#/builders/93/builds/669/steps/2/logs/stdio > > > > -- > This message was sent by Atlassian JIRA > (v7.6.14#76016) >
[jira] [Created] (ARROW-6047) [Rust] Rust nightly 1.38.0 builds failing
Wes McKinney created ARROW-6047: --- Summary: [Rust] Rust nightly 1.38.0 builds failing Key: ARROW-6047 URL: https://issues.apache.org/jira/browse/ARROW-6047 Project: Apache Arrow Issue Type: Bug Components: Rust Reporter: Wes McKinney Fix For: 1.0.0 see * https://travis-ci.org/apache/arrow/jobs/563893205 * https://ci.ursalabs.org/#/builders/93/builds/669/steps/2/logs/stdio -- This message was sent by Atlassian JIRA (v7.6.14#76016)
Procedure for seeing through implementation of new Arrow format features
hi folks,

In https://github.com/apache/arrow/pull/4921 Antoine has implemented binary and unicode type variants with 64-bit offsets (to permit payloads exceeding 2GB). I believe Micah plans to follow on with a Java implementation in a separate PR.

We have previously discussed that we should ensure we have multiple tested implementations of new Arrow columnar format features -- but we have not discussed how such changes are to be sequenced.

For the avoidance of doubt, my view on the procedure is that we should not _release_ the project containing additions to the protocol files without at least 2 independent implementations of those features (with integration tests, where relevant). I think it's OK for one implementation to land in master and then the other in a follow-up PR.

We'd like to get https://github.com/apache/arrow/pull/4921 merged soon (today or tomorrow if possible), so if anyone has a different opinion about how to stage new format implementations in a non-awkward way, could you please chime in?

Thank you,
Wes
[jira] [Created] (ARROW-6046) Slice RecordBatch of String array with offset 0
Sascha Hofmann created ARROW-6046: - Summary: Slice RecordBatch of String array with offset 0 Key: ARROW-6046 URL: https://issues.apache.org/jira/browse/ARROW-6046 Project: Apache Arrow Issue Type: Bug Components: C++, Python Affects Versions: 0.14.1 Reporter: Sascha Hofmann

We are seeing a very similar bug as in ARROW-809, just for a RecordBatch of strings. A slice of a RecordBatch with a string column and offset = 0 returns the whole batch instead.

{code:python}
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({'b': ['test' for x in range(1000_000)]})
tbl = pa.Table.from_pandas(df)
batch = tbl.to_batches()[0]
batch.slice(0, 2).serialize().size  # 4000232
batch.slice(1, 2).serialize().size  # 240
{code}

-- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6045) Benchmark for Parquet float and NaN encoding/decoding
Itamar Turner-Trauring created ARROW-6045: - Summary: Benchmark for Parquet float and NaN encoding/decoding Key: ARROW-6045 URL: https://issues.apache.org/jira/browse/ARROW-6045 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Itamar Turner-Trauring It is possible that at one point Parquet NaN encoding was slower, so it's worth extending the benchmarks for Parquet to cover floats in general and NaNs in particular. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6044) Pyarrow HDFS client gets hung after a while
Fred Tzeng created ARROW-6044: - Summary: Pyarrow HDFS client gets hung after a while Key: ARROW-6044 URL: https://issues.apache.org/jira/browse/ARROW-6044 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.13.0 Environment: hadoop-3.0.3 driver='libhdfs' python 3.6 Centos7 Reporter: Fred Tzeng

I'm using the pyarrow HDFS client in a long-running (forever) app that makes connections to HDFS as external requests come in and destroys the connection as soon as the request is handled. This happens a large number of times on separate threads and everything works great. The problem is, after the app idles for a while (perhaps hours) and no HDFS connections are made during this time, when the next connection is attempted, the API hdfs.connect(...) just hangs. No exceptions are thrown.

Code snippet of what I'm doing to instantiate each connection:

...
hdfs = pyarrow.hdfs.connect(self.hdfs_authority, self.hdfs_port, user=self.hdfs_user)
try:
    # Do something
finally:
    hdfs.close()

Any help on what might be causing these hangs is appreciated.

-- This message was sent by Atlassian JIRA (v7.6.14#76016)
Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)
> It's not just computation libraries, it's any library peeking inside Arrow data. Currently, the Arrow data types are simple, which makes it easy and non-intimidating to build data processing utilities around them. If we start adding sophisticated encodings, we also raise the cost of supporting Arrow for third-party libraries.

This is another legitimate concern about complexity. To try to limit complexity, I simplified the proposal PR [1] to only one buffer encoding scheme (FrameOfReferenceIntEncoding) and one array encoding scheme (RLE), which I think will have the most benefit if exploited properly. Compression is removed.

I'd like to get closure on the proposal one way or another. I think the question to be answered now is whether we are willing to introduce the additional complexity for the performance improvements these encodings can yield. Is there more data that people would like to see that would influence their decision?

Thanks,
Micah

[1] https://github.com/apache/arrow/pull/4815

On Mon, Jul 22, 2019 at 8:59 AM Antoine Pitrou wrote:
> On Mon, 22 Jul 2019 08:40:08 -0700 Brian Hulette wrote:
> > To me, the most important aspect of this proposal is the addition of sparse encodings, and I'm curious if there are any more objections to that specifically. So far I believe the only one is that it will make computation libraries more complicated. This is absolutely true, but I think it's worth that cost.
>
> It's not just computation libraries, it's any library peeking inside Arrow data. Currently, the Arrow data types are simple, which makes it easy and non-intimidating to build data processing utilities around them. If we start adding sophisticated encodings, we also raise the cost of supporting Arrow for third-party libraries.
>
> Regards
> Antoine.
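For readers unfamiliar with the two surviving schemes, the underlying ideas can be illustrated with a minimal plain-Python sketch (this shows the concepts only, not the buffer layouts proposed in the PR):

```python
def rle_encode(values):
    """Run-length encode a sequence into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1  # extend the current run
        else:
            runs.append([v, 1])  # start a new run
    return [(v, n) for v, n in runs]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into a flat list."""
    out = []
    for v, n in runs:
        out.extend([v] * n)
    return out


def frame_of_reference_encode(ints):
    """Store the minimum once plus small non-negative offsets.

    The offsets can then be packed into narrower integers than the
    original values required.
    """
    base = min(ints)
    return base, [v - base for v in ints]
```

Both schemes win when the data cooperates: long runs of repeated values for RLE, and values clustered in a narrow range for frame-of-reference.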