Re: Datasets and Java

2019-11-27 Thread Hongze Zhang
Thanks for referencing this, Antoine. The concepts and principles seem to be pretty concrete, so I may take some time to read it in detail. BTW, I noticed from the current discussion in ticket ARROW-7272[1] that it's still unclear whether this one or IPC flatbuffers would be a better approach for J

Re: PyArrow.Table schema.metadata issue

2019-11-27 Thread Maarten Ballintijn
Hi Aaron, The schema is immutable, add_metadata returns a new schema object which includes the metadata. So I think this does what you want: schema = schema.add_metadata(meta) If not, experts will chime in hopefully. Cheers, Maarten. > On Nov 28, 2019, at 12:41 AM, Aaron Chu wrote: > > De

Re: PyArrow.Table schema.metadata issue

2019-11-27 Thread Aaron Chu
Dear all, I need your help regarding the pyarrow.table.schema. I tried to create a schema and use the with_metadata/add_metadata functions to add the metadata (a Python dict) to the schema. However, nothing showed up when I ran 'schema.metadata'. I can't get the metadata added to the schema. This is

Re: Datasets and Java

2019-11-27 Thread Ji Liu
Hi Francois, Thanks for the proposal and your effort. I made a simple JNI poc before for RecordBatch/VectorSchemaRoot interaction between Java and C++[1][2]. This may help a little. Thanks, Ji Liu [1] https://github.com/tianchen92/jni-poc-java [2] https://github.com/tianchen92/jni-poc-cpp

Re: Non-chunked large files / hdf5 support

2019-11-27 Thread Wes McKinney
hi, There have been a number of discussions over the years about on-disk pre-allocation strategies. No volunteers have implemented anything, though. Developing an HDF5 integration library with pre-allocation and buffer management utilities seems like a reasonable growth area for the project. The f

Re: [Discuss][FlightRPC] Extensions to Flight: "DoBidirectional"

2019-11-27 Thread David Li
Thanks for the feedback. I do think if we had explicitly embraced gRPC from the beginning, there are a lot of places where things could be made more ergonomic, including with the metadata fields. But it would also have locked us out of potential future transports. On another note: I hesitate to p

Re: Datasets and Java

2019-11-27 Thread Francois Saint-Jacques
Hello Hongze, The C++ implementation of dataset, notably the Dataset, DataSource, DataSourceDiscovery, and Scanner classes, is not ready/designed for distributed computing. These classes don't serialize and they reference by pointer all around, so I highly doubt that you can implement parts in Java, and some

Re: Apache Arrow sync now

2019-11-27 Thread Francois Saint-Jacques
Attendees: - Micah Kornfield, Google - Praveen Kumar, Dremio - Todd Hendricks - François Saint-Jacques, RStudio/Ursa Labs Subject - Bazel. Micah wants feedback on the PR. This is aimed first at developer productivity, notably shorter link times and sandboxed builds. As a first PoC, parts of the Python

[jira] [Created] (ARROW-7272) [C++][Java] JNI bridge between RecordBatch and VectorSchemaRoot

2019-11-27 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-7272: - Summary: [C++][Java] JNI bridge between RecordBatch and VectorSchemaRoot Key: ARROW-7272 URL: https://issues.apache.org/jira/browse/ARROW-7272 Proje

Re: Strategy for mixing large_string and string with chunked arrays

2019-11-27 Thread Wes McKinney
On Tue, Nov 26, 2019 at 9:40 AM Maarten Breddels wrote: > > On Tue, Nov 26, 2019 at 15:02, Wes McKinney wrote: > > > hi Maarten > > > > I opened https://issues.apache.org/jira/browse/ARROW-7245 in part based > > on this. > > > > I think that normalizing to a common type (which would require castin

Re: [DISCUSS][C++/Python] Bazel example

2019-11-27 Thread Micah Kornfield
> > I don't get how this is a cycle. It only means Bazel is too limited to > distinguish between a header dependency and a C++ module? Agreed, this isn't a true cycle, but Bazel is opinionated about this (i.e. forces workarounds). In the example I highlighted it might have been cleaner to take

Re: [Discuss][FlightRPC] Extensions to Flight: "DoBidirectional"

2019-11-27 Thread Jacques Nadeau
Fair enough. I'm okay with the bytes approach and the proposal looks good to me. On Fri, Nov 8, 2019 at 11:37 AM David Li wrote: > I've updated the proposal. > > On the subject of Protobuf Any vs bytes, and how to handle > errors/metadata, I still think using bytes is preferable: > - It doesn't

Apache Arrow sync now

2019-11-27 Thread Wes McKinney
https://meet.google.com/vtm-teks-phx I'm unable to join on account of the Thanksgiving holiday, but others are welcome to discuss and share call notes after

Re: [DISCUSS][C++/Python] Bazel example

2019-11-27 Thread Antoine Pitrou
On 27/11/2019 at 06:16, Micah Kornfield wrote: > >> Can you give an example of circular dependency? Can this be solved by >> having more "type_fwd.h" headers for forward declarations of opaque types? > > I think the type_fwd.h might contribute to the problem. The solution would > be more gr

Re: [NIGHTLY] Arrow Build Report for Job nightly-2019-11-27-0

2019-11-27 Thread Krisztián Szűcs
The Flight compilation errors occurring in the Conda builds are caused by a recent protobuf conda-forge update and should be fixed by https://github.com/apache/arrow/pull/5917 On Wed, Nov 27, 2019 at 2:01 PM Crossbow wrote: > > Arrow Build Report for Job nightly-2019-11-27-0 > > All tasks: > http

[jira] [Created] (ARROW-7271) [C++][Flight] Use the single parameter version of SetTotalBytesLimit

2019-11-27 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-7271: -- Summary: [C++][Flight] Use the single parameter version of SetTotalBytesLimit Key: ARROW-7271 URL: https://issues.apache.org/jira/browse/ARROW-7271 Project: Apach

[NIGHTLY] Arrow Build Report for Job nightly-2019-11-27-0

2019-11-27 Thread Crossbow
Arrow Build Report for Job nightly-2019-11-27-0 All tasks: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0 Failed Tasks: - homebrew-cpp: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0-travis-homebrew-cpp - test-conda-cpp: U

Re: Datasets and Java

2019-11-27 Thread Antoine Pitrou
To set up bridges between Java and C++, the C data interface specification may help: https://github.com/apache/arrow/pull/5442 There's an implementation for C++ here, and it also includes a Python-R bridge able to share Arrow data between two different runtimes (i.e. PyArrow and R-Arrow were com

Re: Datasets and Java

2019-11-27 Thread Hongze Zhang
Hi Micah, Regarding our use cases, we'd use the API on Parquet files with some pushed filters and projectors, and we'd extend the C++ Datasets code to provide necessary support for our own data formats. > If JNI is seen as too cumbersome, another possible avenue to pursue is > writing a gRPC

[jira] [Created] (ARROW-7269) [C++] Fix arrow::parquet compiler warning

2019-11-27 Thread Jiajia Li (Jira)
Jiajia Li created ARROW-7269: Summary: [C++] Fix arrow::parquet compiler warning Key: ARROW-7269 URL: https://issues.apache.org/jira/browse/ARROW-7269 Project: Apache Arrow Issue Type: Improvemen

[jira] [Created] (ARROW-7270) [Go] preserve CSV reading behaviour, improve memory usage

2019-11-27 Thread Sebastien Binet (Jira)
Sebastien Binet created ARROW-7270: -- Summary: [Go] preserve CSV reading behaviour, improve memory usage Key: ARROW-7270 URL: https://issues.apache.org/jira/browse/ARROW-7270 Project: Apache Arrow

Re: Datasets and Java

2019-11-27 Thread Micah Kornfield
Hi Hongze, I have a strong preference for not porting non-trivial logic from one language to another, especially if the main goal is performance. I think this will replicate bugs and cause confusion if inconsistencies occur. It is also a non-trivial amount of work to develop, review, set up CI, et