Re: Datasets and Java

2019-11-27 Thread Hongze Zhang
Thanks for referencing this, Antoine. The concepts and principles seem to be 
pretty concrete so I
may take some time to read it in detail.

BTW I noticed from the current discussion in ticket ARROW-7272 [1] that it's 
not yet clear whether this approach or the IPC flatbuffers one would be 
better for Java/C++ interchange. Is that right?

Best,
Hongze

[1] https://issues.apache.org/jira/browse/ARROW-7272



On Wed, 2019-11-27 at 11:19 +0100, Antoine Pitrou wrote:
> To set up bridges between Java and C++, the C data interface
> specification may help:
> https://github.com/apache/arrow/pull/5442
> 
> There's an implementation for C++ here, and it also includes a Python-R
> bridge able to share Arrow data between two different runtimes (i.e.
> PyArrow and R-Arrow were compiled potentially using different
> toolchains, with different ABIs):
> https://github.com/apache/arrow/pull/5608
> 
> Regards
> 
> Antoine.
> 
> 
> 
> Le 27/11/2019 à 11:16, Hongze Zhang a écrit :
> > Hi Micah,
> > 
> > 
> > Regarding our use cases, we'd use the API on Parquet files with some pushed 
> > filters and
> > projectors, and we'd extend the C++ Datasets code to provide necessary 
> > support for our own data
> > formats.
> > 
> > 
> > > If JNI is seen as too cumbersome, another possible avenue to pursue is
> > > writing a gRPC wrapper around the DataSet metadata capabilities.  One 
> > > could
> > > then create a facade on top of that for Java.  For data reads, I can see
> > > either building a Flight server or directly use the JNI readers.
> > 
> > Thanks for your suggestion but I'm not entirely getting it. Does this mean 
> > to start some
> > individual gRPC/Flight server process to deal with the metadata/data 
> > exchange problem between
> > Java and C++ Datasets? If yes, then in some cases, doesn't it easily 
> > introduce bigger problems
> > about life cycle and resource management of the processes? Please correct 
> > me if I misunderstood
> > your point.
> > 
> > 
> > And IMHO I don't strongly object to the possible inconsistencies and bugs
> > brought by a Java port of something like the Datasets framework.
> > Inconsistencies are usually somewhat inevitable between two different
> > languages' implementations of the same component, but there is supposed to
> > be a trade-off based on whether the implementations are worth providing. I
> > didn't have a chance to fully investigate the requirements of Datasets-Java
> > from other projects so I'm not 100% sure, but functionality such as source
> > discovery, predicate pushdown, and multi-format support could be powerful
> > for many scenarios. Anyway I'm totally with you that the amount of work
> > could be huge and bugs might be introduced. So my goal is to start from a
> > small piece of the APIs to minimize the initial work. What do you think?
> > 
> > 
> > Thanks,
> > Hongze
> > 
> > 
> > 
> > At 2019-11-27 16:00:35, "Micah Kornfield"  wrote:
> > > Hi Hongze,
> > > I have a strong preference for not porting non-trivial logic from one
> > > language to another, especially if the main goal is performance.  I think
> > > this will replicate bugs and cause confusion if inconsistencies occur.  It
> > > is also a non-trivial amount of work to develop, review, setup CI, etc.
> > > 
> > > If JNI is seen as too cumbersome, another possible avenue to pursue is
> > > writing a gRPC wrapper around the DataSet metadata capabilities.  One 
> > > could
> > > then create a facade on top of that for Java.  For data reads, I can see
> > > either building a Flight server or directly use the JNI readers.
> > > 
> > > In either case this is a non-trivial amount of work, so I at least,
> > > would appreciate a short write-up (1-2 pages) explicitly stating
> > > goals/use-cases for the library and a high level design (component 
> > > overview
> > > and relationships between components and how it will co-exist with 
> > > existing
> > > Java code).  If I understand correctly, one goal is to use this as a basis
> > > for a new Spark DataSet API with better performance than the vectorized
> > > spark parquet reader?  Are there others?
> > > 
> > > Wes, what are your thoughts on this?
> > > 
> > > Thanks,
> > > Micah
> > > 
> > > 
> > > On Tue, Nov 26, 2019 at 10:51 PM Hongze Zhang  wrote:
> > > 
> > > > Hi Wes and Micah,
> > > > 
> > > > 
> > > > Thanks for your kind reply.
> > > > 
> > > > 
> > > > Micah: We don't use the Spark (vectorized) parquet reader because it is
> > > > a pure Java implementation; performance could be worse than doing
> > > > similar work natively. Another reason is that we may need to integrate
> > > > some other specific data sources with Arrow datasets; to limit the
> > > > workload, we would like to maintain a common read pipeline for both
> > > > those and other widely used data sources like Parquet and CSV.
> > > > 
> > > > 
> > > > Wes: Yes, Datasets framework along with Parquet/CSV/... reader
> > > > 

Re: PyArrow.Table schema.metadata issue

2019-11-27 Thread Aaron Chu
Dear all,

I need your help regarding the pyarrow.table.schema.

I tried to create a schema and use with_metadata/add_metadata functions to
add the metadata (a python dict) to the schema. However, nothing showed up
when I run 'schema.metadata'. I can't get the metadata added to the schema.

This issue can be easily reproduced on python2 and 3:

import pyarrow as pa
schema = pa.schema([pa.field('Event_ID', pa.int64())])
meta = {}
meta['test'] = 'testval'
schema.add_metadata(meta)
#schema.with_metadata(meta)
schema.metadata

Thanks for your help!!

Best Regards,
Aaron Chu
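
For reference, a likely explanation (hedged, based on pyarrow's documented
behaviour): Schema objects are immutable, so add_metadata/with_metadata return
a new schema rather than modifying the one they are called on, and the
keys/values are stored as bytes. A minimal sketch of the working pattern:

import pyarrow as pa

schema = pa.schema([pa.field('Event_ID', pa.int64())])
# with_metadata returns a new Schema; rebind the name to keep the result.
schema = schema.with_metadata({'test': 'testval'})
print(schema.metadata)  # expected: {b'test': b'testval'}

The original snippet discards the schema returned by add_metadata/with_metadata,
which is why schema.metadata still shows nothing afterwards.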


On Wed, Nov 27, 2019 at 9:37 PM Aaron Chu  wrote:

> Dear all,
>
> I need your help regarding the pyarrow.table.schema.
>
> I tried to create a schema and use with_metadata/add_metadata functions to
> add the metadata (a python dict) to the schema. However, nothing showed up
> when I run 'schema.metadata'. I can't get the metadata added to the schema.
>
> This issue can be easily reproduced on python2 and 3:
>
> import pyarrow as pa
> schema = pa.schema([pa.field('Event_ID', pa.int64())])
> meta = {}
> meta['test'] = 'testval'
> schema.add_metadata(meta)
> #schema.with_metadata(meta)
> schema.metadata
>
> Thanks for your help!!
>
> Best Regards,
> Aaron Chu
>


Re: Datasets and Java

2019-11-27 Thread Ji Liu
Hi Francois, 

Thanks for the proposal and your effort.
I made a simple JNI PoC a while ago for RecordBatch/VectorSchemaRoot interaction 
between Java and C++ [1][2].
This may help a little.


Thanks,
Ji Liu


[1] https://github.com/tianchen92/jni-poc-java
[2] https://github.com/tianchen92/jni-poc-cpp




--
From:Francois Saint-Jacques 
Send Time:2019-11-28 (Thursday) 05:08
To:dev 
Subject:Re: Datasets and Java

Hello Hongze,

The C++ implementation of datasets, notably the Dataset, DataSource,
DataSourceDiscovery, and Scanner classes, is not ready/designed for
distributed computing. The objects don't serialize and they reference by
pointer all around, thus I highly doubt that you can implement parts
in Java and some in C++ with minimal effort and complexity. You can
think of Dataset/DataSource as similar to the Hive Metastore, but
local (single node) and in-memory. Due to the previous limitations, I
fail to see how one could use it with the execution model of Spark,
e.g. construct all the manifests on the driver via Dataset/Scanner and
pass the ScanTasks to executors. One cannot construct a ScanTask out of
thin air; it needs a DataFragment (or FileFormat in the case of
FileDataFragment).

Having said that, I think I understand where you want to go. The
FileFormat::ScanFile method has what you want without the overhead of
the full dataset API. It acts as an interface for interacting with a file
format, paired with predicate pushdown and column selection options.
This is where I would start:

- Create a JNI bridge between a C++ RecordBatch and Java VectorSchemaRoot [1]
- Create a C++ helper method `Result<std::shared_ptr<Table>>
ScanFile(FileSource source, FileFormat& fmt,
std::shared_ptr<ScanOptions> options, std::shared_ptr<ScanContext> context)`.
The goal of this method is similar to `Scanner::ToTable`, i.e. hide
the local scheduling details of ScanTask. Thus you don't need to
expose ScanTask.
- Create a JNI binding to the previous helper and all the class
dependencies needed to construct the parameters (FileSource, FileFormat,
ScanOptions, ScanContext).
This is where it gets cumbersome: ScanOptions has an Expression, which may
not be easy to build ad hoc; FileSource needs a fs::FileSystem;
ScanContext needs a MemoryPool; etc. You may hide this via helper
methods, which is what the R binding does.

Your PoC can probably get away with a trivial
`Result<std::shared_ptr<Table>> ScanParquetFile(std::string path, Expr&
filter, std::vector<std::string> columns)` without exposing all the
details and using the "defaults". Thus you only need to wrap a method
(ScanParquetFile) and Expression in your JNI bridge.

Pros:
- Access to native file readers with uniform predicate pushdown (when
the file format supports it) and column selection options. Filtering is
done natively in C++.
- Enables usage in distributed computing, since the only information
passed is the path, the expression (which will need a translation), and
the list of columns, all of which are tractable to serialize.
- Bonus: you may even get transparent access to Gandiva [2]

Cons:
- No predicate pushdown on file partitions (e.g. extracted from the path),
because this information lives in the DataSource
- ScanOptions is built by ScannerBuilder; there's a lot of validation
hidden under the hood via DataSource, DataSourceDiscovery and
ScannerBuilder. It's easy to get an error with a malformed
ScanOptions.
- No access to non-file DataSources, e.g. in the future we might have an
OdbcDataSource and a FlightDataSource

Basically, dataset::FileFormat is meant to be a unified interface to
interact with file formats. Here's an example of such usage without
all the dataset machinery [3].

François

[1] https://issues.apache.org/jira/browse/ARROW-7272
[2] https://issues.apache.org/jira/browse/ARROW-6953
[3] 
https://github.com/apache/arrow/blob/61c8b1b80039119d5905660289dd53a3130ce898/cpp/src/arrow/dataset/file_parquet_test.cc#L345-L393










On Wed, Nov 27, 2019 at 5:17 AM Hongze Zhang  wrote:
>
> Hi Micah,
>
>
> Regarding our use cases, we'd use the API on Parquet files with some pushed 
> filters and projectors, and we'd extend the C++ Datasets code to provide 
> necessary support for our own data formats.
>
>
> > If JNI is seen as too cumbersome, another possible avenue to pursue is
> > writing a gRPC wrapper around the DataSet metadata capabilities.  One could
> > then create a facade on top of that for Java.  For data reads, I can see
> > either building a Flight server or directly use the JNI readers.
>
>
> Thanks for your suggestion but I'm not entirely getting it. Does this mean to 
> start some individual gRPC/Flight server process to deal with the 
> metadata/data exchange problem between Java and C++ Datasets? If yes, then in 
> some cases, doesn't it easily introduce bigger problems about life cycle and 
> resource management of the processes? Please correct me if I misunderstood 
> your point.
>
>
> And IMHO I don't strongly object to the possible inconsistencies and bugs brought 
> by a Java port of something like 

Re: Datasets and Java

2019-11-27 Thread Francois Saint-Jacques
Hello Hongze,

The C++ implementation of datasets, notably the Dataset, DataSource,
DataSourceDiscovery, and Scanner classes, is not ready/designed for
distributed computing. The objects don't serialize and they reference by
pointer all around, thus I highly doubt that you can implement parts
in Java and some in C++ with minimal effort and complexity. You can
think of Dataset/DataSource as similar to the Hive Metastore, but
local (single node) and in-memory. Due to the previous limitations, I
fail to see how one could use it with the execution model of Spark,
e.g. construct all the manifests on the driver via Dataset/Scanner and
pass the ScanTasks to executors. One cannot construct a ScanTask out of
thin air; it needs a DataFragment (or FileFormat in the case of
FileDataFragment).

Having said that, I think I understand where you want to go. The
FileFormat::ScanFile method has what you want without the overhead of
the full dataset API. It acts as an interface for interacting with a file
format, paired with predicate pushdown and column selection options.
This is where I would start:

- Create a JNI bridge between a C++ RecordBatch and Java VectorSchemaRoot [1]
- Create a C++ helper method `Result<std::shared_ptr<Table>>
ScanFile(FileSource source, FileFormat& fmt,
std::shared_ptr<ScanOptions> options, std::shared_ptr<ScanContext> context)`.
The goal of this method is similar to `Scanner::ToTable`, i.e. hide
the local scheduling details of ScanTask. Thus you don't need to
expose ScanTask.
- Create a JNI binding to the previous helper and all the class
dependencies needed to construct the parameters (FileSource, FileFormat,
ScanOptions, ScanContext).
This is where it gets cumbersome: ScanOptions has an Expression, which may
not be easy to build ad hoc; FileSource needs a fs::FileSystem;
ScanContext needs a MemoryPool; etc. You may hide this via helper
methods, which is what the R binding does.

Your PoC can probably get away with a trivial
`Result<std::shared_ptr<Table>> ScanParquetFile(std::string path, Expr&
filter, std::vector<std::string> columns)` without exposing all the
details and using the "defaults". Thus you only need to wrap a method
(ScanParquetFile) and Expression in your JNI bridge.
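
For intuition, that narrow surface (a path in, a column selection and a filter
pushed down, a Table out) is roughly what pyarrow already exposes for Parquet on
the Python side. A hedged sketch, assuming pyarrow.parquet.read_table; the file
path and column names below are made up:

import pyarrow.parquet as pq

# Only the requested columns are read; the selection is pushed down to the
# native Parquet reader rather than filtered after the fact.
table = pq.read_table("data/example.parquet", columns=["event_id", "value"])
print(table.schema)

The Java facade sketched above would offer the same kind of narrow surface over
JNI, with the filter expression translated before the call.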

Pros:
- Access to native file readers with uniform predicate pushdown (when
the file format supports it) and column selection options. Filtering is
done natively in C++.
- Enables usage in distributed computing, since the only information
passed is the path, the expression (which will need a translation), and
the list of columns, all of which are tractable to serialize.
- Bonus: you may even get transparent access to Gandiva [2]

Cons:
- No predicate pushdown on file partitions (e.g. extracted from the path),
because this information lives in the DataSource
- ScanOptions is built by ScannerBuilder; there's a lot of validation
hidden under the hood via DataSource, DataSourceDiscovery and
ScannerBuilder. It's easy to get an error with a malformed
ScanOptions.
- No access to non-file DataSources, e.g. in the future we might have an
OdbcDataSource and a FlightDataSource

Basically, dataset::FileFormat is meant to be a unified interface to
interact with file formats. Here's an example of such usage without
all the dataset machinery [3].

François

[1] https://issues.apache.org/jira/browse/ARROW-7272
[2] https://issues.apache.org/jira/browse/ARROW-6953
[3] 
https://github.com/apache/arrow/blob/61c8b1b80039119d5905660289dd53a3130ce898/cpp/src/arrow/dataset/file_parquet_test.cc#L345-L393










On Wed, Nov 27, 2019 at 5:17 AM Hongze Zhang  wrote:
>
> Hi Micah,
>
>
> Regarding our use cases, we'd use the API on Parquet files with some pushed 
> filters and projectors, and we'd extend the C++ Datasets code to provide 
> necessary support for our own data formats.
>
>
> > If JNI is seen as too cumbersome, another possible avenue to pursue is
> > writing a gRPC wrapper around the DataSet metadata capabilities.  One could
> > then create a facade on top of that for Java.  For data reads, I can see
> > either building a Flight server or directly use the JNI readers.
>
>
> Thanks for your suggestion but I'm not entirely getting it. Does this mean to 
> start some individual gRPC/Flight server process to deal with the 
> metadata/data exchange problem between Java and C++ Datasets? If yes, then in 
> some cases, doesn't it easily introduce bigger problems about life cycle and 
> resource management of the processes? Please correct me if I misunderstood 
> your point.
>
>
> And IMHO I don't strongly object to the possible inconsistencies and bugs brought 
> by a Java port of something like the Datasets framework. Inconsistencies 
> are usually somewhat inevitable between two different languages' 
> implementations of the same component, but there is supposed to be a 
> trade-off based on whether the implementations are worth providing. I 
> didn't have a chance to fully investigate the requirements of Datasets-Java 
> from other projects so I'm not 100% sure, but functionality such as source 
> discovery, predicate pushdown, multi-format 

Re: Apache Arrow sync now

2019-11-27 Thread Francois Saint-Jacques
Attendees:
- Micah Kornfield, Google
- Praveen Kumar, Dremio
- Todd Hendricks
- François Saint-Jacques RStudio/Ursa Labs

Subjects
- Bazel. Micah wants feedback on the PR. This is aimed first at
developer productivity, notably shorter link times and sandboxed builds.
As a first PoC, parts of the Python library can be built directly from
Bazel.
- Java/C++ bridge. Micah wanted to bring up the subject of a JNI bridge and
other methods to access C++ code from Java. Francois proposed that we
first create a JNI PoC to pass a RecordBatch from C++ to Java [1].

[1] https://issues.apache.org/jira/browse/ARROW-7272

On Wed, Nov 27, 2019 at 12:01 PM Wes McKinney  wrote:
>
> https://meet.google.com/vtm-teks-phx
>
> I'm unable to join on account of the Thanksgiving holiday, but others
> are welcome to discuss and share call notes after


[jira] [Created] (ARROW-7272) [C++][Java] JNI bridge between RecordBatch and VectorSchemaRoot

2019-11-27 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-7272:
-

 Summary: [C++][Java] JNI bridge between RecordBatch and 
VectorSchemaRoot
 Key: ARROW-7272
 URL: https://issues.apache.org/jira/browse/ARROW-7272
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Java
Reporter: Francois Saint-Jacques


Given a C++ std::shared_ptr<RecordBatch>, retrieve it in Java as a 
VectorSchemaRoot. Gandiva already offers a similar facility but with raw 
buffers. It would be convenient if users could call C++ code that yields a 
RecordBatch and retrieve it in a seamless fashion.

This would remove one roadblock to using the C++ dataset facility from Java.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Strategy for mixing large_string and string with chunked arrays

2019-11-27 Thread Wes McKinney
On Tue, Nov 26, 2019 at 9:40 AM Maarten Breddels
 wrote:
>
> Op di 26 nov. 2019 om 15:02 schreef Wes McKinney :
>
> > hi Maarten
> >
> > I opened https://issues.apache.org/jira/browse/ARROW-7245 in part based
> > on this.
> >
> > I think that normalizing to a common type (which would require casting
> > the offsets buffer, but not the data -- which can be shared -- so not
> > too wasteful) during concatenation would be the approach I would take.
> > I would be surprised if normalizing string offsets during record batch
> > / table concatenation showed up as a performance or memory use issue
> > relative to other kinds of operations -- in theory the
> > string->large_string promotion should be relatively exceptional (< 5%
> > of the time). I've found in performance tests that creating many
> > smaller array chunks is faster anyway due to interplay with the memory
> > allocator.
> >
>
> Yes, I think it is rare, but it does mean that if a user wants to convert a
> Vaex dataframe to an Arrow table it might use GB's of RAM (thinking ~1
> billion rows). Ideally, it would use zero RAM (imagine concatenating many
> large memory-mapped datasets).
> I'm ok living with this limitation, but I wanted to raise it before v1.0
> goes out.
>

The 1.0 release is about hardening the format and protocol, which
wouldn't be affected by this discussion. The Binary/String and
LargeBinary/LargeString are distinct memory layouts and so they need
to be separate at the protocol level.

At the C++ library / application level there's plenty that could be
done if this turned out to be an issue. For example, an ExtensionType
could be defined that allows the storage to be either 32-bit or
64-bit.
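
To make the normalization approach quoted above concrete, here is a minimal
sketch in pyarrow, assuming the string -> large_string cast (ARROW-6071) is
available; conceptually only the offsets need rewriting, the character data
stays the same:

import pyarrow as pa

# Hypothetical chunks with different offset widths.
small = pa.array(["a", "bb"], type=pa.string())          # 32-bit offsets
large = pa.array(["ccc", "d"], type=pa.large_string())   # 64-bit offsets

# Normalize both chunks to the wider type before combining them.
chunks = [chunk.cast(pa.large_string()) for chunk in (small, large)]
combined = pa.chunked_array(chunks)
assert combined.type == pa.large_string()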

>
>
> >
> > Of course I think we should have string kernels for both 32-bit and
> > 64-bit variants. Note that Gandiva already has significant string
> > kernel support (for 32-bit offsets at the moment) and there is
> > discussion about pre-compiling the LLVM IR into a shared library to
> > not introduce an LLVM runtime dependency, so we could maintain a
> > single code path for string algorithms that can be used both in a
> > JIT-ed setting as well as pre-compiled / interpreted setting. See
> > https://issues.apache.org/jira/browse/ARROW-7083
>
>
> That is a very interesting approach, thanks for sharing that resource, I'll
> consider that.
>
>
> > Note that many analytic database engines (notably: Dremio, which is
> > natively Arrow-based) don't support exceeding the 2GB / 32-bit limit
> > at all and it does not seem to be an impedance in practical use. We
> > have the Chunked* builder classes [1] in C++ to facilitate the
> > creation of chunked binary arrays where there is concern about
> > overflowing the 2GB limit.
> >
> > Others may have different opinions so I'll let them comment.
> >
>
> Yes, I think in many cases it's not a problem at all. Also in vaex, all the
> processing happens in chunks, and no chunk will ever be that large (for the
> near future...).
> In vaex, when exporting to hdf5, I always write in 1 chunk, and that's
> where most of my issues show up.

I see. Ideally one would architect around the chunked model since this
seems to have the best overall performance and scalability qualities.

>
> cheers,
>
> Maarten
>
>
> >
> > - Wes
> >
> > [1]:
> > https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/builder_binary.h#L510
> >
> > On Tue, Nov 26, 2019 at 7:44 AM Maarten Breddels
> >  wrote:
> > >
> > > Hi Arrow devs,
> > >
> > > Small intro: I'm the main developer of Vaex, an out-of-core dataframe
> > > library for Python - https://github.com/vaexio/vaex - and we're
> > > looking into moving Vaex to use Apache Arrow for the data structure.
> > > At the beginning of this year, we added string support in Vaex, which
> > > required 64-bit offsets. Those were not available back then, so we
> > > added our own data structure for string arrays. Our first step to move
> > > to Apache Arrow is to see if we can use Arrow for the data structure,
> > > and later on, move the string algorithms of Vaex to Arrow.
> > >
> > > (originally posted at https://github.com/apache/arrow/issues/5874)
> > >
> > > In vaex I can lazily concatenate dataframes without memory copy. If I
> > > want to implement this using a pa.ChunkedArray, users cannot
> > > concatenate dataframes that have a string column with pa.string type
> > > to a dataframe that has a column with pa.large_string.
> > >
> > > In short, there is no arrow data structure to handle this 'mixed
> > > chunked array', but I was wondering if this could change. The only way
> > > out seems to cast them manually to a common type (although blocked by
> > > https://issues.apache.org/jira/browse/ARROW-6071).
> > > Internally I could solve this in vaex, but feedback from building a
> > > DataFrame library with arrow might be useful. Also, it means I cannot
> > > expose the concatenated DataFrame as an arrow table.
> > >
> > > Because of this, I am wondering if having two types 

Re: [DISCUSS][C++/Python] Bazel example

2019-11-27 Thread Micah Kornfield
>
> I don't get how this is a cycle.  It only means Bazel is too limited to
> distinguish between a header dependency and a C++ module?


Agreed, this isn't a true cycle, but Bazel is opinionated about this (i.e.
it forces workarounds). In the example I highlighted, it might have been
cleaner to take the approach of combining the two ".cc" files and ".h" files
into a single Bazel target. Within Google, there is a fairly strong
convention of one ".h" and one ".cc" per build target.


> Do you mean that long compile times are ok because we can ask
> contributors to buy 16-core monsters?


No, this was my poor attempt at humor.  I apologize if it offended you or
anyone else.  The hardware I use for my Arrow development is old enough
that I've just started accepting slow build times.

Getting back to potentially merging this, we discussed Bazel on the sync
call. One option is to not add this to the Arrow CI builds and let Google
projects that depend on the binding be responsible for keeping it working.
This has the potential for bit-rot, but might be a good compromise that lets
other developers try it out to see if they like it.

Cheers,
Micah

On Wed, Nov 27, 2019 at 6:52 AM Antoine Pitrou  wrote:

>
> Le 27/11/2019 à 06:16, Micah Kornfield a écrit :
> >
> >>  Can you give an example of circular dependency?  Can this be solved by
> >> having more "type_fwd.h" headers for forward declarations of opaque
> types?
> >
> > I think the type_fwd.h might contribute to the problem. The solution
> would
> > be more granular header/compilation units when possible (or combining
> > targets appropriately).  An example of the problem is expression.h/.cc
> and
> > operation.h/.cc in the compute library.  Because operation.cc depends on
> > expression.h and expression.cc relies on expression.h, there is a cycle
> > between the two targets.
>
> I don't get how this is a cycle.  It only means Bazel is too limited to
> distinguish between a header dependency and a C++ module?
>
> For me, a cycle would be something like "expression.h includes
> operation.h which includes expression.h" (I've actually already seen
> things like this, though not in Arrow AFAIR).
>
> > I thought computer
> > upgrades were something to look forward to ;)
>
> Do you mean that long compile times are ok because we can ask
> contributors to buy 16-core monsters?
>
> Regards
>
> Antoine.
>


Re: [Discuss][FlightRPC] Extensions to Flight: "DoBidirectional"

2019-11-27 Thread Jacques Nadeau
Fair enough. I'm okay with the bytes approach and the proposal looks good
to me.
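
As a small illustration of the bytes approach (a hedged sketch; the JSON
payload and names below are made up, not part of the proposal), an application
that wants structured metadata can serialize it into the opaque bytes field
itself, the same pattern already used for Ticket contents and FlightDescriptor
commands:

import json

# Application-defined metadata packed into the opaque app_metadata bytes field.
app_metadata = json.dumps({"resume_token": "abc123", "rows_done": 42}).encode("utf-8")

# ...and decoded again on the receiving side.
info = json.loads(app_metadata.decode("utf-8"))
print(info["rows_done"])

Keeping the field as plain bytes avoids exposing Protobuf types in the public
APIs, as argued in the quoted proposal below.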

On Fri, Nov 8, 2019 at 11:37 AM David Li  wrote:

> I've updated the proposal.
>
> On the subject of Protobuf Any vs bytes, and how to handle
> errors/metadata, I still think using bytes is preferable:
> - It doesn't require (conditionally) exposing or wrapping Protobuf types,
> - We wouldn't be able to practically expose the Protobuf field to C++
> users without causing build pains,
> - We can't let Python users take advantage of the Protobuf field
> without somehow being compatible with the Protobuf wheels (by linking
> to the same version, and doing magic to turn the C++ Protobufs into
> the Python ones),
> - All our other application-defined fields are already bytes.
>
> Applications that want structure can encode JSON or Protobuf Any into
> the bytes field themselves, much as you can already do for Ticket,
> commands in FlightDescriptors, and application metadata in
> DoGet/DoPut. I don't think this is (much) less efficient than using
> Any directly, since Any itself is a bytes field with a tag, and must
> invoke the Protobuf deserializer again to read the actual message.
>
> If we decide on using bytes, then I don't think it makes sense to
> define a new message with a oneof either, since it would be redundant.
>
> Thanks,
> David
>
> On 11/7/19, David Li  wrote:
> > I've been extremely backlogged, I will update the proposal when I get
> > a chance and reply here when done.
> >
> > Best,
> > David
> >
> > On 11/7/19, Wes McKinney  wrote:
> >> Bumping this discussion since a couple of weeks have passed. It seems
> >> there are still some questions here, could we summarize what are the
> >> alternatives along with any public API implications so we can try to
> >> render a decision?
> >>
> >> On Sat, Oct 26, 2019 at 7:19 PM David Li  wrote:
> >>>
> >>> Hi Wes,
> >>>
> >>> Responses inline:
> >>>
> >>> On Sat, Oct 26, 2019, 13:46 Wes McKinney  wrote:
> >>>
> >>> > On Mon, Oct 21, 2019 at 7:40 PM David Li 
> >>> > wrote:
> >>> > >
> >>> > > The question is whether to repurpose the existing FlightData
> >>> > > structure, and allow for the metadata field to be filled in and
> data
> >>> > > fields to be blank (as a control message), or to wrap the
> FlightData
> >>> > > structure in another structure that explicitly distinguishes
> between
> >>> > > control and data messages.
> >>> >
> >>> > I'm not super against having metadata-only FlightData with empty
> body.
> >>> > One question to consider is what changes (if any) would need to be
> >>> > made to public APIs in either scenario.
> >>> >
> >>>
> >>> We could leave DoGet/DoPut as-is for now, and allow empty data messages
> >>> in
> >>> the future. This would be a breaking change, but wouldn't change the
> >>> wire
> >>> format. I think the APIs could be changed backwards compatibly, though.
> >>>
> >>>
> >>>
> >>> > > The other question is how to handle the metadata fields. So far,
> >>> > > we've
> >>> > > used bytestring fields for application-defined data. This is
> >>> > > workable
> >>> > > if you want to use Protobuf to define the contents of those fields,
> >>> > > but requires you to pack/unpack your Protobuf into/from the
> >>> > > bytestring
> >>> > > field. If we instead used the Protobuf Any field, a dynamically
> >>> > > typed
> >>> > > field, this would be more convenient, but then we'd be exposing
> >>> > > Protobuf types. We could alternatively use a combination of a type
> >>> > > field and a bytestring field, mimicking what the Protobuf Any type
> >>> > > looks like on the wire. I'm not sure this is actually cleaner in
> any
> >>> > > of the language APIs, though.
> >>> >
> >>> > Leaving the deserialization of the app metadata to the particular
> >>> > Flight implementation seems on first principles like the most
> flexible
> >>> > thing, if Any is used, does that mean the metadata _must_ be a
> >>> > protobuf?
> >>> >
> >>>
> >>>
> >>> If Any is used, we could still expose a bytes-based API, but it would
> >>> have
> >>> some more wrapping. (We could put a ByteString in Any.) Then the
> >>> question
> >>> would just be how to expose this (would be easier in Java, harder in
> >>> C++).
> >>>
> >>>
> >>>
> >>> > > David
> >>> > >
> >>> > > On 10/21/19, Antoine Pitrou  wrote:
> >>> > > >
> >>> > > > Can one of you explain what is being proposed in non-protobuf
> >>> > > > terms?
> >>> > > > Knowledge of protobuf shouldn't be required to use Flight.
> >>> > > >
> >>> > > > Regards
> >>> > > >
> >>> > > > Antoine.
> >>> > > >
> >>> > > >
> >>> > > > Le 21/10/2019 à 15:46, David Li a écrit :
> >>> > > >> Oneof doesn't actually change the wire encoding; it would just
> be
> >>> > > >> application-level logic. (The official guide doesn't even
> mention
> >>> > > >> it
> >>> > > >> in the encoding docs; I found
> >>> > > >>
> >>> >
> https://stackoverflow.com/questions/52226409/how-protobuf-encodes-oneof-message-construct
> >>> > > >> as well.)
> >>> > > >>
> 

Apache Arrow sync now

2019-11-27 Thread Wes McKinney
https://meet.google.com/vtm-teks-phx

I'm unable to join on account of the Thanksgiving holiday, but others
are welcome to discuss and share call notes after


Re: [DISCUSS][C++/Python] Bazel example

2019-11-27 Thread Antoine Pitrou


Le 27/11/2019 à 06:16, Micah Kornfield a écrit :
> 
>>  Can you give an example of circular dependency?  Can this be solved by
>> having more "type_fwd.h" headers for forward declarations of opaque types?
> 
> I think the type_fwd.h might contribute to the problem. The solution would
> be more granular header/compilation units when possible (or combining
> targets appropriately).  An example of the problem is expression.h/.cc and
> operation.h/.cc in the compute library.  Because operation.cc depends on
> expression.h and expression.cc relies on expression.h, there is a cycle
> between the two targets.

I don't get how this is a cycle.  It only means Bazel is too limited to
distinguish between a header dependency and a C++ module?

For me, a cycle would be something like "expression.h includes
operation.h which includes expression.h" (I've actually already seen
things like this, though not in Arrow AFAIR).

> I thought computer
> upgrades were something to look forward to ;)

Do you mean that long compile times are ok because we can ask
contributors to buy 16-core monsters?

Regards

Antoine.


Re: [NIGHTLY] Arrow Build Report for Job nightly-2019-11-27-0

2019-11-27 Thread Krisztián Szűcs
The Flight compilation errors occurring in the Conda builds
are caused by a recent protobuf conda-forge update and
should be fixed by https://github.com/apache/arrow/pull/5917

On Wed, Nov 27, 2019 at 2:01 PM Crossbow  wrote:

>
> Arrow Build Report for Job nightly-2019-11-27-0
>
> All tasks:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0
>
> Failed Tasks:
> - homebrew-cpp:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0-travis-homebrew-cpp
> - test-conda-cpp:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0-circle-test-conda-cpp
> - test-conda-python-2.7-pandas-latest:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0-circle-test-conda-python-2.7-pandas-latest
> - test-conda-python-2.7:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0-circle-test-conda-python-2.7
> - test-conda-python-3.6:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0-circle-test-conda-python-3.6
> - test-conda-python-3.7-dask-latest:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0-circle-test-conda-python-3.7-dask-latest
> - test-conda-python-3.7-pandas-latest:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0-circle-test-conda-python-3.7-pandas-latest
> - test-conda-python-3.7-pandas-master:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0-circle-test-conda-python-3.7-pandas-master
> - test-conda-python-3.7-spark-master:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0-circle-test-conda-python-3.7-spark-master
> - test-conda-python-3.7-turbodbc-latest:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0-circle-test-conda-python-3.7-turbodbc-latest
> - test-conda-python-3.7-turbodbc-master:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0-circle-test-conda-python-3.7-turbodbc-master
> - test-conda-python-3.7:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0-circle-test-conda-python-3.7
> - test-conda-python-3.8-dask-master:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0-circle-test-conda-python-3.8-dask-master
> - test-conda-python-3.8-pandas-latest:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0-circle-test-conda-python-3.8-pandas-latest
> - test-debian-10-rust-nightly-2019-09-25:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0-circle-test-debian-10-rust-nightly-2019-09-25
> - wheel-manylinux1-cp27m:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0-travis-wheel-manylinux1-cp27m
> - wheel-manylinux1-cp27mu:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0-travis-wheel-manylinux1-cp27mu
> - wheel-manylinux1-cp35m:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0-travis-wheel-manylinux1-cp35m
> - wheel-manylinux1-cp36m:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0-travis-wheel-manylinux1-cp36m
> - wheel-manylinux1-cp37m:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0-travis-wheel-manylinux1-cp37m
> - wheel-manylinux2010-cp27m:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0-travis-wheel-manylinux2010-cp27m
> - wheel-manylinux2010-cp27mu:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0-travis-wheel-manylinux2010-cp27mu
> - wheel-manylinux2010-cp35m:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0-travis-wheel-manylinux2010-cp35m
> - wheel-manylinux2010-cp36m:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0-travis-wheel-manylinux2010-cp36m
> - wheel-manylinux2010-cp37m:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0-travis-wheel-manylinux2010-cp37m
>
> Succeeded Tasks:
> - centos-6:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0-azure-centos-6
> - centos-7:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0-azure-centos-7
> - centos-8:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0-azure-centos-8
> - conda-linux-gcc-py27:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0-azure-conda-linux-gcc-py27
> - conda-linux-gcc-py36:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0-azure-conda-linux-gcc-py36
> - 

[jira] [Created] (ARROW-7271) [C++][Flight] Use the single parameter version of SetTotalBytesLimit

2019-11-27 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-7271:
--

 Summary: [C++][Flight] Use the single parameter version of 
SetTotalBytesLimit
 Key: ARROW-7271
 URL: https://issues.apache.org/jira/browse/ARROW-7271
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs
 Fix For: 1.0.0


With the recent protobuf update on conda-forge, a deprecation error is triggered 
during compilation.
See the build error: https://app.circleci.com/jobs/github/ursa-labs/crossbow/5418



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[NIGHTLY] Arrow Build Report for Job nightly-2019-11-27-0

2019-11-27 Thread Crossbow


Arrow Build Report for Job nightly-2019-11-27-0

All tasks: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0

Failed Tasks:
- homebrew-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0-travis-homebrew-cpp
- test-conda-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0-circle-test-conda-cpp
- test-conda-python-2.7-pandas-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0-circle-test-conda-python-2.7-pandas-latest
- test-conda-python-2.7:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0-circle-test-conda-python-2.7
- test-conda-python-3.6:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0-circle-test-conda-python-3.6
- test-conda-python-3.7-dask-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0-circle-test-conda-python-3.7-dask-latest
- test-conda-python-3.7-pandas-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0-circle-test-conda-python-3.7-pandas-latest
- test-conda-python-3.7-pandas-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0-circle-test-conda-python-3.7-pandas-master
- test-conda-python-3.7-spark-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0-circle-test-conda-python-3.7-spark-master
- test-conda-python-3.7-turbodbc-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0-circle-test-conda-python-3.7-turbodbc-latest
- test-conda-python-3.7-turbodbc-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0-circle-test-conda-python-3.7-turbodbc-master
- test-conda-python-3.7:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0-circle-test-conda-python-3.7
- test-conda-python-3.8-dask-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0-circle-test-conda-python-3.8-dask-master
- test-conda-python-3.8-pandas-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0-circle-test-conda-python-3.8-pandas-latest
- test-debian-10-rust-nightly-2019-09-25:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0-circle-test-debian-10-rust-nightly-2019-09-25
- wheel-manylinux1-cp27m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0-travis-wheel-manylinux1-cp27m
- wheel-manylinux1-cp27mu:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0-travis-wheel-manylinux1-cp27mu
- wheel-manylinux1-cp35m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0-travis-wheel-manylinux1-cp35m
- wheel-manylinux1-cp36m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0-travis-wheel-manylinux1-cp36m
- wheel-manylinux1-cp37m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0-travis-wheel-manylinux1-cp37m
- wheel-manylinux2010-cp27m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0-travis-wheel-manylinux2010-cp27m
- wheel-manylinux2010-cp27mu:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0-travis-wheel-manylinux2010-cp27mu
- wheel-manylinux2010-cp35m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0-travis-wheel-manylinux2010-cp35m
- wheel-manylinux2010-cp36m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0-travis-wheel-manylinux2010-cp36m
- wheel-manylinux2010-cp37m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0-travis-wheel-manylinux2010-cp37m

Succeeded Tasks:
- centos-6:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0-azure-centos-6
- centos-7:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0-azure-centos-7
- centos-8:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0-azure-centos-8
- conda-linux-gcc-py27:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0-azure-conda-linux-gcc-py27
- conda-linux-gcc-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0-azure-conda-linux-gcc-py36
- conda-linux-gcc-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0-azure-conda-linux-gcc-py37
- conda-osx-clang-py27:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-27-0-azure-conda-osx-clang-py27
- conda-osx-clang-py36:
  URL: 

Re: Datasets and Java

2019-11-27 Thread Antoine Pitrou


To set up bridges between Java and C++, the C data interface
specification may help:
https://github.com/apache/arrow/pull/5442

There's an implementation for C++ here, and it also includes a Python-R
bridge able to share Arrow data between two different runtimes (i.e.
PyArrow and R-Arrow were compiled potentially using different
toolchains, with different ABIs):
https://github.com/apache/arrow/pull/5608

Regards

Antoine.



Le 27/11/2019 à 11:16, Hongze Zhang a écrit :
> Hi Micah,
> 
> 
> Regarding our use cases, we'd use the API on Parquet files with some pushed 
> filters and projectors, and we'd extend the C++ Datasets code to provide 
> necessary support for our own data formats.
> 
> 
>> If JNI is seen as too cumbersome, another possible avenue to pursue is
>> writing a gRPC wrapper around the DataSet metadata capabilities.  One could
>> then create a facade on top of that for Java.  For data reads, I can see
>> either building a Flight server or directly use the JNI readers.
> 
> 
> Thanks for your suggestion but I'm not entirely getting it. Does this mean to 
> start some individual gRPC/Flight server process to deal with the 
> metadata/data exchange problem between Java and C++ Datasets? If yes, then in 
> some cases, doesn't it easily introduce bigger problems about life cycle and 
> resource management of the processes? Please correct me if I misunderstood 
> your point.
> 
> 
> And IMHO I don't strongly object to the possible inconsistencies and bugs brought 
> by a Java port of something like the Datasets framework. Inconsistencies 
> are usually somewhat inevitable between two different languages' 
> implementations of the same component, but there is supposed to be a 
> trade-off based on whether the implementations are worth providing. I 
> didn't have a chance to fully investigate the requirements of Datasets-Java 
> from other projects so I'm not 100% sure, but functionality such as source 
> discovery, predicate pushdown, and multi-format support could be powerful for 
> many scenarios. Anyway I'm totally with you that the amount of work could be 
> huge and bugs might be introduced. So my goal is to start from a small piece of 
> the APIs to minimize the initial work. What do you think?
> 
> 
> Thanks,
> Hongze
> 
> 
> 
> At 2019-11-27 16:00:35, "Micah Kornfield"  wrote:
>> Hi Hongze,
>> I have a strong preference for not porting non-trivial logic from one
>> language to another, especially if the main goal is performance.  I think
>> this will replicate bugs and cause confusion if inconsistencies occur.  It
>> is also a non-trivial amount of work to develop, review, setup CI, etc.
>>
>> If JNI is seen as too cumbersome, another possible avenue to pursue is
>> writing a gRPC wrapper around the DataSet metadata capabilities.  One could
>> then create a facade on top of that for Java.  For data reads, I can see
>> either building a Flight server or directly use the JNI readers.
>>
>> In either case this is a non-trivial amount of work, so I at least,
>> would appreciate a short write-up (1-2 pages) explicitly stating
>> goals/use-cases for the library and a high level design (component overview
>> and relationships between components and how it will co-exist with existing
>> Java code).  If I understand correctly, one goal is to use this as a basis
>> for a new Spark DataSet API with better performance than the vectorized
>> spark parquet reader?  Are there others?
>>
>> Wes, what are your thoughts on this?
>>
>> Thanks,
>> Micah
>>
>>
>> On Tue, Nov 26, 2019 at 10:51 PM Hongze Zhang  wrote:
>>
>>> Hi Wes and Micah,
>>>
>>>
>>> Thanks for your kind reply.
>>>
>>>
>>> Micah: We don't use Spark (vectorized) parquet reader because it is a pure
>>> Java implementation. Performance could be worse than doing the similar work
>>> natively. Another reason is we may need to
>>> integrate some other specific data sources with Arrow datasets, for
>>> limiting the workload, we would like to maintain a common read pipeline for
>>> both this one and other widely used data sources like Parquet and CSV.
>>>
>>>
>>> Wes: Yes, Datasets framework along with Parquet/CSV/... reader
>>> implementations are totally native, So a JNI bridge will be needed then we
>>> don't actually read files in Java.
>>>
>>>
>>> My another concern is how many C++ datasets components should be bridged
>>> via JNI. For example,
>>> bridge the ScanTask only? Or bridge more components including Scanner,
>>> Table, even the DataSource
>>> discovery system? Or just bridge the C++ arrow Parquet, Orc readers (as
>>> Micah said, orc-jni is
>>> already there) and reimplement everything needed by datasets in Java? This
>>> might be not that easy to
>>> decide but currently based on my limited perspective I would prefer to get
>>> started from the ScanTask
>>> layer as a result we could leverage some valuable work finished in C++
>>> datasets and don't have to
>>> maintain too much tedious JNI code. The real IO process still take place
>>> 

Re: Datasets and Java

2019-11-27 Thread Hongze Zhang
Hi Micah,


Regarding our use cases, we'd use the API on Parquet files with some pushed 
filters and projectors, and we'd extend the C++ Datasets code to provide 
necessary support for our own data formats.


> If JNI is seen as too cumbersome, another possible avenue to pursue is
> writing a gRPC wrapper around the DataSet metadata capabilities.  One could
> then create a facade on top of that for Java.  For data reads, I can see
> either building a Flight server or directly use the JNI readers.


Thanks for your suggestion but I'm not entirely getting it. Does this mean starting 
some individual gRPC/Flight server process to deal with the metadata/data 
exchange problem between Java and C++ Datasets? If yes, then in some cases, 
doesn't it introduce bigger problems around the life cycle and resource 
management of those processes? Please correct me if I misunderstood your point.


And IMHO I don't strongly object to the possible inconsistencies and bugs brought by 
a Java port of something like the Datasets framework. Inconsistencies are 
usually somewhat inevitable between two different languages' implementations of 
the same component, but there is supposed to be a trade-off based on whether 
the implementations are worth providing. I didn't have a chance to fully 
investigate the requirements of Datasets-Java from other projects, so I'm not 
100% sure, but functionality such as source discovery, predicate pushdown, and 
multi-format support could be powerful for many scenarios. Anyway, I'm totally 
with you that the amount of work could be huge and bugs might be introduced. So my 
goal is to start from a small piece of the APIs to minimize the initial work. 
What do you think?


Thanks,
Hongze



At 2019-11-27 16:00:35, "Micah Kornfield"  wrote:
>Hi Hongze,
>I have a strong preference for not porting non-trivial logic from one
>language to another, especially if the main goal is performance.  I think
>this will replicate bugs and cause confusion if inconsistencies occur.  It
>is also a non-trivial amount of work to develop, review, setup CI, etc.
>
>If JNI is seen as too cumbersome, another possible avenue to pursue is
>writing a gRPC wrapper around the DataSet metadata capabilities.  One could
>then create a facade on top of that for Java.  For data reads, I can see
>either building a Flight server or directly use the JNI readers.
>
>In either case this is a non-trivial amount of work, so I at least,
>would appreciate a short write-up (1-2 pages) explicitly stating
>goals/use-cases for the library and a high level design (component overview
>and relationships between components and how it will co-exist with existing
>Java code).  If I understand correctly, one goal is to use this as a basis
>for a new Spark DataSet API with better performance than the vectorized
>spark parquet reader?  Are there others?
>
>Wes, what are your thoughts on this?
>
>Thanks,
>Micah
>
>
>On Tue, Nov 26, 2019 at 10:51 PM Hongze Zhang  wrote:
>
>> Hi Wes and Micah,
>>
>>
>> Thanks for your kind reply.
>>
>>
>> Micah: We don't use Spark (vectorized) parquet reader because it is a pure
>> Java implementation. Performance could be worse than doing the similar work
>> natively. Another reason is we may need to
>> integrate some other specific data sources with Arrow datasets, for
>> limiting the workload, we would like to maintain a common read pipeline for
>> both this one and other widely used data sources like Parquet and CSV.
>>
>>
>> Wes: Yes, Datasets framework along with Parquet/CSV/... reader
>> implementations are totally native, So a JNI bridge will be needed then we
>> don't actually read files in Java.
>>
>>
>> My another concern is how many C++ datasets components should be bridged
>> via JNI. For example,
>> bridge the ScanTask only? Or bridge more components including Scanner,
>> Table, even the DataSource
>> discovery system? Or just bridge the C++ arrow Parquet, Orc readers (as
>> Micah said, orc-jni is
>> already there) and reimplement everything needed by datasets in Java? This
>> might be not that easy to
>> decide but currently based on my limited perspective I would prefer to get
>> started from the ScanTask
>> layer as a result we could leverage some valuable work finished in C++
>> datasets and don't have to
>> maintain too much tedious JNI code. The real IO process still take place
>> inside C++ readers when we
>> do scan operation.
>>
>>
>> So Wes, Micah, is this similar to your consideration?
>>
>>
>> Thanks,
>> Hongze
>>
>> At 2019-11-27 12:39:52, "Micah Kornfield"  wrote:
>> >Hi Hongze,
>> >To add to Wes's point, there are already some efforts to do JNI for ORC
>> >(which needs to be integrated with CI) and some open PRs for Parquet in
>> the
>> >project.  However, given that you are using Spark I would expect there is
>> >already dataset functionality that is equivalent to the dataset API to do
>> >rowgroup/partition level filtering.  Can you elaborate on what problems
>> you
>> >are seeing with those 

[jira] [Created] (ARROW-7270) [Go] preserve CSV reading behaviour, improve memory usage

2019-11-27 Thread Sebastien Binet (Jira)
Sebastien Binet created ARROW-7270:
--

 Summary: [Go] preserve CSV reading behaviour, improve memory usage
 Key: ARROW-7270
 URL: https://issues.apache.org/jira/browse/ARROW-7270
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Go
Reporter: Sebastien Binet
Assignee: Sebastien Binet






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Datasets and Java

2019-11-27 Thread Micah Kornfield
Hi Hongze,
I have a strong preference for not porting non-trivial logic from one
language to another, especially if the main goal is performance.  I think
this will replicate bugs and cause confusion if inconsistencies occur.  It
is also a non-trivial amount of work to develop, review, setup CI, etc.

If JNI is seen as too cumbersome, another possible avenue to pursue is
writing a gRPC wrapper around the DataSet metadata capabilities.  One could
then create a facade on top of that for Java.  For data reads, I can see
either building a Flight server or directly using the JNI readers.

In either case this is a non-trivial amount of work, so I, at least,
would appreciate a short write-up (1-2 pages) explicitly stating
goals/use-cases for the library and a high-level design (component overview
and relationships between components and how it will co-exist with existing
Java code).  If I understand correctly, one goal is to use this as a basis
for a new Spark DataSet API with better performance than the vectorized
Spark parquet reader?  Are there others?

Wes, what are your thoughts on this?

Thanks,
Micah


On Tue, Nov 26, 2019 at 10:51 PM Hongze Zhang  wrote:

> Hi Wes and Micah,
>
>
> Thanks for your kind reply.
>
>
> Micah: We don't use the Spark (vectorized) parquet reader because it is a pure
> Java implementation; performance could be worse than doing similar work
> natively. Another reason is that we may need to integrate some other specific
> data sources with Arrow datasets; to limit the workload, we would like to
> maintain a common read pipeline for both those and other widely used data
> sources like Parquet and CSV.
>
>
> Wes: Yes, the Datasets framework along with the Parquet/CSV/... reader
> implementations are totally native, so a JNI bridge will be needed; then we
> don't actually read files in Java.
>
>
> Another concern of mine is how many C++ datasets components should be bridged
> via JNI. For example, bridge the ScanTask only? Or bridge more components
> including Scanner, Table, even the DataSource discovery system? Or just bridge
> the C++ arrow Parquet and Orc readers (as Micah said, orc-jni is already
> there) and reimplement everything needed by datasets in Java? This might not
> be that easy to decide, but currently, based on my limited perspective, I
> would prefer to get started from the ScanTask layer; as a result we could
> leverage some valuable work already finished in the C++ datasets and wouldn't
> have to maintain too much tedious JNI code. The real IO process would still
> take place inside the C++ readers when we do a scan operation.
>
>
> So Wes, Micah, is this similar to your consideration?
>
>
> Thanks,
> Hongze
>
> At 2019-11-27 12:39:52, "Micah Kornfield"  wrote:
> >Hi Hongze,
> >To add to Wes's point, there are already some efforts to do JNI for ORC
> >(which needs to be integrated with CI) and some open PRs for Parquet in
> the
> >project.  However, given that you are using Spark I would expect there is
> >already dataset functionality that is equivalent to the dataset API to do
> >rowgroup/partition level filtering.  Can you elaborate on what problems
> you
> >are seeing with those and what additional use cases you have?
> >
> >Thanks,
> >Micah
> >
> >
> >On Tue, Nov 26, 2019 at 1:10 PM Wes McKinney  wrote:
> >
> >> hi Hongze,
> >>
> >> The Datasets functionality is indeed extremely useful, and it may make
> >> sense to have it available in many languages eventually. With Java, I
> >> would raise the issue that things are comparatively weaker there when
> >> it comes to actually reading the files themselves. Whereas we have
> >> reasonably fast Arrow-based interfaces to CSV, JSON, ORC, and Parquet
> >> in C++ the same is not true in Java. Not a deal breaker but worth
> >> taking into consideration.
> >>
> >> I wonder aloud whether it might be worth investing in a JNI-based
> >> interface to the C++ libraries as one potential approach to save on
> >> development time.
> >>
> >> - Wes
> >>
> >>
> >>
> >> On Tue, Nov 26, 2019 at 5:54 AM Hongze Zhang  wrote:
> >> >
> >> > Hi all,
> >> >
> >> >
> >> > Recently the datasets API has been improved a lot and I found some of
> >> the new features are very useful to my own work. For example to me a
> >> important one is the fix of ARROW-6952[1]. And as I currently work on
> >> Java/Scala projects like Spark, I am now investigating a way to call
> some
> >> of the datasets APIs in Java so that I could gain performance
> improvement
> >> from native dataset filters/projectors. Meantime I am also interested in
> >> the ability of scanning different data sources provided by dataset API.
> >> >
> >> >
> >> > Regarding using datasets in Java, my initial idea is to port (by
> writing
> >> Java-version implementations) some of the high-level concepts in Java
> such
> >> as DataSourceDiscovery/DataSet/Scanner/FileFormat, then create and call
> >> lower level record batch iterators via JNI. This way we seem to retain
> >> performance advantages from c++ dataset code.
>