hi Ryan,
On Wed, Feb 27, 2019 at 1:31 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
>
> Thanks for pointing out that document, Uwe. I really like the intent and it
> would be really useful to have common components for large datasets. One of
> the questions we are hitting with an Iceberg python implementation is the
> file system abstraction, so I think this is very relevant for all of us.
>

Yes, I agree it would not make sense to duplicate functionality here
if we can avoid it.

In Apache Arrow we have tools that are both higher level (the
"dataset" API discussed here) and lower level (file systems and Arrow
interfaces to file formats) than Iceberg. It would be good to figure
out how to best work together.

> There is significant overlap between this document’s goals and Iceberg’s
> goals, but there isn’t total alignment:
>
>    - Formats: Iceberg tables can contain a mix of Avro, Parquet, and ORC
>    (future)

As stated in the document, I think our goal is to be able to read as
many different file formats and other data sources (e.g. database
drivers) as possible into Arrow record batches with a particular
schema. So we should also be able to interact with Iceberg tables and
their physical storage.
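
To make that concrete: today this kind of normalization has to be
hand-rolled in Python on top of the existing pyarrow readers. Here is
a rough sketch of the pattern the dataset API would subsume (the file
naming convention and target schema below are made up purely for
illustration):

import pyarrow as pa
import pyarrow.csv as pa_csv
import pyarrow.parquet as pq

# Made-up dataset-level schema, just for the example
TARGET_SCHEMA = pa.schema([
    ("id", pa.int64()),
    ("value", pa.float64()),
])

def read_as_batches(path):
    """Read one file and yield record batches matching TARGET_SCHEMA."""
    if path.endswith(".parquet"):
        table = pq.read_table(path)
    elif path.endswith(".csv"):
        table = pa_csv.read_csv(path)
    else:
        raise ValueError("unsupported format: " + path)
    # Cast to the dataset-level schema so every source looks the same
    # to the consumer (the real API would also need to handle column
    # reordering, missing columns, etc.)
    for batch in table.cast(TARGET_SCHEMA).to_batches():
        yield batch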

>    - Files vs table: Iceberg requires a table abstraction. The intent is
>    for users to work with tables and not care about the individual files
>    underneath.

I think this is not necessarily in conflict, so long as Arrow record
batches can be exposed to an Arrow consumer without any copying or
conversion.

>    - Schema evolution: Iceberg *tables* have defined schemas and guarantee
>    that the current schema can read all existing data files. This uses a set
>    of evolution rules and is ID based. The doc’s goal is a harder problem:
>    building a schema around CSV, JSON, and other files. I don’t think there is
>    a good way to make all of these formats appear to use the same rules.

From the Arrow dataset consumer's perspective, Iceberg serves as
another provider of record batches. As long as the dataset "fragment
provider" fulfills the contract of providing record batches matching a
particular schema (where, in the case of Iceberg, the dataset's schema
is managed by Iceberg), there is no issue from my perspective.
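
To illustrate the shape of that contract, here is a hypothetical
sketch in Python -- none of these names exist in Arrow today, they are
invented for illustration. An Iceberg-backed provider would simply be
another implementation of the same interface:

import abc
from typing import Iterator

import pyarrow as pa


class FragmentProvider(abc.ABC):
    """Hypothetical contract: yield record batches conforming to a
    single schema. For Iceberg, the schema would be the table schema
    managed by Iceberg itself."""

    @property
    @abc.abstractmethod
    def schema(self) -> pa.Schema:
        """The schema every produced batch must match."""

    @abc.abstractmethod
    def scan(self) -> Iterator[pa.RecordBatch]:
        """Produce record batches matching `schema`."""


class InMemoryProvider(FragmentProvider):
    """Trivial provider over batches already in memory, for
    illustration only."""

    def __init__(self, schema, batches):
        self._schema = schema
        self._batches = list(batches)

    @property
    def schema(self) -> pa.Schema:
        return self._schema

    def scan(self) -> Iterator[pa.RecordBatch]:
        for batch in self._batches:
            assert batch.schema.equals(self._schema)
            yield batch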

Indeed we do need to address some other problems like schema-on-read
for directories of CSV or JSON files. We aren't expecting you to deal
with that for us.

>    - Partitioning: Iceberg’s partition model is maintained in metadata and
>    supports hidden partitions that are automatically derived.

My objective would be for the partitioning APIs in the Arrow C++
libraries to be suitably general so that the bridge to Iceberg would
be "just another implementation" of the Arrow dataset API.

>
> It would be great to see where Iceberg can be used, but it isn’t a solution
> for all of these goals and it isn’t intended to be. We have users that want
> to make a directory of JSON or CSV appear like a table, and our plan is to
> use a Hive table approach for those use cases.
>
> But for situations where users want schema evolution without surprises,
> want a separation between the logical data and its underlying physical
> structure, or need atomic changes with concurrent readers and writers, we
> plan to use Iceberg.

Seems reasonable. I don't see the work described here as being in
conflict with what you are trying to achieve. It would be nice if we
could coordinate to avoid any redundant efforts on the C++/Python
side, so that the common work can also benefit other consumers of the
Arrow C++ library stack (MATLAB, R, Ruby, and probably others in the
future). Making use of common filesystem abstractions and
implementations (for AWS S3, GCS, Azure DLS) seems like an obvious
community win -- there are Arrow JIRA issues about this already.
Efficient reading of fragments stored as files in the cloud into Arrow
record batches would best be addressed under a single roof (in the
absence of any prior art to depend on -- e.g. unfortunately the
related platform code in TensorFlow isn't available to us in a very
reusable form).
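
As a rough sketch of what a shared abstraction could look like
(invented names again, and a local-filesystem implementation only --
an S3/GCS/Azure implementation would expose the same interface so the
dataset machinery never has to care where the bytes live):

import abc
import os
from typing import BinaryIO, List


class FileSystem(abc.ABC):
    """Hypothetical shared filesystem interface."""

    @abc.abstractmethod
    def list_files(self, prefix: str) -> List[str]:
        """Recursively list file paths under a prefix."""

    @abc.abstractmethod
    def open_input_stream(self, path: str) -> BinaryIO:
        """Open a readable binary stream for the given path."""


class LocalFileSystem(FileSystem):
    """Local implementation, for illustration; S3, GCS, and Azure
    variants would implement the same two methods."""

    def list_files(self, prefix: str) -> List[str]:
        paths = []
        for root, _dirs, files in os.walk(prefix):
            for name in files:
                paths.append(os.path.join(root, name))
        return paths

    def open_input_stream(self, path: str) -> BinaryIO:
        return open(path, "rb")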

I'll follow up on your other comments in the requirements document.

- Wes

>
> On Mon, Feb 25, 2019 at 1:50 AM Uwe L. Korn <m...@uwekorn.com> wrote:
>
> > Hello,
> >
> > this should definitely be shared with the Apache Iceberg community
> > (cc'ed). The title of the document may be a bit confusing. What is proposed
> > in there is actually constructing the building blocks in C++ that are
> > required for supporting Python/C++/.. implementations for things like
> > Iceberg.
> >
> > While some of the things proposed in the document may overlap a bit
> > with Iceberg, Iceberg's main goal is to define a table format, whereas
> > the pieces described in the document support the underlying I/O
> > capabilities of such a table format but don't specify a table format
> > themselves.
> >
> > Cheers
> >
> > Uwe
> >
> > On Mon, Feb 25, 2019, at 10:20 AM, Joel Pfaff wrote:
> > > Hello,
> > >
> > > Thanks for the write-up.
> > >
> > > Have you considered sharing this document with the Apache Iceberg
> > > community?
> > >
> > > My feeling is that there are some shared goals here between the two
> > > projects.
> > > And while their implementation is in Java, their spec is language
> > > agnostic.
> > >
> > > Regards, Joel
> > >
> > >
> > > On Sun, Feb 24, 2019 at 6:56 PM Wes McKinney <wesmck...@gmail.com>
> > > wrote:
> > >
> > > > hi folks,
> > > >
> > > > We've spent a good amount of energy up until now implementing
> > > > interfaces for reading different kinds of file formats in C++, like
> > > > Parquet, ORC, CSV, and JSON. There are some higher-level layers missing,
> > > > though, which are necessary if we want to make use of these file
> > > > formats in the context of an in-memory query engine. These include:
> > > >
> > > > * Scanning multiple files as a single logical dataset
> > > > * Schema normalization and evolution
> > > > * Handling partitioned datasets, and datasets consisting of
> > > > heterogeneous storage (a mix of file formats)
> > > > * Predicate pushdown: taking row filtering and column selection into
> > > > account while reading a file
> > > >
> > > > We have implemented some parts of this already in limited form for
> > > > Python users in the pyarrow.parquet module. This is problematic since
> > > > a) it is implemented in Python and cannot be used by Ruby or R, for
> > > > example, and b) it is specific to a single file format.
> > > >
> > > > Since this is a large topic, I tried to write up a summary of what I
> > > > believe to be the important problems that need to be solved:
> > > >
> > > >
> > > >
> > > > https://docs.google.com/document/d/1bVhzifD38qDypnSjtf8exvpP3sSB5x_Kw9m-n66FB2c/edit?usp=sharing
> > > >
> > > > This project will also allow for "user-defined" data sources, so that
> > > > other people in the Arrow ecosystem can contribute new data interfaces
> > > > to interact with different kinds of storage systems using a common
> > > > API. That way, if they want to "plug in" to any computation layers
> > > > available in Apache Arrow, there is a reasonably straightforward path
> > > > to do that.
> > > >
> > > > Your comments and ideas on this project would be appreciated.
> > > >
> > > > Thank you,
> > > > Wes
> > > >
> > >
> >
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
