Thanks for pointing out that document, Uwe. I really like the intent and it
would be really useful to have common components for large datasets. One of
the questions we are hitting with an Iceberg python implementation is the
file system abstraction, so I think this is very relevant for all of us.

There is significant overlap between this document’s goals and Iceberg’s
goals, but there isn’t total alignment:

   - Formats: Iceberg tables can contain a mix of Avro, Parquet, and ORC
   (future)
   - Files vs table: Iceberg requires a table abstraction. The intent is
   for users to work with tables and not care about the individual files
   underneath.
   - Schema evolution: Iceberg *tables* have defined schemas and guarantee
   that the current schema can read all existing data files. This is enforced
   by a set of evolution rules and ID-based column resolution. The doc’s goal
   is a harder problem: building a schema around CSV, JSON, and other files. I
   don’t think there is a good way to make all of these formats appear to use
   the same rules.
   - Partitioning: Iceberg’s partition model is maintained in metadata and
   supports hidden partitions that are automatically derived.

It would be great to see where Iceberg can be used, but it isn’t a solution
for all of these goals and it isn’t intended to be. We have users that want
to make a directory of JSON or CSV appear like a table, and our plan is to
use a Hive table approach for those use cases.

But for situations where users want schema evolution without surprises,
want a separation between the logical data and its underlying physical
structure, or need atomic changes with concurrent readers and writers, we
plan to use Iceberg.
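As a rough illustration of the atomic-change model mentioned above (a simplified sketch of snapshot-swap commits with optimistic concurrency, not Iceberg's implementation): writers build a new snapshot of the table's file list and commit it only if the table version they started from is still current, so concurrent readers always see a complete, consistent snapshot and conflicting writers must rebase and retry.

```python
# Simplified sketch of snapshot-based atomic commits with optimistic
# concurrency (illustrative only, not Iceberg's implementation).

import threading


class Table:
    def __init__(self):
        self._lock = threading.Lock()
        self._version = 0
        self._files = ()  # immutable snapshot of data files

    def current(self):
        """Readers get an immutable (version, files) snapshot."""
        with self._lock:
            return self._version, self._files

    def commit(self, base_version, new_files):
        """Atomically swap in a new snapshot iff nothing changed underneath."""
        with self._lock:
            if self._version != base_version:
                return False  # conflict: caller must rebase and retry
            self._files = tuple(new_files)
            self._version += 1
            return True


table = Table()
version, files = table.current()
ok = table.commit(version, files + ("part-00000.parquet",))
print(ok, table.current())
# → True (1, ('part-00000.parquet',))
```

A reader holding the old `(version, files)` pair is unaffected by the commit: its snapshot is immutable, which is what makes concurrent reads and writes safe.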

On Mon, Feb 25, 2019 at 1:50 AM Uwe L. Korn <m...@uwekorn.com> wrote:

> Hello,
>
> this should definitely be shared with the Apache Iceberg community
> (cc'ed). The title of the document may be a bit confusing. What is proposed
> in there is actually constructing the building blocks in C++ that are
> required for supporting Python/C++/.. implementations for things like
> Iceberg.
>
> While there are things proposed in the document that may overlap a bit
> with Iceberg, Iceberg's main goal is to define a table format, whereas the
> things in the document are meant to support the underlying I/O capabilities
> of such a table format without specifying one.
>
> Cheers
>
> Uwe
>
> On Mon, Feb 25, 2019, at 10:20 AM, Joel Pfaff wrote:
> > Hello,
> >
> > Thanks for the write-up.
> >
> > Have you considered sharing this document with the Apache Iceberg
> community?
> >
> > My feeling is that there are some shared goals here between the two
> > projects.
> > And while their implementation is in Java, their spec is language
> agnostic.
> >
> > Regards, Joel
> >
> >
> > On Sun, Feb 24, 2019 at 6:56 PM Wes McKinney <wesmck...@gmail.com>
> wrote:
> >
> > > hi folks,
> > >
> > > We've spent a good amount of energy up until now implementing
> > > interfaces for reading different kinds of file formats in C++, like
> > > Parquet, ORC, CSV, and JSON. There are some higher-level layers
> > > missing, though, which are necessary if we want to make use of these
> > > file formats in the context of an in-memory query engine. This includes:
> > >
> > > * Scanning multiple files as a single logical dataset
> > > * Schema normalization and evolution
> > > * Handling partitioned datasets, and datasets consisting of
> > > heterogeneous storage (a mix of file formats)
> > > * Predicate pushdown: taking row filtering and column selection into
> > > account while reading a file
> > >
> > > We have implemented some parts of this already in limited form for
> > > Python users in the pyarrow.parquet module. This is problematic since
> > > a) it is implemented in Python and cannot be used by Ruby or R, for
> > > example, and b) it is specific to a single file format
> > >
> > > Since this is a large topic, I tried to write up a summary of what I
> > > believe to be the important problems that need to be solved:
> > >
> > >
> > >
> https://docs.google.com/document/d/1bVhzifD38qDypnSjtf8exvpP3sSB5x_Kw9m-n66FB2c/edit?usp=sharing
> > >
> > > This project will also allow for "user-defined" data sources, so that
> > > other people in the Arrow ecosystem can contribute new data interfaces
> > > to interact with different kinds of storage systems using a common
> > > API, so if they want to "plug in" to any computation layers available
> > > in Apache Arrow then there is a reasonably straightforward path to do
> > > that.
> > >
> > > Your comments and ideas on this project would be appreciated.
> > >
> > > Thank you,
> > > Wes
> > >
> >
>


-- 
Ryan Blue
Software Engineer
Netflix
