Mike,

Thanks. I have amended the original commit to assume the table already exists, mostly addressing Dan's feedback, and created a second commit that adds deleteTable/createTable methods. I made CreateTableOptions a parameter to createTable, but I expect more options will be required.
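For context, here's roughly what the new KuduContext methods look like (a sketch from memory, using the current upstream package names; the exact signatures are in the Gerrit review and may still change):

```scala
import org.apache.kudu.Schema
import org.apache.kudu.client.{CreateTableOptions, KuduClient}

// Rough shape of the KuduContext additions. The class/method names here
// mirror the patch, but the bodies are illustrative only.
class KuduContext(kuduMaster: String) {
  lazy val syncClient: KuduClient =
    new KuduClient.KuduClientBuilder(kuduMaster).build()

  // Delegates to the Java client; CreateTableOptions carries partitioning,
  // replication, etc., which is why it is a separate parameter.
  def createTable(name: String, schema: Schema,
                  options: CreateTableOptions): Unit =
    syncClient.createTable(name, schema, options)

  def deleteTable(name: String): Unit =
    syncClient.deleteTable(name)
}
```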
After doing all of this, I realized that I did not address one of Dan's comments on the first commit, and I'm not sure how to do that in Gerrit, so I'll go ahead and issue a third commit later today to improve data type coverage. I'd like to add some tests too, so I'm working on getting the test suite to run at the moment.

Thanks,

Andy.

On Thu, May 5, 2016 at 9:20 PM, Mike Percy <[email protected]> wrote:

> Hey Andy, thanks for posting your code to Gerrit. It's great to see more
> work progressing on the Spark side of things!
>
> Some other stuff you guys might want to consider for the create table
> feature, in addition to the partitioning details, is the ability to
> specify the replication factor. In fact, one might want to specify any of
> the fields in the CreateTableRequestPB
> <https://github.com/apache/incubator-kudu/blob/master/src/kudu/master/master.proto#L337>:
>
> message CreateTableRequestPB {
>   required string name = 1;
>   required SchemaPB schema = 2;
>   optional RowOperationsPB split_rows = 6;
>   optional PartitionSchemaPB partition_schema = 7;
>   optional int32 num_replicas = 4;
> }
>
> Ideally you could even specify encodings for the columns (dictionary
> encoding, etc.) as part of the schema.
>
> All of this does seem orthogonal to the DataFrame writing capability,
> which is probably more straightforward in terms of features. Consider
> treating these as two separate patches: one that assumes the table
> already exists, and one that adds the table creation capability. Gerrit
> will treat two separate commits as two separate code reviews, even if one
> depends on the other (it can track that). Doing that will make it easier
> to review and probably faster to get the different parts committed.
>
> Mike
>
> --
> Mike Percy
> Software Engineer, Cloudera
>
>
> On Thu, May 5, 2016 at 7:52 PM, Andy Grove <[email protected]> wrote:
>
> > Hi Dan,
> >
> > Thanks for the feedback.
> > I'm thinking that the PK could be specified by passing in a tuple of
> > column indices to use for the key (assuming that composite keys are
> > supported?).
> >
> > Maybe it would be useful for KuduContext to have createTable(name:
> > String, schema: StructType) and dropTable(name: String) methods that
> > could eventually be wrapped in Spark SQL?
> >
> > I'm just getting up to speed with Kudu by trying to run TPC-H via
> > Spark, and the use case driving the current code was being able to take
> > existing DataFrames (created by reading CSV files) and save them to
> > Kudu so that we can then start testing the TPC-H SQL queries.
> >
> > We will be experimenting with both Spark SQL and our own SQL parser /
> > query planner for Spark, which we are planning on open sourcing too.
> >
> > Thanks,
> >
> > Andy.
> >
> > On Thu, May 5, 2016 at 7:16 PM, Dan Burkert <[email protected]> wrote:
> >
> > > Hey Andy,
> > >
> > > Thanks for the patch! I left you some specific feedback on the Gerrit
> > > review, but I want to discuss the high-level approach a bit. I think
> > > the patch as it's written now will have limited use, because it
> > > doesn't allow for specifying primary keys or partitioning, which are
> > > critical for correctness and performance. In the long run we will
> > > definitely want to be able to create tables through Spark SQL, but
> > > perhaps we should start off with just inserting/updating rows in
> > > existing tables. It would be interesting to see how other databases
> > > have solved this problem, since I'm sure we're not the only ones with
> > > configuration options on table create. The relational databases in
> > > particular must have PK options.
> > >
> > > - Dan
> > >
> > > On Thu, May 5, 2016 at 5:51 PM, Andy Grove <[email protected]> wrote:
> > >
> > > > Hi,
> > > >
> > > > I'm working with some colleagues at AgilData on Spark/Kudu
> > > > integration, and we expect to be able to contribute a number of
> > > > features to the code base.
> > > >
> > > > To kick things off, here is a gerrit for discussion that adds
> > > > support for persisting a DataFrame to a Kudu table. It would be
> > > > great to hear feedback and feature requests for this capability.
> > > >
> > > > http://gerrit.cloudera.org:8080/#/c/2969/
> > > >
> > > > Thanks,
> > > >
> > > > Andy.
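P.S. For the record, the options Mike mentions map fairly directly onto the Java client's CreateTableOptions. Something like this (illustrative only; column names are made up, and the exact builder methods and package names may differ by client version):

```scala
import scala.collection.JavaConverters._
import org.apache.kudu.{ColumnSchema, Schema, Type}
import org.apache.kudu.ColumnSchema.{ColumnSchemaBuilder, Encoding}
import org.apache.kudu.client.CreateTableOptions

// Build a schema with an explicit primary key and a dictionary-encoded
// string column, then request a replication factor and hash partitioning.
val columns = List(
  new ColumnSchemaBuilder("id", Type.INT64).key(true).build(),
  new ColumnSchemaBuilder("name", Type.STRING)
    .encoding(Encoding.DICT_ENCODING) // per-column encoding
    .build()
).asJava

val schema = new Schema(columns)

val options = new CreateTableOptions()
  .setNumReplicas(3)                       // maps to num_replicas
  .addHashPartitions(List("id").asJava, 4) // maps to partition_schema
```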
