Hey Andy,

I think you may have solved this already, but here is how to apply an update after the fact to a prior patch. (Assume you have patches 1, 2, and 3, where patch 3 is logically part of the change in patch 1.) Do an interactive rebase (git rebase -i) and squash the third patch into the first, making sure to retain the Change-Id of the first patch in the git commit message. After the rebase you are left with only two commits again. Then repush both commits to Gerrit. Actually, it is possible to repush only commit 1 to Gerrit in this case (git checkout commit-1-hash; push to Gerrit; git checkout branch-name).
HTH,
Mike

On Fri, May 6, 2016 at 6:47 AM, Andy Grove <[email protected]> wrote:

> Mike,
>
> Thanks. I have amended the original commit to assume the table already
> exists and have mostly addressed Dan's feedback, and then created a second
> commit to add deleteTable/createTable methods. I made CreateTableOptions a
> parameter to createTable, but I expect more may be required.
>
> After doing all of this I realized that I did not address one of Dan's
> comments on the first commit, and I'm not sure how to address that with
> Gerrit, so I'll go ahead and issue a third commit to improve data type
> coverage later today.
>
> I'd like to add some tests too, so I'm working on getting the test suite
> to run at the moment.
>
> Thanks,
>
> Andy.
>
> On Thu, May 5, 2016 at 9:20 PM, Mike Percy <[email protected]> wrote:
>
> > Hey Andy, thanks for posting your code to Gerrit. It's great to see more
> > work progressing on the Spark side of things!
> >
> > Some other stuff you guys might want to consider for the create table
> > feature, in addition to the partitioning details, is the ability to
> > specify the replication factor. In fact, one might want to specify any
> > of the fields in the CreateTableRequestPB
> > <https://github.com/apache/incubator-kudu/blob/master/src/kudu/master/master.proto#L337>:
> >
> > message CreateTableRequestPB {
> >   required string name = 1;
> >   required SchemaPB schema = 2;
> >   optional RowOperationsPB split_rows = 6;
> >   optional PartitionSchemaPB partition_schema = 7;
> >   optional int32 num_replicas = 4;
> > }
> >
> > Ideally you could even specify encodings for the columns (dictionary
> > encoding, etc.) as part of the schema.
> >
> > All of this does seem orthogonal to the DataFrame writing capability,
> > which is probably more straightforward in terms of features. Consider
> > treating these as two separate patches: one that assumes the table
> > already exists, and one that adds the table creation capability.
> > Gerrit will treat two separate commits as two separate code reviews,
> > even if one depends on the other (it can track that). Doing that will
> > make it easier to review and probably faster to get the different
> > parts of it committed.
> >
> > Mike
> >
> > --
> > Mike Percy
> > Software Engineer, Cloudera
> >
> > On Thu, May 5, 2016 at 7:52 PM, Andy Grove <[email protected]> wrote:
> >
> > > Hi Dan,
> > >
> > > Thanks for the feedback. I'm thinking that the PK could be specified
> > > by passing in a tuple of column indices to use for the key (assuming
> > > that composite keys are supported?).
> > >
> > > Maybe it would be useful for KuduContext to have createTable(name:
> > > String, schema: StructType) and dropTable(name: String) methods that
> > > can eventually be wrapped in Spark SQL?
> > >
> > > I'm just getting up to speed with Kudu by trying to run TPC-H via
> > > Spark, and the use case driving the current code was being able to
> > > take existing DataFrames (created by reading CSV files) and save
> > > them to Kudu so that we can then start testing the TPC-H SQL
> > > queries.
> > >
> > > We will be experimenting with both Spark SQL and our own SQL parser /
> > > query planner for Spark, which we are planning on open sourcing too.
> > >
> > > Thanks,
> > >
> > > Andy.
> > >
> > > On Thu, May 5, 2016 at 7:16 PM, Dan Burkert <[email protected]> wrote:
> > >
> > > > Hey Andy,
> > > >
> > > > Thanks for the patch! I left you some specific feedback on the
> > > > Gerrit review, but I want to discuss the high-level approach a
> > > > bit. I think the patch as it's written now is going to have
> > > > limited use, because it doesn't allow for specifying primary keys
> > > > or partitioning, which are critical for correctness and
> > > > performance.
> > > > In the long run we will definitely want to be able to create
> > > > tables through Spark SQL, but perhaps we should start off with
> > > > just inserting/updating rows in existing tables. It would be
> > > > interesting to see how other databases have solved this problem,
> > > > since I'm sure we're not the only ones with configuration options
> > > > on table create. The relational databases in particular must have
> > > > PK options.
> > > >
> > > > - Dan
> > > >
> > > > On Thu, May 5, 2016 at 5:51 PM, Andy Grove <[email protected]> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I'm working with some colleagues at AgilData on Spark/Kudu
> > > > > integration, and we expect to be able to contribute a number of
> > > > > features to the code base.
> > > > >
> > > > > To kick things off, here is a gerrit for discussion that adds
> > > > > support for persisting a DataFrame to a Kudu table. It would be
> > > > > great to hear feedback and feature requests for this capability.
> > > > >
> > > > > http://gerrit.cloudera.org:8080/#/c/2969/
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Andy.
