What does partition management look like in those systems and what are the options we would standardize in an API?
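
To make the question concrete, here is a rough sketch of what an optional partition interface could look like in the interface-per-concern style described in the quoted thread. It is only an illustration under stated assumptions: the method names (`addPartition`, `dropPartition`, `listPartitions`), the string-map partition spec, and the two toy catalog classes are all hypothetical, and the `TableCatalog` here is a one-method stand-in, not the interface proposed in SPARK-24252.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only: none of these types are an existing Spark API.
public class PartitionCatalogSketch {

    /** Stand-in for the base catalog concern (one method, for illustration). */
    public interface TableCatalog {
        boolean tableExists(String table);
    }

    /** Optional concern: implemented only by sources with explicit partitions. */
    public interface PartitionedTableCatalog extends TableCatalog {
        void addPartition(String table, Map<String, String> spec);
        boolean dropPartition(String table, Map<String, String> spec);
        List<Map<String, String>> listPartitions(String table);
    }

    /** Toy in-memory catalog standing in for a source that exposes partitions. */
    public static final class InMemoryPartitionedCatalog implements PartitionedTableCatalog {
        private final Map<String, List<Map<String, String>>> tables = new LinkedHashMap<>();

        public void createTable(String table) {
            tables.putIfAbsent(table, new ArrayList<>());
        }

        @Override
        public boolean tableExists(String table) {
            return tables.containsKey(table);
        }

        @Override
        public void addPartition(String table, Map<String, String> spec) {
            List<Map<String, String>> parts = partitionsOf(table);
            if (!parts.contains(spec)) {
                parts.add(new LinkedHashMap<>(spec)); // defensive copy of the caller's spec
            }
        }

        @Override
        public boolean dropPartition(String table, Map<String, String> spec) {
            return partitionsOf(table).remove(spec);
        }

        @Override
        public List<Map<String, String>> listPartitions(String table) {
            return Collections.unmodifiableList(new ArrayList<>(partitionsOf(table)));
        }

        private List<Map<String, String>> partitionsOf(String table) {
            List<Map<String, String>> parts = tables.get(table);
            if (parts == null) {
                throw new IllegalArgumentException("No such table: " + table);
            }
            return parts;
        }
    }

    /** Toy catalog for a source that hides partitioning from the user. */
    public static final class SimpleCatalog implements TableCatalog {
        @Override
        public boolean tableExists(String table) {
            return true;
        }
    }

    /** How an engine might handle ADD PARTITION: use the capability only if present. */
    public static boolean tryAddPartition(
            TableCatalog catalog, String table, Map<String, String> spec) {
        if (catalog instanceof PartitionedTableCatalog) {
            ((PartitionedTableCatalog) catalog).addPartition(table, spec);
            return true;
        }
        return false; // the source does not expose partitions
    }

    public static void main(String[] args) {
        InMemoryPartitionedCatalog explicit = new InMemoryPartitionedCatalog();
        explicit.createTable("events");
        Map<String, String> spec = Collections.singletonMap("day", "2018-09-19");

        System.out.println(tryAddPartition(explicit, "events", spec));
        System.out.println(explicit.listPartitions("events"));
        System.out.println(tryAddPartition(new SimpleCatalog(), "events", spec));
    }
}
```

The open part is exactly the signature of those three methods: a string-map spec fits Hive-style value partitions, but range or hash partitions as in Kudu or an RDBMS would likely need a richer description, which is why the per-system options matter.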

On Wed, Sep 19, 2018 at 2:16 PM Thakrar, Jayesh <jthak...@conversantmedia.com> wrote:

> I think partition management feature would be very useful in RDBMSes that
> support it – e.g. Oracle, PostgreSQL, and DB2.
>
> In some cases add partitions can be explicit and can/may be done outside
> of data loads.
>
> But in some other cases, it may/can need to be done implicitly when
> supported by the platform.
>
> Similar to the static/dynamic partition loading in Hive and Oracle.
>
> So in short, I agree that partition management should be an optional
> interface.
>
> *From: *Ryan Blue <rb...@netflix.com>
> *Reply-To: *"rb...@netflix.com" <rb...@netflix.com>
> *Date: *Wednesday, September 19, 2018 at 2:58 PM
> *To: *"Thakrar, Jayesh" <jthak...@conversantmedia.com>
> *Cc: *"tigerqu...@outlook.com" <tigerqu...@outlook.com>, Spark Dev List <dev@spark.apache.org>
> *Subject: *Re: [Discuss] Datasource v2 support for manipulating partitions
>
> I'm open to exploring the idea of adding partition management as a catalog
> API. The approach we're taking is to have an interface for each concern a
> catalog might implement, like TableCatalog (proposed in SPARK-24252), but
> also FunctionCatalog for stored functions and possibly
> PartitionedTableCatalog for explicitly partitioned tables.
>
> That could definitely be used to implement ALTER TABLE ADD/DROP PARTITION
> for Hive tables, although I'm not sure that we would want to continue
> exposing partitions for simple tables. I know that this is important for
> storage systems like Kudu, but I think it is needlessly difficult and
> annoying for simple tables that are partitioned by a regular transformation
> like Hive tables. That's why Iceberg hides partitioning outside of table
> configuration. That also avoids problems where SELECT DISTINCT queries are
> wrong because a partition exists but has no data.
>
> How useful is this outside of Kudu? Is it something that we should provide
> an API for, or is it specific enough to Kudu that Spark shouldn't include
> it in the API for all sources?
>
> rb
>
> On Tue, Sep 18, 2018 at 7:38 AM Thakrar, Jayesh <jthak...@conversantmedia.com> wrote:
>
> Totally agree with you Dale, that there are situations for efficiency,
> performance and better control/visibility/manageability that we need to
> expose partition management.
>
> So as described, I suggested two things - the ability to do it in the
> current V2 API form via options and appropriate implementation in
> datasource reader/writer.
>
> And for long term, suggested that partition management can be made part of
> metadata/catalog management - SPARK-24252 (DataSourceV2: Add catalog
> support)?
>
> On 9/17/18, 8:26 PM, "tigerquoll" <tigerqu...@outlook.com> wrote:
>
>     Hi Jayesh,
>
>     I get where you are coming from - partitions are just an implementation
>     optimisation that we really shouldn’t be bothering the end user with.
>     Unfortunately that view is like saying RPC is like a procedure call, and
>     details of the network transport should be hidden from the end user.
>     CORBA tried this approach for RPC and failed for the same reason that no
>     major vendor of DBMS systems that support partitions try to hide them
>     from the end user. They have a substantial real world effect that is
>     impossible to hide from the user (in particular when writing/modifying
>     the data source). Any attempt to “take care” of partitions automatically
>     invariably guesses wrong and ends up frustrating the end user (as
>     “substantial real world effect” turns to “show stopping performance
>     penalty” if the user attempts to fight against a partitioning scheme she
>     has no idea exists)
>
>     So if we are not hiding them from the user, we need to allow users to
>     manipulate them. Either by representing them generically in the API,
>     allowing pass-through commands to manipulate them, or by some other
>     means.
>
>     Regards,
>     Dale.
>
>     --
>     Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> --
> Ryan Blue
> Software Engineer
> Netflix

--
Ryan Blue
Software Engineer
Netflix