I think a partition management feature would be very useful for RDBMSes that 
support partitioning, e.g. Oracle, PostgreSQL, and DB2.
In some cases, adding partitions is an explicit step that can be done outside of 
data loads.
In other cases, it may need to happen implicitly as part of the load, when the 
platform supports that.
This is similar to static/dynamic partition loading in Hive and Oracle.
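
To make the explicit/implicit distinction concrete, here is a rough Java sketch. 
It is not taken from Hive, Oracle, or Spark: PartitionedTable, addPartition, 
loadStatic, and loadDynamic are hypothetical names, and requiring the partition 
to exist before a static load is a simplifying assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of a partitioned table; not an API of Spark, Hive, or any RDBMS.
final class PartitionedTable {
    // Partition value -> rows stored in that partition.
    private final Map<String, List<String[]>> partitions = new TreeMap<>();

    // Explicit partition management, performed outside of any data load.
    void addPartition(String value) {
        partitions.putIfAbsent(value, new ArrayList<>());
    }

    // Static load: the caller names the target partition.
    // Assumption of this sketch: the partition must have been added explicitly.
    void loadStatic(String value, List<String[]> rows) {
        List<String[]> target = partitions.get(value);
        if (target == null) {
            throw new IllegalStateException("Partition does not exist: " + value);
        }
        target.addAll(rows);
    }

    // Dynamic load: the partition is derived from one column of each row
    // and is created implicitly the first time its value is seen.
    void loadDynamic(int partitionColumn, List<String[]> rows) {
        for (String[] row : rows) {
            partitions.computeIfAbsent(row[partitionColumn], k -> new ArrayList<>()).add(row);
        }
    }

    List<String> partitionValues() {
        return new ArrayList<>(partitions.keySet());
    }

    int rowCount(String value) {
        List<String[]> rows = partitions.get(value);
        return rows == null ? 0 : rows.size();
    }
}
```

The point is only that both paths end up managing partitions: one as a 
separate step, the other as a side effect of the write.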

So in short, I agree that partition management should be an optional interface.

From: Ryan Blue <rb...@netflix.com>
Reply-To: "rb...@netflix.com" <rb...@netflix.com>
Date: Wednesday, September 19, 2018 at 2:58 PM
To: "Thakrar, Jayesh" <jthak...@conversantmedia.com>
Cc: "tigerqu...@outlook.com" <tigerqu...@outlook.com>, Spark Dev List 
<dev@spark.apache.org>
Subject: Re: [Discuss] Datasource v2 support for manipulating partitions

I'm open to exploring the idea of adding partition management as a catalog API. 
The approach we're taking is to have an interface for each concern a catalog 
might implement, like TableCatalog (proposed in SPARK-24252), but also 
FunctionCatalog for stored functions and possibly PartitionedTableCatalog for 
explicitly partitioned tables.
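
As a rough illustration of the one-interface-per-concern idea, here is a 
minimal Java sketch. Only the names TableCatalog and PartitionedTableCatalog 
come from this thread. Every method signature, the two example catalogs, and 
the AlterTableCommands helper are assumptions made for the example, not the 
API proposed in SPARK-24252.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical signatures: only the interface names appear in this thread.
interface TableCatalog {
    boolean tableExists(String table);
}

// Optional capability that a catalog may or may not implement.
interface PartitionedTableCatalog {
    void addPartition(String table, Map<String, String> spec);
    boolean dropPartition(String table, Map<String, String> spec);
    List<Map<String, String>> listPartitions(String table);
}

// A catalog that exposes explicit partitions.
final class InMemoryPartitionedCatalog implements TableCatalog, PartitionedTableCatalog {
    private final Map<String, List<Map<String, String>>> partitions = new HashMap<>();

    void createTable(String table) {
        partitions.putIfAbsent(table, new ArrayList<>());
    }

    @Override
    public boolean tableExists(String table) {
        return partitions.containsKey(table);
    }

    @Override
    public void addPartition(String table, Map<String, String> spec) {
        List<Map<String, String>> specs = requireTable(table);
        if (!specs.contains(spec)) {
            specs.add(new HashMap<>(spec)); // defensive copy of the caller's map
        }
    }

    @Override
    public boolean dropPartition(String table, Map<String, String> spec) {
        return requireTable(table).remove(spec);
    }

    @Override
    public List<Map<String, String>> listPartitions(String table) {
        return new ArrayList<>(requireTable(table));
    }

    private List<Map<String, String>> requireTable(String table) {
        List<Map<String, String>> specs = partitions.get(table);
        if (specs == null) {
            throw new IllegalArgumentException("No such table: " + table);
        }
        return specs;
    }
}

// A catalog that hides partitioning and so implements only TableCatalog.
final class SimpleCatalog implements TableCatalog {
    @Override
    public boolean tableExists(String table) {
        return true;
    }
}

// How ALTER TABLE ... ADD PARTITION might be routed: only catalogs that
// opt in to the partition interface accept the command.
final class AlterTableCommands {
    static void addPartition(TableCatalog catalog, String table, Map<String, String> spec) {
        if (!(catalog instanceof PartitionedTableCatalog)) {
            throw new UnsupportedOperationException("Catalog does not expose partitions");
        }
        ((PartitionedTableCatalog) catalog).addPartition(table, spec);
    }
}
```

With this shape, sources that hide partitioning never see partition commands, 
and sources that need them opt in by implementing the extra interface.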

That could definitely be used to implement ALTER TABLE ADD/DROP PARTITION for 
Hive tables, although I'm not sure that we would want to continue exposing 
partitions for simple tables. I know that this is important for storage systems 
like Kudu, but I think it is needlessly difficult and annoying for simple 
tables that are partitioned by a regular transformation like Hive tables. 
That's why Iceberg hides partitioning outside of table configuration. That also 
avoids problems where SELECT DISTINCT queries are wrong because a partition 
exists but has no data.
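
One way such a wrong answer can arise, assuming an engine answers the query 
from partition metadata instead of scanning rows, is sketched below in Java. 
The DistinctSketch class and its map-based table model are hypothetical and 
only illustrate the mismatch.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model: partition value -> rows stored in that partition.
final class DistinctSketch {
    // Metadata-only answer: every registered partition value is reported,
    // including partitions that hold no rows.
    static Set<String> distinctFromMetadata(Map<String, List<String>> partitions) {
        return new TreeSet<>(partitions.keySet());
    }

    // Data-based answer: a value is reported only if at least one row carries it.
    static Set<String> distinctFromData(Map<String, List<String>> partitions) {
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, List<String>> entry : partitions.entrySet()) {
            if (!entry.getValue().isEmpty()) {
                result.add(entry.getKey());
            }
        }
        return result;
    }
}
```

If a partition has been added but never written to, the two methods disagree, 
and only the data-based result matches what the rows actually contain.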

How useful is this outside of Kudu? Is it something that we should provide an 
API for, or is it specific enough to Kudu that Spark shouldn't include it in 
the API for all sources?

rb


On Tue, Sep 18, 2018 at 7:38 AM Thakrar, Jayesh 
<jthak...@conversantmedia.com> wrote:
Totally agree with you, Dale: there are situations where, for efficiency, 
performance, and better control/visibility/manageability, we need to expose 
partition management.

So, as described, I suggested two things. For now, the ability to do it in the 
current V2 API via options and an appropriate implementation in the datasource 
reader/writer.

And for the long term, I suggested that partition management could be made part 
of metadata/catalog management, possibly under SPARK-24252 (DataSourceV2: Add 
catalog support).


On 9/17/18, 8:26 PM, "tigerquoll" 
<tigerqu...@outlook.com> wrote:

    Hi Jayesh,
    I get where you are coming from - partitions are just an implementation
    optimisation that we really shouldn’t be bothering the end user with.
    Unfortunately, that view is like saying that RPC is just a procedure call and
    that the details of the network transport should be hidden from the end user.
    CORBA tried this approach for RPC and failed, for the same reason that no
    major DBMS vendor that supports partitions tries to hide them from the end
    user: partitions have a substantial real-world effect that is impossible to
    hide from the user (in particular when writing/modifying the data source). Any
    attempt to “take care” of partitions automatically invariably guesses wrong
    and ends up frustrating the end user (as “substantial real-world effect”
    turns into “show-stopping performance penalty” if the user attempts to fight
    against a partitioning scheme she has no idea exists).

    So if we are not hiding them from the user, we need to allow users to
    manipulate them, either by representing them generically in the API, by
    allowing pass-through commands to manipulate them, or by some other means.

    Regards,
    Dale.




    --
    Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/




--
Ryan Blue
Software Engineer
Netflix
