I've been following the development of the new data source abstraction with keen interest. One issue that occurred to me as I sat down and planned how I would implement a data source is how to support manipulating partitions.
My reading of the current prototype is that the Data Source V2 APIs expose enough of the concept of a partition to communicate record-distribution particulars to Catalyst, but they do not represent partitions as a concept that the end user of a data source can manipulate. End users of data sources need to be able to add, drop, modify, and list partitions. For example, many systems require partitions to be created before records are added to them. For batch use cases it may be possible for users to manipulate partitions from within the environment that the data source interfaces to, but for streaming use cases this is not at all practical.

Two ways I can think of doing this are:

1. Allow "pass-through" commands to the underlying data source.
2. Have a generic concept of partitions exposed to the end user via the data source API and Spark SQL DML.

I'm keen on option 2 but recognise that it's possible there are better alternatives out there.
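To make option 2 concrete, here is a minimal sketch of what a generic partition-manipulation contract might look like as a mix-in that a data source implementation could opt into. This is purely illustrative: `PartitionManager`, its method names, and the in-memory implementation are my own invention, not anything in the current prototype, and a real design would need to settle how partition specs map onto Spark SQL DML.

```java
import java.util.*;

/**
 * Hypothetical mix-in a Data Source V2 implementation could expose so that
 * Spark SQL DML (e.g. ADD/DROP PARTITION statements) has something generic
 * to call into. Names and signatures are illustrative, not Spark's.
 */
interface PartitionManager {
    // A partition is identified by a spec, e.g. {date=2018-07-01, region=eu}.
    void createPartition(Map<String, String> spec);
    void dropPartition(Map<String, String> spec);
    List<Map<String, String>> listPartitions();
}

/** Toy in-memory implementation showing the intended contract. */
class InMemoryPartitionManager implements PartitionManager {
    // Insertion-ordered set of partition specs; Map.equals is content-based,
    // so lookups by an equal spec work regardless of Map implementation.
    private final Set<Map<String, String>> partitions = new LinkedHashSet<>();

    @Override
    public void createPartition(Map<String, String> spec) {
        if (!partitions.add(new LinkedHashMap<>(spec))) {
            throw new IllegalStateException("partition already exists: " + spec);
        }
    }

    @Override
    public void dropPartition(Map<String, String> spec) {
        if (!partitions.remove(spec)) {
            throw new NoSuchElementException("no such partition: " + spec);
        }
    }

    @Override
    public List<Map<String, String>> listPartitions() {
        return new ArrayList<>(partitions);
    }
}
```

The point of the mix-in shape is that sources which can't manage partitions simply don't implement it, while streaming sinks that must create partitions ahead of writes get a uniform hook Spark could drive without any per-source "pass-through" syntax.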