Are you going to provide a single artifact for Spark 2.4 and 3.0? I'm not sure this is possible, as the DSv2 API is very different in 3.0: e.g. there is no `DataSourceV2` anymore, and you should implement `TableProvider` instead (if you don't have a database/table).
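For concreteness, here is a minimal sketch of what a Spark 3.0-style skeleton built on `TableProvider` plus `DataSourceRegister` might look like. The `RootDataSource`, `RootTable`, and `RootSchemaUtil` names are hypothetical placeholders, not code from any existing plugin:

```java
package com.example.root;  // hypothetical package, for illustration only

import java.util.Map;

import org.apache.spark.sql.connector.catalog.Table;
import org.apache.spark.sql.connector.catalog.TableProvider;
import org.apache.spark.sql.connector.expressions.Transform;
import org.apache.spark.sql.sources.DataSourceRegister;
import org.apache.spark.sql.types.StructType;
import org.apache.spark.sql.util.CaseInsensitiveStringMap;

/**
 * Minimal Spark 3.0-style source skeleton: TableProvider takes the place of the
 * removed DataSourceV2 marker interface, while DataSourceRegister still supplies
 * the short name used by spark.read.format("root").
 */
public class RootDataSource implements TableProvider, DataSourceRegister {

  @Override
  public String shortName() {
    return "root";
  }

  @Override
  public StructType inferSchema(CaseInsensitiveStringMap options) {
    // Derive the schema from the files referenced in the read options
    // (e.g. the "path"/"paths" entries); RootSchemaUtil is a hypothetical helper.
    return RootSchemaUtil.inferFrom(options);
  }

  @Override
  public Table getTable(StructType schema, Transform[] partitioning,
                        Map<String, String> properties) {
    // RootTable (hypothetical) would implement Table + SupportsRead and build the Scan.
    return new RootTable(schema, properties);
  }
}
```

As with any DataSourceRegister implementation, the class would still be listed in META-INF/services/org.apache.spark.sql.sources.DataSourceRegister, which is exactly the ServiceLoader registration the thread below is about.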
On Wed, Apr 8, 2020 at 6:58 AM Andrew Melo <andrew.m...@gmail.com> wrote:

> Hi Ryan,
>
> On Tue, Apr 7, 2020 at 5:21 PM Ryan Blue <rb...@netflix.com> wrote:
> >
> > Hi Andrew,
> >
> > With DataSourceV2, I recommend plugging in a catalog instead of using
> > DataSource. As you've noticed, the way that you plug in data sources isn't
> > very flexible. That's one of the reasons why we changed the plugin system
> > and made it possible to use named catalogs that load implementations based
> > on configuration properties.
> >
> > I think it's fine to consider how to patch the registration trait, but I
> > really don't recommend continuing to identify table implementations
> > directly by name.
>
> Can you be a bit more concrete about what you mean by plugging in a
> catalog instead of a DataSource? We have been using
> sc.read.format("root").load([list of paths]), which works well. Since
> we don't have a database or tables, I don't fully understand what's
> different between the two interfaces that would make us prefer one over
> the other.
>
> That being said, WRT the registration trait, if I'm not misreading
> createTable() and friends, the "source" parameter is resolved the same
> way as DataFrameReader.format(), so a solution that helps out
> registration should help both interfaces.
>
> Thanks again,
> Andrew
>
> > On Tue, Apr 7, 2020 at 12:26 PM Andrew Melo <andrew.m...@gmail.com> wrote:
> >>
> >> Hi all,
> >>
> >> I posted an improvement ticket in JIRA, and Hyukjin Kwon requested I
> >> send an email to the dev list for discussion.
> >>
> >> As the DSv2 API evolves, some breaking changes are occasionally made
> >> to the API. It's possible to split a plugin into a "common" part and
> >> multiple version-specific parts, and this works OK for shipping a single
> >> artifact to users, as long as they write out the fully qualified
> >> class name as the DataFrame format(). The one part that can't currently
> >> be worked around is the DataSourceRegister trait. Since classes
> >> that implement DataSourceRegister must also implement DataSourceV2
> >> (and its mixins), changes to those interfaces cause the ServiceLoader
> >> to fail when it attempts to load the "wrong" DataSourceV2 class.
> >> (There's also an additional prohibition against multiple
> >> implementations having the same shortName in
> >> org.apache.spark.sql.execution.datasources.DataSource.lookupDataSource.)
> >> This means users will need to update their notebooks/code/tutorials if
> >> they run at a different site whose cluster is a different version.
> >>
> >> To solve this, I proposed in SPARK-31363 a new trait which would
> >> function the same as the existing DataSourceRegister trait, but adds
> >> an additional method:
> >>
> >> public Class<? extends DataSourceV2> getImplementation();
> >>
> >> ...which would allow DSv2 plugins to dynamically choose the appropriate
> >> DataSourceV2 class based on the runtime environment. This would let us
> >> release a single artifact for different Spark versions, and users could
> >> use the same artifactId & format regardless of where they were
> >> executing their code. If no services were registered with this
> >> new trait, the functionality would remain the same as before.
> >>
> >> I think this functionality will be useful as DSv2 continues to evolve;
> >> please let me know your thoughts.
> >> > >> Thanks > >> Andrew > >> > >> --------------------------------------------------------------------- > >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > >> > > > > > > -- > > Ryan Blue > > Software Engineer > > Netflix > > --------------------------------------------------------------------- > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > >