Are you going to provide a single artifact for Spark 2.4 and 3.0? I'm not sure this is possible, as the DSv2 API is very different in 3.0: e.g. there is no `DataSourceV2` anymore, and you should implement `TableProvider` instead (if you don't have a database/table).
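For concreteness, here is a minimal sketch of what a Spark 3.0-style skeleton built on `TableProvider` plus `DataSourceRegister` might look like. The `RootDataSource`, `RootTable`, and `RootSchemaUtil` names are hypothetical placeholders, not code from any existing plugin:

```java
package com.example.root;  // hypothetical package, for illustration only

import java.util.Map;

import org.apache.spark.sql.connector.catalog.Table;
import org.apache.spark.sql.connector.catalog.TableProvider;
import org.apache.spark.sql.connector.expressions.Transform;
import org.apache.spark.sql.sources.DataSourceRegister;
import org.apache.spark.sql.types.StructType;
import org.apache.spark.sql.util.CaseInsensitiveStringMap;

/**
 * Minimal Spark 3.0-style source skeleton: TableProvider takes the place of the
 * removed DataSourceV2 marker interface, while DataSourceRegister still supplies
 * the short name used by spark.read.format("root").
 */
public class RootDataSource implements TableProvider, DataSourceRegister {

  @Override
  public String shortName() {
    return "root";
  }

  @Override
  public StructType inferSchema(CaseInsensitiveStringMap options) {
    // Derive the schema from the files referenced in the read options
    // (e.g. the "path"/"paths" entries); RootSchemaUtil is a hypothetical helper.
    return RootSchemaUtil.inferFrom(options);
  }

  @Override
  public Table getTable(StructType schema, Transform[] partitioning,
                        Map<String, String> properties) {
    // RootTable (hypothetical) would implement Table + SupportsRead and build the Scan.
    return new RootTable(schema, properties);
  }
}
```

As with any DataSourceRegister implementation, the class would still be listed in META-INF/services/org.apache.spark.sql.sources.DataSourceRegister, which is exactly the ServiceLoader registration the thread below is about.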
On Wed, Apr 8, 2020 at 6:58 AM Andrew Melo <andrew.m...@gmail.com> wrote:

> Hi Ryan,
>
> On Tue, Apr 7, 2020 at 5:21 PM Ryan Blue <rb...@netflix.com> wrote:
> >
> > Hi Andrew,
> >
> > With DataSourceV2, I recommend plugging in a catalog instead of using
> > DataSource. As you've noticed, the way that you plug in data sources isn't
> > very flexible. That's one of the reasons why we changed the plugin system
> > and made it possible to use named catalogs that load implementations based
> > on configuration properties.
> >
> > I think it's fine to consider how to patch the registration trait, but I
> > really don't recommend continuing to identify table implementations
> > directly by name.
>
> Can you be a bit more concrete about what you mean by plugging in a
> catalog instead of a DataSource? We have been using
> sc.read.format("root").load([list of paths]), which works well. Since
> we don't have a database or tables, I don't fully understand what's
> different between the two interfaces that would make us prefer one over
> the other.
>
> That being said, WRT the registration trait, if I'm not misreading
> createTable() and friends, the "source" parameter is resolved the same
> way as DataFrameReader.format(), so a solution that helps out
> registration should help both interfaces.
>
> Thanks again,
> Andrew
>
> > On Tue, Apr 7, 2020 at 12:26 PM Andrew Melo <andrew.m...@gmail.com> wrote:
> >>
> >> Hi all,
> >>
> >> I posted an improvement ticket in JIRA, and Hyukjin Kwon requested I
> >> send an email to the dev list for discussion.
> >>
> >> As the DSv2 API evolves, some breaking changes are occasionally made
> >> to the API. It's possible to split a plugin into a "common" part and
> >> multiple version-specific parts, and this works OK for shipping a single
> >> artifact to users, as long as they write out the fully qualified
> >> class name as the DataFrame format(). The one part that can't currently
> >> be worked around is the DataSourceRegister trait. Since classes
> >> that implement DataSourceRegister must also implement DataSourceV2
> >> (and its mixins), changes to those interfaces cause the ServiceLoader
> >> to fail when it attempts to load the "wrong" DataSourceV2 class.
> >> (There's also an additional prohibition against multiple
> >> implementations having the same shortName in
> >> org.apache.spark.sql.execution.datasources.DataSource.lookupDataSource.)
> >> This means users will need to update their notebooks/code/tutorials if
> >> they run at a different site whose cluster is a different version.
> >>
> >> To solve this, I proposed in SPARK-31363 a new trait which would
> >> function the same as the existing DataSourceRegister trait, but adds
> >> an additional method:
> >>
> >> public Class<? extends DataSourceV2> getImplementation();
> >>
> >> ...which would allow DSv2 plugins to dynamically choose the appropriate
> >> DataSourceV2 class based on the runtime environment. This would let us
> >> release a single artifact for different Spark versions, and users could
> >> use the same artifactId & format regardless of where they were
> >> executing their code. If no services were registered with this
> >> new trait, the functionality would remain the same as before.
> >>
> >> I think this functionality will be useful as DSv2 continues to evolve;
> >> please let me know your thoughts.
> >> > >> Thanks > >> Andrew > >> > >> --------------------------------------------------------------------- > >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > >> > > > > > > -- > > Ryan Blue > > Software Engineer > > Netflix > > --------------------------------------------------------------------- > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > >