It would be good to support your use case, but I'm not sure how to accomplish that. Can you open a PR so that we can discuss it in detail? How can `public Class<? extends DataSourceV2> getImplementation();` work in 3.0, given that there is no `DataSourceV2` anymore?
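For concreteness, one way a registration trait could avoid naming a class that no longer exists in 3.0 would be to loosen the return type to a plain `Class<?>`. The interface below is only a sketch with a hypothetical name -- it is not an existing Spark API, and not necessarily what SPARK-31363 proposes:

    // Hypothetical sketch only -- not an existing Spark interface.
    // Because org.apache.spark.sql.sources.v2.DataSourceV2 was removed in
    // 3.0, a registration trait that must load on both 2.4 and 3.0 cannot
    // mention it in a signature; returning a raw Class<?> is one
    // version-neutral alternative.
    public interface VersionAwareDataSourceRegister {

      /** Short name users pass to DataFrameReader.format(), e.g. "root". */
      String shortName();

      /**
       * The implementation class appropriate for the running Spark version:
       * a DataSourceV2 implementation on 2.4, a TableProvider on 3.0.
       */
      Class<?> getImplementation();
    }

Spark would then only need to check that the returned class implements whatever interface the running version actually expects (DataSourceV2 on 2.4, TableProvider on 3.0).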
On Wed, Apr 8, 2020 at 1:12 PM Andrew Melo <andrew.m...@gmail.com> wrote:

> Hello
>
> On Tue, Apr 7, 2020 at 23:16 Wenchen Fan <cloud0...@gmail.com> wrote:
>
>> Are you going to provide a single artifact for Spark 2.4 and 3.0? I'm not sure this is possible, as the DS V2 API is very different in 3.0, e.g. there is no `DataSourceV2` anymore, and you should implement `TableProvider` (if you don't have a database/table).
>
> Correct, I've got a single jar for both Spark 2.4 and 3.0, with a top-level Root_v24 (implements DataSourceV2) and Root_v30 (implements TableProvider). I can load this jar in both pyspark 2.4 and 3.0 and it works well -- as long as I remove the registration from META-INF and pass in the full class name to the DataFrameReader.
>
> Thanks
> Andrew
>
>> On Wed, Apr 8, 2020 at 6:58 AM Andrew Melo <andrew.m...@gmail.com> wrote:
>>
>>> Hi Ryan,
>>>
>>> On Tue, Apr 7, 2020 at 5:21 PM Ryan Blue <rb...@netflix.com> wrote:
>>> >
>>> > Hi Andrew,
>>> >
>>> > With DataSourceV2, I recommend plugging in a catalog instead of using DataSource. As you've noticed, the way that you plug in data sources isn't very flexible. That's one of the reasons why we changed the plugin system and made it possible to use named catalogs that load implementations based on configuration properties.
>>> >
>>> > I think it's fine to consider how to patch the registration trait, but I really don't recommend continuing to identify table implementations directly by name.
>>>
>>> Can you be a bit more concrete about what you mean by plugging in a catalog instead of a DataSource? We have been using sc.read.format("root").load([list of paths]), which works well. Since we don't have a database or tables, I don't fully understand what's different between the two interfaces that would make us prefer one over the other.
>>>
>>> That being said, WRT the registration trait, if I'm not misreading createTable() and friends, the "source" parameter is resolved the same way as DataFrameReader.format(), so a solution that helps registration should help both interfaces.
>>>
>>> Thanks again,
>>> Andrew
>>>
>>> > On Tue, Apr 7, 2020 at 12:26 PM Andrew Melo <andrew.m...@gmail.com> wrote:
>>> >>
>>> >> Hi all,
>>> >>
>>> >> I posted an improvement ticket in JIRA, and Hyukjin Kwon requested I send an email to the dev list for discussion.
>>> >>
>>> >> As the DSv2 API evolves, some breaking changes are occasionally made to the API. It's possible to split a plugin into a "common" part and multiple version-specific parts, and this works OK for shipping a single artifact to users, as long as they write out the fully qualified classname as the DataFrame format(). The one part that can't currently be worked around is the DataSourceRegister trait. Since classes which implement DataSourceRegister must also implement DataSourceV2 (and its mixins), changes to those interfaces cause the ServiceLoader to fail when it attempts to load the "wrong" DataSourceV2 class. (There's also a prohibition against multiple implementations having the same ShortName in org.apache.spark.sql.execution.datasources.DataSource.lookupDataSource.) This means users will need to update their notebooks/code/tutorials if they run at a different site whose cluster is a different version.
>>> >>
>>> >> To solve this, I proposed in SPARK-31363 a new trait that would function the same as the existing DataSourceRegister trait, but add an additional method:
>>> >>
>>> >> public Class<? extends DataSourceV2> getImplementation();
>>> >>
>>> >> ...which would allow DSv2 plugins to dynamically choose the appropriate DataSourceV2 class based on the runtime environment. This would let us release a single artifact for different Spark versions, and users could use the same artifactID & format regardless of where they were executing their code. If no services were registered with this new trait, the functionality would remain the same as before.
>>> >>
>>> >> I think this functionality will be useful as DSv2 continues to evolve; please let me know your thoughts.
>>> >>
>>> >> Thanks
>>> >> Andrew
>>> >
>>> > --
>>> > Ryan Blue
>>> > Software Engineer
>>> > Netflix
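To make the proposal concrete, here is a rough sketch of how a plugin like the Root reader described above might implement such a method, assuming the hypothetical `VersionAwareDataSourceRegister` interface sketched earlier; the `example.root` package and error handling are purely illustrative:

    // Illustrative sketch only: implements the hypothetical
    // VersionAwareDataSourceRegister from the earlier sketch. The package
    // and class names (example.root.Root_v24 / Root_v30) are placeholders
    // for the Root_v24 / Root_v30 classes mentioned in the thread.
    public class RootRegister implements VersionAwareDataSourceRegister {

      @Override
      public String shortName() {
        return "root";
      }

      @Override
      public Class<?> getImplementation() {
        // Probe for an interface that exists only on Spark 2.4;
        // DataSourceV2 was removed in 3.0, where TableProvider replaces it.
        boolean onSpark24;
        try {
          Class.forName("org.apache.spark.sql.sources.v2.DataSourceV2");
          onSpark24 = true;
        } catch (ClassNotFoundException e) {
          onSpark24 = false;
        }
        try {
          return onSpark24
              ? Class.forName("example.root.Root_v24")   // implements DataSourceV2
              : Class.forName("example.root.Root_v30");  // implements TableProvider
        } catch (ClassNotFoundException e) {
          throw new IllegalStateException("Root implementation missing from jar", e);
        }
      }
    }

With something like this registered in META-INF/services, users on either cluster version could keep writing sc.read.format("root").load([list of paths]) instead of spelling out the version-specific class name.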