Hi Yun,

Re: --packages, what I meant to say is that even with PR 1908, the
published version has the "bundle" classifier:

  <metadata modelVersion="1.1.0">
    <groupId>org.apache.polaris</groupId>
    <artifactId>polaris-spark-3.5_2.12</artifactId>
    <versioning>
      <lastUpdated>20250620185923</lastUpdated>
      <snapshot>
        <localCopy>true</localCopy>
      </snapshot>
      <snapshotVersions>
        <snapshotVersion>
          <classifier>bundle</classifier>
          <extension>jar</extension>
          <value>1.1.0-incubating-SNAPSHOT</value>
          <updated>20250620185923</updated>
        </snapshotVersion>
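For concreteness, the two consumption modes discussed in this thread
would look roughly like the sketch below (the coordinate and jar name
are taken from the docs and the metadata above; the exact spark-shell
flags are illustrative). Note that Spark's --packages accepts plain
groupId:artifactId:version coordinates, so an artifact published only
under a "bundle" classifier cannot be pulled that way:

    # thin client via --packages: Spark resolves transitive deps from Maven
    spark-shell --packages org.apache.polaris:polaris-spark-3.5_2.12:1.0.0

    # bundle jar via --jars: everything packed in, no resolution at startup
    spark-shell --jars polaris-spark-3.5_2.12-1.1.0-incubating-SNAPSHOT-bundle.jar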
I manually tested with Spark locally and it seems to work. However, I
thought that caused issues before. WDYT?

Re: compiling against shaded packages - I still believe it is not ideal
from a maintenance POV. Yet, I do not insist on reworking this.

Cheers,
Dmitri.

On Fri, Jun 20, 2025 at 5:09 PM yun zou <yunzou.colost...@gmail.com> wrote:

> Hi Dmitri,
>
> Regarding this question:
>
> *Current docs [1] suggest using
> `--packages org.apache.polaris:polaris-spark-3.5_2.12:1.0.0`, but PR 1908
> produces `polaris-spark-3.5_2.12-1.1.0-incubating-SNAPSHOT-bundle.jar`
> (note: bundle, disregard version).*
>
> The version number in the bundle jar comes from the current version
> file in the repo, which is why the one you see is
> xxx-incubating-SNAPSHOT-bundle.jar.
> Furthermore, the bundle jar is published for the --jars use case, not
> for the --packages use case. There are two ways to use the Spark Client
> with Spark:
> 1) --packages, where the dependencies are downloaded automatically
> 2) --jars, where the bundle jar contains everything the user needs,
> without any extra dependency downloads
>
> When the user uses --packages, they are using the package we formally
> publish to Maven, which I believe will no longer have
> "incubating-SNAPSHOT" in the version, so 1.0.0 will be the right version
> for actual use once we release 1.0.0. Furthermore, what we give in the
> doc is always just an example; we phrase it like:
> "
> Assume the released Polaris Spark client you want to use is
> `org.apache.polaris:polaris-spark-3.5_2.12:1.0.0`
> "
> So it is up to the user to pick the version they want among the
> published versions, which for now is only 1.0.0, but later we might
> publish 1.1.0, 1.2.0, etc.
>
> *Instead of compiling against relocated classes, why don't we compile
> against the original Jackson jar, and later relocate the Spark Client to
> "org.apache.iceberg.shaded.com.fasterxml.jackson.*"?*
>
> Regarding this, I think it is correct for the Spark Client to use the
> shaded jar inside the Iceberg Spark client, because our Spark Client is
> supposed to fully depend on, and be compatible with,
> iceberg-spark-runtime. We intend to use the libraries shipped directly
> with iceberg-spark-runtime to avoid any potential incompatibilities;
> that includes the RESTClient, the Iceberg REST request classes, etc.
> If we use our own Jackson library and relocate it to
> org.apache.iceberg.*, first of all, I don't know whether that will work;
> beyond that, we could also end up with two different Jackson versions,
> which might introduce compatibility issues, especially since we use the
> RESTClient shipped along with iceberg-spark-runtime. Furthermore, it is
> very confusing to relocate our classes into the org.apache.iceberg.*
> namespace; to me, that is even worse than skipping the shaded-import
> check.
> In my view, it is correct for the Spark Client to use the shaded library
> from the Iceberg Spark client, and we should not be too concerned about
> skipping the import check for the Spark Client project, as long as we
> are clear about the goal we are trying to achieve.
>
> WDYT?
>
> Best Regards,
> Yun
>
>
> On Fri, Jun 20, 2025 at 12:58 PM Yufei Gu <flyrain...@gmail.com> wrote:
>
> > It's simpler to maintain one version for the same dependency instead
> > of two, and there is less confusion for developers -- I can foresee
> > anyone looking at the build script asking which Jackson version the
> > Spark client eventually ships.
> > Upgrading the version is straightforward. But I'd like to know more
> > details on why compiling against a shaded package is preferable here.
> > Would you mind providing these details?
> >
> > Yufei
> >
> >
> > On Fri, Jun 20, 2025 at 12:32 PM Dmitri Bourlatchkov <di...@apache.org>
> > wrote:
> >
> > > In any case, IMHO, even updating Jackson version numbers in two
> > > places is preferable to compiling against shaded packages.
> > >
> > > On Fri, Jun 20, 2025 at 3:25 PM Dmitri Bourlatchkov <di...@apache.org>
> > > wrote:
> > >
> > > > I suppose we should be able to get the version of Jackson used by
> > > > Iceberg from Iceberg POM information, right?
> > > >
> > > > Cheers,
> > > > Dmitri.
> > > >
> > > > On Fri, Jun 20, 2025 at 3:08 PM Yufei Gu <flyrain...@gmail.com> wrote:
> > > >
> > > >> That's an interesting idea. But it requires us to maintain the
> > > >> consistency of the Jackson version in two places instead of one.
> > > >> The original Jackson version has to match the one shaded in the
> > > >> Iceberg Spark runtime. Every time we update one, we have to
> > > >> remember to update the other. I'm not sure it improves the
> > > >> situation.
> > > >>
> > > >> Yufei
> > > >>
> > > >>
> > > >> On Fri, Jun 20, 2025 at 11:43 AM Dmitri Bourlatchkov <
> > > >> di...@apache.org> wrote:
> > > >>
> > > >> > Hi Yun and Yufei,
> > > >> >
> > > >> > > Specifically, why does CreateGenericTableRESTRequest use the
> > > >> > > shaded Jackson?
> > > >> >
> > > >> > As discussed off list, request / response payload classes have
> > > >> > to work with the version of Jackson included with the Iceberg
> > > >> > Spark jars (because they own the RESTClient).
> > > >> >
> > > >> > That in itself is fine.
> > > >> >
> > > >> > I'd like to propose a different approach to implementing that
> > > >> > in Polaris, though.
> > > >> >
> > > >> > Instead of compiling against relocated classes, why don't we
> > > >> > compile against the original Jackson jar, and later relocate
> > > >> > the Spark Client to
> > > >> > "org.apache.iceberg.shaded.com.fasterxml.jackson.*"?
> > > >> >
> > > >> > I believe Jackson is the only relocation concern.
> > > >> >
> > > >> > After relocation we can publish both the "thin" client for use
> > > >> > with --packages in Spark, and the "fat" jar for use with
> > > >> > --jars. Both artifacts will depend on the relocated Iceberg
> > > >> > artifacts.
> > > >> >
> > > >> > WDYT?
> > > >> >
> > > >> > Cheers,
> > > >> > Dmitri.
> > > >> >
> > > >> > On Fri, Jun 20, 2025 at 1:05 PM Dmitri Bourlatchkov <
> > > >> > di...@apache.org> wrote:
> > > >> >
> > > >> > > Thanks for the quick response, Yun!
> > > >> > >
> > > >> > > > org.apache.polaris#polaris-core
> > > >> > > > org.apache.iceberg#iceberg-spark-runtime-3.5_2.12
> > > >> > >
> > > >> > > IIRC, polaris-core uses Jackson. iceberg-spark-runtime also
> > > >> > > uses Jackson, but it shades it.
> > > >> > >
> > > >> > > I believe I saw issues with using both shaded and non-shaded
> > > >> > > Jackson in the same Spark env. with Iceberg.
> > > >> > >
> > > >> > > This may or may not be a concern for our Spark Client. What I
> > > >> > > mean is that it may need some more consideration to be sure.
> > > >> > >
> > > >> > > Specifically, why does CreateGenericTableRESTRequest use the
> > > >> > > shaded Jackson?
> > > >> > >
> > > >> > > WDYT?
> > > >> > >
> > > >> > > Thanks,
> > > >> > > Dmitri.
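A minimal sketch of the relocation idea quoted above, assuming the
Gradle Shadow plugin; the plugin version, Jackson version, and module
wiring are illustrative assumptions, not the actual Polaris build:

    plugins {
        java
        id("com.github.johnrengelman.shadow") version "8.1.1"
    }

    dependencies {
        // compile against the original, unrelocated Jackson
        implementation("com.fasterxml.jackson.core:jackson-databind:2.15.2")
    }

    tasks.shadowJar {
        // at packaging time, rewrite Jackson references to the namespace
        // that iceberg-spark-runtime ships, so both resolve to the same
        // classes at runtime
        relocate(
            "com.fasterxml.jackson",
            "org.apache.iceberg.shaded.com.fasterxml.jackson"
        )
    }

The trade-off Yufei raises applies directly to such a sketch: the
jackson-databind version declared here would have to track whatever
version iceberg-spark-runtime actually shades, so it lives in two places.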
> > > >> > >
> > > >> > > On Fri, Jun 20, 2025 at 12:47 PM yun zou <
> > > >> > > yunzou.colost...@gmail.com> wrote:
> > > >> > >
> > > >> > >> *-- What is the maven artifact that Spark can automatically
> > > >> > >> pull (via --packages)?*
> > > >> > >>
> > > >> > >> Our Spark client pulls the following:
> > > >> > >>
> > > >> > >> org.apache.polaris#polaris-spark-3.5_2.12
> > > >> > >> org.apache.polaris#polaris-core
> > > >> > >> org.apache.polaris#polaris-api-management-model
> > > >> > >> org.apache.iceberg#iceberg-spark-runtime-3.5_2.12
> > > >> > >>
> > > >> > >> Prior to the change, it also pulled iceberg-core and avro
> > > >> > >> 1.20.0.
> > > >> > >>
> > > >> > >> *-- Does that artifact use shaded dependencies?*
> > > >> > >>
> > > >> > >> Any usage of classes from iceberg-spark-runtime uses the
> > > >> > >> shaded libraries shipped along with that artifact.
> > > >> > >>
> > > >> > >> *-- Does that artifact depend on the Iceberg Spark bundle?*
> > > >> > >>
> > > >> > >> If you are referring to our Spark client, it depends on
> > > >> > >> iceberg-spark-runtime, not on other bundles.
> > > >> > >>
> > > >> > >> *-- Is the _code_ running in Spark the same when the Polaris
> > > >> > >> Spark Client is pulled via --packages and via --jars?*
> > > >> > >>
> > > >> > >> Yes, the jar and the package use the same code; the jar
> > > >> > >> simply packs everything for the user, so there is no need to
> > > >> > >> download any other dependency.
> > > >> > >>
> > > >> > >> Best Regards,
> > > >> > >> Yun
> > > >> > >>
> > > >> > >> On Fri, Jun 20, 2025 at 9:18 AM Dmitri Bourlatchkov <
> > > >> > >> di...@apache.org> wrote:
> > > >> > >>
> > > >> > >> > Some questions for clarification:
> > > >> > >> >
> > > >> > >> > * What is the maven artifact that Spark can automatically
> > > >> > >> > pull (via --packages)?
> > > >> > >> > * Does that artifact use shaded dependencies?
> > > >> > >> > * Does that artifact depend on the Iceberg Spark bundle?
> > > >> > >> > * Is the _code_ running in Spark the same when the Polaris
> > > >> > >> > Spark Client is pulled via --packages and via --jars?
> > > >> > >> >
> > > >> > >> > I know I could have figured that out from the code, but
> > > >> > >> > I'm asking here because I think we may need to review our
> > > >> > >> > approach to publishing these artifacts.
> > > >> > >> >
> > > >> > >> > I believe that regardless of the method of including the
> > > >> > >> > Client into the Spark runtime, the code has to be exactly
> > > >> > >> > the same... and I doubt it is the same now. WDYT?
> > > >> > >> >
> > > >> > >> > Thanks,
> > > >> > >> > Dmitri.
> > > >> > >> >
> > > >> > >> > On Fri, Jun 20, 2025 at 10:15 AM Dmitri Bourlatchkov <
> > > >> > >> > di...@apache.org> wrote:
> > > >> > >> >
> > > >> > >> > > Hi All,
> > > >> > >> > >
> > > >> > >> > > Re: PR [1908] let's use this thread to clarify the
> > > >> > >> > > problems we're trying to solve and options for solutions.
> > > >> > >> > >
> > > >> > >> > > As for me, it looks like some refactoring in the way the
> > > >> > >> > > Spark Client is built and published may be needed.
> > > >> > >> > >
> > > >> > >> > > I think it makes sense to clarify this before 1.0, to
> > > >> > >> > > avoid changes to Maven coordinates right after 1.0.
> > > >> > >> > >
> > > >> > >> > > [1908] https://github.com/apache/polaris/pull/1908
> > > >> > >> > >
> > > >> > >> > > Thanks,
> > > >> > >> > > Dmitri.
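For reference, publishing both artifacts Dmitri describes (a thin jar
for --packages and a classified bundle jar for --jars) from one Gradle
module might look roughly like the following hypothetical sketch using
maven-publish plus Shadow; the publication name and the "bundle"
classifier are illustrative assumptions, not the actual Polaris build:

    plugins {
        java
        `maven-publish`
        id("com.github.johnrengelman.shadow") version "8.1.1"
    }

    tasks.shadowJar {
        // distinguishes the fat jar from the thin jar under the same
        // Maven coordinates
        archiveClassifier.set("bundle")
    }

    publishing {
        publications {
            create<MavenPublication>("polarisSpark") {
                // thin jar plus POM dependencies: what --packages resolves
                from(components["java"])
                // fat "bundle" jar alongside it: what --jars consumes
                artifact(tasks.shadowJar)
            }
        }
    }

Under a layout like this, the snapshot metadata at the top of the thread
would list an unclassified jar next to the "bundle" one, which is what
--packages needs.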