Hi Dmitri,

I think there might be a misunderstanding about how the jars and packages
are published. The shadowJar task publishes the bundle jar for the
direct-jar use case, where all dependencies are packed in and users pass
the jar to Spark like the following:

  --jars polaris-spark-3.5_2.12-1.1.0-incubating-SNAPSHOT-bundle.jar

This is different from the --packages use case, which uses the regular
project jar without a classifier; all dependencies are resolved and
downloaded at installation time. Once formally released, Spark users can
use it like the following:

  --packages org.apache.polaris:polaris-spark-3.5_2.12:1.1.0
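Put together, the two modes look like this (a sketch only: the version
numbers are illustrative, and the catalog configuration flags are
omitted):

  # Direct-jar mode: the bundle jar packs all required dependencies.
  spark-shell \
    --jars polaris-spark-3.5_2.12-1.1.0-incubating-SNAPSHOT-bundle.jar

  # Package mode: Spark resolves the thin jar plus its POM-declared
  # dependencies (polaris-core, iceberg-spark-runtime, ...) at startup.
  spark-shell \
    --packages org.apache.polaris:polaris-spark-3.5_2.12:1.1.0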
Note that the regular project jar cannot be used directly with --jars
without manually adding the other dependency jars, because it does not
pack the other necessary dependencies. That is why we also publish the
bundle jar, which serves the direct-jar use case.

You might be confused by my previous PR
<https://github.com/apache/polaris/pull/1857>, where I thought I needed
to remove the classifier to make the package use case work. I believe I
later clarified that it was a false alarm: we do not need the bundle jar
for the `--packages` use case.

I have manually verified both use cases. We have test automation for the
jar use case, and I have followed up to investigate how to add a
regression test for the package use case as well.

Best Regards,
Yun

On Fri, Jun 20, 2025 at 3:23 PM Dmitri Bourlatchkov <di...@apache.org> wrote:

> Hi Yun,
>
> Re: --packages, what I meant to say is that even with PR 1908, the
> published version has the "bundle" classifier.
>
> <metadata modelVersion="1.1.0">
>   <groupId>org.apache.polaris</groupId>
>   <artifactId>polaris-spark-3.5_2.12</artifactId>
>   <versioning>
>     <lastUpdated>20250620185923</lastUpdated>
>     <snapshot>
>       <localCopy>true</localCopy>
>     </snapshot>
>     <snapshotVersions>
>       <snapshotVersion>
>         <classifier>bundle</classifier>
>         <extension>jar</extension>
>         <value>1.1.0-incubating-SNAPSHOT</value>
>         <updated>20250620185923</updated>
>       </snapshotVersion>
>
> I manually tested with Spark locally and it seems to work. However, I
> thought that caused issues before. WDYT?
>
> Re: compiling against shaded packages - I still believe that it is not
> nice from the maintenance POV. Yet, I do not insist on reworking this.
>
> Cheers,
> Dmitri.
>
> On Fri, Jun 20, 2025 at 5:09 PM yun zou <yunzou.colost...@gmail.com> wrote:
>
> > Hi Dmitri,
> >
> > Regarding this question:
> >
> > *Current docs [1] suggest using
> > `--packages org.apache.polaris:polaris-spark-3.5_2.12:1.0.0` but PR 1908
> > produces `polaris-spark-3.5_2.12-1.1.0-incubating-SNAPSHOT-bundle.jar`
> > (note: bundle, disregard version).*
> >
> > The version number used in the bundle jar is produced from the version
> > number in the current version file in the repo, therefore the one you
> > see is xxx-incubating-SNAPSHOT-bundle.jar. Furthermore, the bundle jar
> > is published for the jar use case, not for the package use case. There
> > are two ways to use the Spark Client with Spark:
> > 1) use --packages, where the dependencies are downloaded automatically
> > 2) use --jars, where the bundle jar contains everything the user needs
> > without any extra dependency download
> >
> > When the user uses --packages, it is using the package we formally
> > publish to Maven, which I believe will not have "incubating-SNAPSHOT"
> > in the version anymore, so 1.0.0 will be the right version for actual
> > use when we release 1.0.0.
> > Furthermore, what we give in the doc is always just an example, where
> > we phrase it like:
> > "Assume the released Polaris Spark client you want to use is
> > `org.apache.polaris:polaris-spark-3.5_2.12:1.0.0`"
> > So it is up to the user to pick the version they want among the
> > published versions, which will only be 1.0.0 for now, but later we
> > might publish 1.1.0, 1.2.0, etc.
> >
> > *Instead of compiling against relocated classes, why don't we compile
> > against the original Jackson jar, and later relocate the Spark Client
> > to "org.apache.iceberg.shaded.com.fasterxml.jackson.*" ?*
> >
> > Regarding this, I think it is correct for the Spark Client to use the
> > shaded jar in the Iceberg Spark client, because our Spark Client is
> > supposed to fully depend on and be compatible with
> > iceberg-spark-runtime: we intend to use all libraries directly shipped
> > with iceberg-spark-runtime to avoid any potential incompatibilities,
> > including RESTClient, Iceberg RestRequest, etc. If we use our own
> > Jackson library and relocate it to org.apache.iceberg, first of all, I
> > don't know whether it will work or not; beyond that, it could also end
> > up with two different Jackson versions, which might introduce
> > compatibility issues, especially since we use the RESTClient shipped
> > along with iceberg-spark-runtime. Furthermore, it is very confusing to
> > relocate it to the org.apache.iceberg* namespace; to me, that is even
> > worse than skipping the shaded check. In my view, it is correct for
> > the Spark client to use the shaded library from iceberg-spark-runtime,
> > and we should not be so concerned about skipping the import check for
> > the Spark client project as long as we are clear about the goal we are
> > trying to achieve.
> >
> > WDYT?
> >
> > Best Regards,
> > Yun
> >
> > On Fri, Jun 20, 2025 at 12:58 PM Yufei Gu <flyrain...@gmail.com> wrote:
> >
> > > It's simpler to maintain one version for the same dependency instead
> > > of two, and there is no confusion for developers -- I can foresee
> > > anyone looking at the build script asking what Jackson the Spark
> > > client eventually ships. Upgrading the version is straightforward.
> > > But I'd like to know more details on why compiling against a shaded
> > > package is preferable here. Would you mind providing these details?
> > >
> > > Yufei
> > >
> > > On Fri, Jun 20, 2025 at 12:32 PM Dmitri Bourlatchkov <di...@apache.org>
> > > wrote:
> > >
> > > > In any case, IMHO, even updating Jackson version numbers in two
> > > > places is preferable to compiling against shaded packages.
> > > >
> > > > On Fri, Jun 20, 2025 at 3:25 PM Dmitri Bourlatchkov <di...@apache.org>
> > > > wrote:
> > > >
> > > > > I suppose we should be able to get the version of Jackson used
> > > > > by Iceberg from Iceberg POM information, right?
> > > > >
> > > > > Cheers,
> > > > > Dmitri.
> > > > >
> > > > > On Fri, Jun 20, 2025 at 3:08 PM Yufei Gu <flyrain...@gmail.com> wrote:
> > > > >
> > > > >> That's an interesting idea. But it requires us to maintain the
> > > > >> consistency of the Jackson version in two places instead of
> > > > >> one. The original Jackson version has to match the one shaded
> > > > >> in the Iceberg Spark runtime. Every time we update one, we have
> > > > >> to remember to update the other. I'm not sure it improves the
> > > > >> situation.
> > > > >>
> > > > >> Yufei
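As a concrete illustration of the shading being discussed:
iceberg-spark-runtime relocates its Jackson (among other dependencies)
under the org.apache.iceberg.shaded namespace, which you can see by
listing the runtime jar's contents (jar file name and version below are
illustrative):

  jar tf iceberg-spark-runtime-3.5_2.12-1.9.0.jar \
    | grep 'shaded/com/fasterxml/jackson' | head -n 3
  # org/apache/iceberg/shaded/com/fasterxml/jackson/databind/...

These relocated class names are what the Spark client currently compiles
against.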
> > > > >>
> > > > >> On Fri, Jun 20, 2025 at 11:43 AM Dmitri Bourlatchkov <di...@apache.org>
> > > > >> wrote:
> > > > >>
> > > > >> > Hi Yun and Yufei,
> > > > >> >
> > > > >> > > Specifically, why does CreateGenericTableRESTRequest use
> > > > >> > > the shaded Jackson?
> > > > >> >
> > > > >> > As discussed off list, request / response payload classes
> > > > >> > have to work with the version of Jackson included with the
> > > > >> > Iceberg Spark jars (because they own the RESTClient).
> > > > >> >
> > > > >> > That in itself is fine.
> > > > >> >
> > > > >> > I'd like to propose a different approach to implementing that
> > > > >> > in Polaris, though.
> > > > >> >
> > > > >> > Instead of compiling against relocated classes, why don't we
> > > > >> > compile against the original Jackson jar, and later relocate
> > > > >> > the Spark Client to
> > > > >> > "org.apache.iceberg.shaded.com.fasterxml.jackson.*" ?
> > > > >> >
> > > > >> > I believe Jackson is the only relocation concern.
> > > > >> >
> > > > >> > After relocation we can publish both the "thin" client for
> > > > >> > use with --packages in Spark, and the "fat" jar for use with
> > > > >> > --jars. Both artifacts will depend on the relocated Iceberg
> > > > >> > artifacts.
> > > > >> >
> > > > >> > WDYT?
> > > > >> >
> > > > >> > Cheers,
> > > > >> > Dmitri.
> > > > >> >
> > > > >> > On Fri, Jun 20, 2025 at 1:05 PM Dmitri Bourlatchkov <di...@apache.org>
> > > > >> > wrote:
> > > > >> >
> > > > >> > > Thanks for the quick response, Yun!
> > > > >> > >
> > > > >> > > > org.apache.polaris#polaris-core
> > > > >> > > > org.apache.iceberg#iceberg-spark-runtime-3.5_2.12
> > > > >> > >
> > > > >> > > IIRC, polaris-core uses Jackson. iceberg-spark-runtime also
> > > > >> > > uses Jackson, but it shades it.
> > > > >> > >
> > > > >> > > I believe I saw issues with using both shaded and
> > > > >> > > non-shaded Jackson in the same Spark env. with Iceberg.
> > > > >> > >
> > > > >> > > This may or may not be a concern for our Spark Client. What
> > > > >> > > I mean is that it may need some more consideration to be
> > > > >> > > sure.
> > > > >> > >
> > > > >> > > Specifically, why does CreateGenericTableRESTRequest use
> > > > >> > > the shaded Jackson?
> > > > >> > >
> > > > >> > > WDYT?
> > > > >> > >
> > > > >> > > Thanks,
> > > > >> > > Dmitri.
> > > > >> > >
> > > > >> > > On Fri, Jun 20, 2025 at 12:47 PM yun zou <yunzou.colost...@gmail.com>
> > > > >> > > wrote:
> > > > >> > >
> > > > >> > >> *-- What is the maven artifact that Spark can
> > > > >> > >> automatically pull (via --packages)*
> > > > >> > >>
> > > > >> > >> Our spark client pulls the following:
> > > > >> > >>
> > > > >> > >> org.apache.polaris#polaris-spark-3.5_2.12
> > > > >> > >> org.apache.polaris#polaris-core
> > > > >> > >> org.apache.polaris#polaris-api-management-model
> > > > >> > >> org.apache.iceberg#iceberg-spark-runtime-3.5_2.12
> > > > >> > >>
> > > > >> > >> Prior to the change, it also pulled iceberg-core and avro
> > > > >> > >> 1.20.0.
> > > > >> > >>
> > > > >> > >> *-- Does that artifact use shaded dependencies*
> > > > >> > >>
> > > > >> > >> Any usage of classes from iceberg-spark-runtime uses the
> > > > >> > >> shaded libraries shipped along with the artifacts.
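One way to check the concern above about shaded and unshaded Jackson
coexisting is to list the Jackson packages a given jar actually ships
(jar name and version are illustrative, and the expected output is an
assumption to be verified):

  jar tf polaris-spark-3.5_2.12-1.1.0-incubating-SNAPSHOT-bundle.jar \
    | grep -i jackson | sed 's|/[^/]*$||' | sort -u
  # Relocated copies appear under org/apache/iceberg/shaded/...; any
  # com/fasterxml/jackson/... entries would be a second, unshaded copy
  # (e.g., pulled in via polaris-core).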
> > > > >> > >>
> > > > >> > >> *-- Does that artifact depend on the Iceberg Spark bundle?*
> > > > >> > >>
> > > > >> > >> If you are referring to our spark client, it depends on
> > > > >> > >> iceberg-spark-runtime, not other bundles.
> > > > >> > >>
> > > > >> > >> *-- Is the _code_ running in Spark the same when the
> > > > >> > >> Polaris Spark Client is pulled via --packages and via
> > > > >> > >> --jars?*
> > > > >> > >>
> > > > >> > >> Yes, the jar and the package use the same code; the jar
> > > > >> > >> simply packs everything for the user, so there is no need
> > > > >> > >> to download any other dependency.
> > > > >> > >>
> > > > >> > >> Best Regards,
> > > > >> > >> Yun
> > > > >> > >>
> > > > >> > >> On Fri, Jun 20, 2025 at 9:18 AM Dmitri Bourlatchkov <di...@apache.org>
> > > > >> > >> wrote:
> > > > >> > >>
> > > > >> > >> > Some questions for clarification:
> > > > >> > >> >
> > > > >> > >> > * What is the maven artifact that Spark can
> > > > >> > >> > automatically pull (via --packages)?
> > > > >> > >> > * Does that artifact use shaded dependencies?
> > > > >> > >> > * Does that artifact depend on the Iceberg Spark bundle?
> > > > >> > >> > * Is the _code_ running in Spark the same when the
> > > > >> > >> > Polaris Spark Client is pulled via --packages and via
> > > > >> > >> > --jars?
> > > > >> > >> >
> > > > >> > >> > I know I could have figured that out from code, but I'm
> > > > >> > >> > asking here because I think we may need to review our
> > > > >> > >> > approach to publishing these artifacts.
> > > > >> > >> >
> > > > >> > >> > I believe that regardless of the method of including the
> > > > >> > >> > Client into the Spark runtime, the code has to be
> > > > >> > >> > exactly the same... and I doubt it is the same now.
> > > > >> > >> > WDYT?
> > > > >> > >> >
> > > > >> > >> > Thanks,
> > > > >> > >> > Dmitri.
> > > > >> > >> >
> > > > >> > >> > On Fri, Jun 20, 2025 at 10:15 AM Dmitri Bourlatchkov <di...@apache.org>
> > > > >> > >> > wrote:
> > > > >> > >> >
> > > > >> > >> > > Hi All,
> > > > >> > >> > >
> > > > >> > >> > > Re: PR [1908], let's use this thread to clarify the
> > > > >> > >> > > problems we're trying to solve and options for
> > > > >> > >> > > solutions.
> > > > >> > >> > >
> > > > >> > >> > > As for me, it looks like some refactoring in the way
> > > > >> > >> > > the Spark Client is built and published may be needed.
> > > > >> > >> > >
> > > > >> > >> > > I think it makes sense to clarify this before 1.0 to
> > > > >> > >> > > avoid changes to Maven coordinates right after 1.0.
> > > > >> > >> > >
> > > > >> > >> > > [1908] https://github.com/apache/polaris/pull/1908
> > > > >> > >> > >
> > > > >> > >> > > Thanks,
> > > > >> > >> > > Dmitri.
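For the record, the set of artifacts resolved via --packages can be
confirmed from Spark's Ivy output at startup (version illustrative; the
"found" lines should match the list quoted above):

  spark-shell --packages org.apache.polaris:polaris-spark-3.5_2.12:1.1.0
  # The startup output includes Ivy resolution lines such as:
  #   found org.apache.polaris#polaris-spark-3.5_2.12;... in central
  #   found org.apache.polaris#polaris-core;... in central
  #   found org.apache.polaris#polaris-api-management-model;... in central
  #   found org.apache.iceberg#iceberg-spark-runtime-3.5_2.12;... in central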