Hi Dmitri,

Regarding this question:
*Current docs [1] suggest using `--packages org.apache.polaris:polaris-spark-3.5_2.12:1.0.0` but PR 1908 produces `polaris-spark-3.5_2.12-1.1.0-incubating-SNAPSHOT-bundle.jar` (note: bundle, disregard version).*

The version number in the bundle jar comes from the version file currently in the repo, which is why the one you see is xxx-incubating-SNAPSHOT-bundle.jar. Also note that the bundle jar is published for the --jars use case, not for the --packages use case. There are two ways to use the Spark client with Spark:

1) use --packages, where the dependencies are downloaded automatically
2) use --jars, where the bundle jar contains everything the user needs, with no extra dependency download

When the user uses --packages, they consume the package we formally publish to Maven, which I believe will no longer have "incubating-SNAPSHOT" in the version, so 1.0.0 will be the right version for actual use once we release 1.0.0. Furthermore, what we give in the doc is always just an example; we phrase it like "Assume the released Polaris Spark client you want to use is `org.apache.polaris:polaris-spark-3.5_2.12:1.0.0`". It is up to the user to pick the version they want among the published versions, which is only 1.0.0 for now, but later we might publish 1.1.0, 1.2.0, etc.

*Instead of compiling against relocated classes, why don't we compile against the original Jackson jar, and later relocate the Spark Client to "org.apache.iceberg.shaded.com.fasterxml.jackson.*" ?*

Regarding this, I think it is correct for the Spark client to use the shaded jar in iceberg-spark-runtime, because our Spark client is supposed to fully depend on and be compatible with iceberg-spark-runtime. We intend to use the libraries shipped directly in iceberg-spark-runtime to avoid any potential incompatibilities; this includes the RESTClient, the Iceberg REST request classes, etc.
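To make the shading point concrete, here is a sketch (not actual Polaris code; the field shown is illustrative) of what compiling a payload class against Iceberg's shaded Jackson looks like, assuming iceberg-spark-runtime is on the compile classpath:

```java
// Sketch only: a request payload class compiled against the Jackson copy
// that iceberg-spark-runtime shades and ships, so (de)serialization uses
// the exact Jackson version the Iceberg RESTClient uses at runtime.
// The "name" field is an illustrative placeholder, not the real class body.
import org.apache.iceberg.shaded.com.fasterxml.jackson.annotation.JsonProperty;

public class CreateGenericTableRESTRequest {
    @JsonProperty("name")
    private String name;

    public String name() {
        return name;
    }
}
```

This is why the import check has to be skipped for this project: the compiler legitimately references `org.apache.iceberg.shaded.*` packages.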
If we use our own Jackson library and relocate it to the org.apache.iceberg namespace, first of all, I don't know whether that would even work. Beyond that, it could also end up with two different Jackson versions on the classpath, which might introduce compatibility issues, especially since we use the RESTClient shipped along with iceberg-spark-runtime. Furthermore, relocating our classes into the org.apache.iceberg* namespace would be very confusing; to me, that is even worse than skipping the shaded-import check. In my view, it is correct for the Spark client to use the shaded library from iceberg-spark-runtime, and we should not be too concerned about skipping the import check for the Spark client project as long as we are clear about the goal we are trying to achieve.

WDYT?

Best Regards,
Yun

On Fri, Jun 20, 2025 at 12:58 PM Yufei Gu <flyrain...@gmail.com> wrote:

> It's simpler to maintain one version for the same dependency instead of
> two. There is no confusion for developers -- I can foresee anyone looking
> at the build script will ask what the Jackson Spark client eventually
> shipped. Upgrading the version is straightforward. But I'd like to know
> more details why compiling against a shaded package is preferable here.
> Would you mind providing these details?
>
> Yufei
>
> On Fri, Jun 20, 2025 at 12:32 PM Dmitri Bourlatchkov <di...@apache.org>
> wrote:
>
> > In any case, IMHO, even updating jackson version numbers in two places is
> > preferable to compiling against shaded packages.
> >
> > On Fri, Jun 20, 2025 at 3:25 PM Dmitri Bourlatchkov <di...@apache.org>
> > wrote:
> >
> > > I suppose we should be able to get the version of Jackson used by Iceberg
> > > from Iceberg POM information, right?
> > >
> > > Cheers,
> > > Dmitri.
> > >
> > > On Fri, Jun 20, 2025 at 3:08 PM Yufei Gu <flyrain...@gmail.com> wrote:
> > >
> > >> That's an interesting idea. But it requires us to maintain the consistency
> > >> of the Jackson version in two places instead of one.
> > >> The original Jackson version has to match with the one shaded in Iceberg
> > >> spark runtime. Every time we update one, we have to remember to update
> > >> another. I'm not sure if it improves the situation.
> > >>
> > >> Yufei
> > >>
> > >> On Fri, Jun 20, 2025 at 11:43 AM Dmitri Bourlatchkov <di...@apache.org>
> > >> wrote:
> > >>
> > >> > Hi Yun and Yufei,
> > >> >
> > >> > > Specifically, why does CreateGenericTableRESTRequest use the shaded
> > >> > > Jackson?
> > >> >
> > >> > As discussed off list, request / response payload classes have to work
> > >> > with the version of Jackson included with the Iceberg Spark jars
> > >> > (because they own the RESTClient).
> > >> >
> > >> > That in itself is fine.
> > >> >
> > >> > I'd like to propose a different approach to implementing that in
> > >> > Polaris, though.
> > >> >
> > >> > Instead of compiling against relocated classes, why don't we compile
> > >> > against the original Jackson jar, and later relocate the Spark Client
> > >> > to "org.apache.iceberg.shaded.com.fasterxml.jackson.*" ?
> > >> >
> > >> > I believe Jackson is the only relocation concern.
> > >> >
> > >> > After relocation we can publish both the "thin" client for use with
> > >> > --package in Spark, and the "fat" jar for use with --jar. Both
> > >> > artifacts will depend on the relocated Iceberg artifacts.
> > >> >
> > >> > WDYT?
> > >> >
> > >> > Cheers,
> > >> > Dmitri.
> > >> >
> > >> > On Fri, Jun 20, 2025 at 1:05 PM Dmitri Bourlatchkov <di...@apache.org>
> > >> > wrote:
> > >> >
> > >> > > Thanks for the quick response, Yun!
> > >> > >
> > >> > > > org.apache.polaris#polaris-core
> > >> > > > org.apache.iceberg#iceberg-spark-runtime-3.5_2.12
> > >> > >
> > >> > > IIRC, polaris-core uses Jackson. iceberg-spark-runtime also uses
> > >> > > Jackson, but it shades it.
> > >> > >
> > >> > > I believe I saw issues with using both shaded and non-shaded Jackson
> > >> > > in the same Spark env. with Iceberg.
> > >> > >
> > >> > > This may or may not be a concern for our Spark Client. What I mean is
> > >> > > that it may need some more consideration to be sure.
> > >> > >
> > >> > > Specifically, why does CreateGenericTableRESTRequest use the shaded
> > >> > > Jackson?
> > >> > >
> > >> > > WDYT?
> > >> > >
> > >> > > Thanks,
> > >> > > Dmitri.
> > >> > >
> > >> > > On Fri, Jun 20, 2025 at 12:47 PM yun zou <yunzou.colost...@gmail.com>
> > >> > > wrote:
> > >> > >
> > >> > >> *-- What is the maven artifact that Spark can automatically pull
> > >> > >> (via --packages)*
> > >> > >>
> > >> > >> Our spark client pulls the following:
> > >> > >>
> > >> > >> org.apache.polaris#polaris-spark-3.5_2.12
> > >> > >> org.apache.polaris#polaris-core
> > >> > >> org.apache.polaris#polaris-api-management-model
> > >> > >> org.apache.iceberg#iceberg-spark-runtime-3.5_2.12
> > >> > >>
> > >> > >> Prior to the change, it also pulled iceberg-core and avro 1.20.0.
> > >> > >>
> > >> > >> *-- Does that artifact use shaded dependencies*
> > >> > >>
> > >> > >> Any usage of classes from iceberg-spark-runtime uses the shaded
> > >> > >> libraries shipped along with the artifacts.
> > >> > >>
> > >> > >> *-- Does that artifact depend on the Iceberg Spark bundle?*
> > >> > >>
> > >> > >> If you are referring to our spark client, it depends on
> > >> > >> iceberg-spark-runtime, not other bundles.
> > >> > >>
> > >> > >> *-- Is the _code_ running in Spark the same when the Polaris Spark
> > >> > >> Client is pulled via --packages and via --jars?*
> > >> > >>
> > >> > >> Yes, the jar and the package use the same code; the jar simply packs
> > >> > >> everything for the user, so there is no need to download any other
> > >> > >> dependency.
> > >> > >>
> > >> > >> Best Regards,
> > >> > >> Yun
> > >> > >>
> > >> > >> On Fri, Jun 20, 2025 at 9:18 AM Dmitri Bourlatchkov <di...@apache.org>
> > >> > >> wrote:
> > >> > >>
> > >> > >> > Some questions for clarification:
> > >> > >> >
> > >> > >> > * What is the maven artifact that Spark can automatically pull
> > >> > >> > (via --packages)?
> > >> > >> > * Does that artifact use shaded dependencies?
> > >> > >> > * Does that artifact depend on the Iceberg Spark bundle?
> > >> > >> > * Is the _code_ running in Spark the same when the Polaris Spark
> > >> > >> > Client is pulled via --packages and via --jars?
> > >> > >> >
> > >> > >> > I know I could have figured that out from code, but I'm asking here
> > >> > >> > because I think we may need to review our approach to publishing
> > >> > >> > these artifacts.
> > >> > >> >
> > >> > >> > I believe that regardless of the method of including the Client
> > >> > >> > into Spark runtime, the code has to be exactly the same... and I
> > >> > >> > doubt it is the same now. WDYT?
> > >> > >> >
> > >> > >> > Thanks,
> > >> > >> > Dmitri.
> > >> > >> >
> > >> > >> > On Fri, Jun 20, 2025 at 10:15 AM Dmitri Bourlatchkov <di...@apache.org>
> > >> > >> > wrote:
> > >> > >> >
> > >> > >> > > Hi All,
> > >> > >> > >
> > >> > >> > > Re: PR [1908] let's use this thread to clarify the problems we're
> > >> > >> > > trying to solve and options for solutions.
> > >> > >> > >
> > >> > >> > > As for me, it looks like some refactoring in the way the Spark
> > >> > >> > > Client is built and published may be needed.
> > >> > >> > >
> > >> > >> > > I think it makes sense to clarify this before 1.0 to avoid
> > >> > >> > > changes to Maven coordinates right after 1.0.
> > >> > >> > >
> > >> > >> > > [1908] https://github.com/apache/polaris/pull/1908
> > >> > >> > >
> > >> > >> > > Thanks,
> > >> > >> > > Dmitri.
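For reference, the alternative Dmitri proposed earlier in the thread (compile against plain Jackson, then relocate the Spark client's Jackson usage into Iceberg's shaded namespace) would look roughly like this with the Gradle Shadow plugin. This is only a sketch: the plugin coordinates and the Jackson version shown are assumptions, and the Jackson version would still have to be kept in sync with the one iceberg-spark-runtime shades.

```kotlin
// build.gradle.kts for the Polaris Spark client -- sketch only.
plugins {
    // Shadow plugin coordinates/version are illustrative assumptions.
    id("com.github.johnrengelman.shadow") version "8.1.1"
}

dependencies {
    // Compile against plain (non-relocated) Jackson. The version must
    // match whatever iceberg-spark-runtime shades -- the maintenance
    // burden discussed above.
    implementation("com.fasterxml.jackson.core:jackson-databind:2.15.2")
}

tasks.shadowJar {
    // Rewrite our Jackson references to the namespace that
    // iceberg-spark-runtime shades Jackson into, so the payload classes
    // interoperate with the RESTClient shipped in that runtime jar.
    relocate(
        "com.fasterxml.jackson",
        "org.apache.iceberg.shaded.com.fasterxml.jackson"
    )
}
```

Whether this actually works with the Iceberg-owned RESTClient at runtime is exactly the open question debated above; the snippet only illustrates the mechanics of the proposal.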