Re: [DISCUSS] propdeps removal and what to do going forward

2022-01-19 Thread Kenneth Knowles
On Wed, Jan 19, 2022 at 10:59 AM Brian Hulette  wrote:

> > But I believe the issue is that some tooling will bundle up the
> "compile" dependencies and submit with the job, which will then have a
> conflict with the libraries on the cluster.
>
> Do you know specifically what tooling does this? Also I'm not clear what
> you mean by "cluster", is this referring to remote workers?
>

No I don't, actually. I was being deliberately vague. I think "cluster"
could be Spark (or Flink) workers, etc. There are some flags to tweak whose
dependencies win. Dataflow workers also could have this issue. I guess
something building an uber jar would be the typical example.

To be super clear: I'm pretty much in agreement with Daniel, but I also
think there must be some good reason we should account for that it isn't
done that way already.

To unblock the release, Emily has opened a PR to match the prior pom
exactly, which I also think is smart. We can fix up / eliminate `provided`
deps one or a few at a time moving on.

Kenn


>
> On Tue, Jan 18, 2022 at 6:11 AM Kenneth Knowles  wrote:
>
>>
>>
>> On Fri, Jan 14, 2022 at 9:34 AM Daniel Collins 
>> wrote:
>>
>>> > In particular the Hadoop/Spark and Kafka dependencies must be
>>> **provided** as they were. I am not sure of others but those three matter.
>>>
>>> I think there's a bit of a difference here between what should be the
>>> state in the short term versus the long term.
>>>
>>> In the short term, I agree that we should avoid changes to how these
>>> dependencies are reflected in the POM.
>>>
>>> In the long term, I don't think it makes sense for these to continue to
>>> be "provided" dependencies- if users wish to use a different version of
>>> hadoop, spark or kafka, they can explicitly override the dependencies with
>>> the version they want when building their JAR, even if there is a version
>>> listed as "compile" in the POM file on maven central. The only difference
>>> is that if they don't have a version preference, the one listed in the POM
>>> (that we tested with) will be used, which seems like an unambiguous win to
>>> me.
>>>
>>
>> Agree with the sentiment. But I believe the issue is that some tooling
>> will bundle up the "compile" dependencies and submit with the job, which
>> will then have a conflict with the libraries on the cluster. On the other
>> hand, the user will always want to override the "provided" version to match
>> the cluster, in which case it will just be harmless duplicates on the
>> classpath, no? I guess huge file size, but it isn't the 90s any more. Since
>> Ismaël commented, maybe he can help to clarify. I also knew about this
>> reasoning for Spark & Hadoop but I don't know exactly what is required to
>> make it work right.
>>
>> This could become a bothersome issue long term - Gradle dev community has
>> lots of posts that indicate they don't agree with the existence of
>> "provided" or "optional" dependencies. (I happen to agree with them, but
>> philosophy is not the point). We should have a very clear solution for the
>> cases that require one, and document at least on the wiki.
>>
>> Kenn
>>
>>
>>>
>>> -Daniel
>>>
>>> On Thu, Jan 13, 2022 at 4:19 PM Ismaël Mejía  wrote:
>>>
 Optional dependencies should not be a major issue.

 What matters to validate that we are not breaking users is to compare
 the generated POM files with the previous (pre gradle 7 / 2.35.0)
 version and see that what was provided is still provided.

 In particular the Hadoop/Spark and Kafka dependencies must be
 **provided** as they were. I am not sure of others but those three
 matter.

 Ismaël

 On Wed, Jan 12, 2022 at 10:55 PM Emily Ye  wrote:
 >
 > We've chatted offline and have a tentative plan for what to do with
 these dependencies that are currently marked as compileOnly (instead of
 provided). Please review the list if possible [1].
 >
 > Two projects we aren't sure about:
 >
 > :sdks:java:io:hcatalog
 >
 > library.java.jackson_annotations
 > library.java.jackson_core
 > library.java.jackson_databind
 > library.java.hadoop_common
 > org.apache.hive:hive-exec
 > org.apache.hive.hcatalog:hive-hcatalog-core
 >
 > :sdks:java:io:parquet
 >
 > library.java.hadoop_client
 >
 >
 > Does anyone have experience with either of these IOs? ccing Chamikara
 >
 > Thank you,
 > Emily
 >
 >
 > [1]
 https://docs.google.com/spreadsheets/d/1UpeQtx1PoAgeSmpKxZC9lv3B9G1c7cryW3iICfRtG1o/edit?usp=sharing
 >
 > On Tue, Jan 11, 2022 at 6:38 PM Emily Ye  wrote:
 >>
 >> As the person volunteering to do fixes for this to unblock Beam
 2.36.0, I created a spreadsheet of the projects with dependencies changed
 from provided to compile only [1]. I pre-filled with what I think things
 should be, but I don't have very much background in java/maven/gradle
 configurations so please 

Re: [DISCUSS] propdeps removal and what to do going forward

2022-01-18 Thread Kenneth Knowles
On Fri, Jan 14, 2022 at 9:34 AM Daniel Collins  wrote:

> > In particular the Hadoop/Spark and Kafka dependencies must be
> **provided** as they were. I am not sure of others but those three matter.
>
> I think there's a bit of a difference here between what should be the
> state in the short term versus the long term.
>
> In the short term, I agree that we should avoid changes to how these
> dependencies are reflected in the POM.
>
> In the long term, I don't think it makes sense for these to continue to be
> "provided" dependencies- if users wish to use a different version of
> hadoop, spark or kafka, they can explicitly override the dependencies with
> the version they want when building their JAR, even if there is a version
> listed as "compile" in the POM file on maven central. The only difference
> is that if they don't have a version preference, the one listed in the POM
> (that we tested with) will be used, which seems like an unambiguous win to
> me.
>

Agree with the sentiment. But I believe the issue is that some tooling will
bundle up the "compile" dependencies and submit with the job, which will
then have a conflict with the libraries on the cluster. On the other hand,
the user will always want to override the "provided" version to match the
cluster, in which case it will just be harmless duplicates on the
classpath, no? I guess huge file size, but it isn't the 90s any more. Since
Ismaël commented, maybe he can help to clarify. I also knew about this
reasoning for Spark & Hadoop but I don't know exactly what is required to
make it work right.

This could become a bothersome issue long term - Gradle dev community has
lots of posts that indicate they don't agree with the existence of
"provided" or "optional" dependencies. (I happen to agree with them, but
philosophy is not the point). We should have a very clear solution for the
cases that require one, and document at least on the wiki.

Kenn


>
> -Daniel
>
> On Thu, Jan 13, 2022 at 4:19 PM Ismaël Mejía  wrote:
>
>> Optional dependencies should not be a major issue.
>>
>> What matters to validate that we are not breaking users is to compare
>> the generated POM files with the previous (pre gradle 7 / 2.35.0)
>> version and see that what was provided is still provided.
>>
>> In particular the Hadoop/Spark and Kafka dependencies must be
>> **provided** as they were. I am not sure of others but those three
>> matter.
>>
>> Ismaël
>>
>> On Wed, Jan 12, 2022 at 10:55 PM Emily Ye  wrote:
>> >
>> > We've chatted offline and have a tentative plan for what to do with
>> these dependencies that are currently marked as compileOnly (instead of
>> provided). Please review the list if possible [1].
>> >
>> > Two projects we aren't sure about:
>> >
>> > :sdks:java:io:hcatalog
>> >
>> > library.java.jackson_annotations
>> > library.java.jackson_core
>> > library.java.jackson_databind
>> > library.java.hadoop_common
>> > org.apache.hive:hive-exec
>> > org.apache.hive.hcatalog:hive-hcatalog-core
>> >
>> > :sdks:java:io:parquet
>> >
>> > library.java.hadoop_client
>> >
>> >
>> > Does anyone have experience with either of these IOs? ccing Chamikara
>> >
>> > Thank you,
>> > Emily
>> >
>> >
>> > [1]
>> https://docs.google.com/spreadsheets/d/1UpeQtx1PoAgeSmpKxZC9lv3B9G1c7cryW3iICfRtG1o/edit?usp=sharing
>> >
>> > On Tue, Jan 11, 2022 at 6:38 PM Emily Ye  wrote:
>> >>
>> >> As the person volunteering to do fixes for this to unblock Beam
>> 2.36.0, I created a spreadsheet of the projects with dependencies changed
>> from provided to compile only [1]. I pre-filled with what I think things
>> should be, but I don't have very much background in java/maven/gradle
>> configurations so please give input!
>> >>
>> >> Some (mainly hadoop/kafka) I left blank, since I'm not sure - do we
>> keep them provided because it depends on the user's version?
>> >>
>> >> [1]
>> https://docs.google.com/spreadsheets/d/1UpeQtx1PoAgeSmpKxZC9lv3B9G1c7cryW3iICfRtG1o/edit?usp=sharing
>> >>
>> >> On Tue, Jan 11, 2022 at 1:17 PM Luke Cwik  wrote:
>> >>>
>> >>> I'm not following what you're trying to say Kenn since provided in
>> maven requires the user to explicitly add the dependency themselves to have
>> it part of their runtime.
>> >>>
>> >>> As per
>> https://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html#dependency-scope
>> >>> "
>> >>> * provided
>> >>> This is much like compile, but indicates you expect the JDK or a
>> container to provide the dependency at runtime. For example, when building
>> a web application for the Java Enterprise Edition, you would set the
>> dependency on the Servlet API and related Java EE APIs to scope provided
>> because the web container provides those classes. A dependency with this
>> scope is added to the classpath used for compilation and test, but not the
>> runtime classpath. It is not transitive."
>> >>>
>> >>> On Tue, Jan 11, 2022 at 11:54 AM Kenneth Knowles 
>> wrote:
>> 
>>  To clarify: "provided" should have 

Re: [DISCUSS] propdeps removal and what to do going forward

2022-01-13 Thread Ismaël Mejía
Optional dependencies should not be a major issue.

What matters to validate that we are not breaking users is to compare
the generated POM files with the previous (pre gradle 7 / 2.35.0)
version and see that what was provided is still provided.

In particular the Hadoop/Spark and Kafka dependencies must be
**provided** as they were. I am not sure of others but those three
matter.

Ismaël

On Wed, Jan 12, 2022 at 10:55 PM Emily Ye  wrote:
>
> We've chatted offline and have a tentative plan for what to do with these 
> dependencies that are currently marked as compileOnly (instead of provided). 
> Please review the list if possible [1].
>
> Two projects we aren't sure about:
>
> :sdks:java:io:hcatalog
>
> library.java.jackson_annotations
> library.java.jackson_core
> library.java.jackson_databind
> library.java.hadoop_common
> org.apache.hive:hive-exec
> org.apache.hive.hcatalog:hive-hcatalog-core
>
> :sdks:java:io:parquet
>
> library.java.hadoop_client
>
>
> Does anyone have experience with either of these IOs? ccing Chamikara
>
> Thank you,
> Emily
>
>
> [1] 
> https://docs.google.com/spreadsheets/d/1UpeQtx1PoAgeSmpKxZC9lv3B9G1c7cryW3iICfRtG1o/edit?usp=sharing
>
> On Tue, Jan 11, 2022 at 6:38 PM Emily Ye  wrote:
>>
>> As the person volunteering to do fixes for this to unblock Beam 2.36.0, I 
>> created a spreadsheet of the projects with dependencies changed from 
>> provided to compile only [1]. I pre-filled with what I think things should 
>> be, but I don't have very much background in java/maven/gradle 
>> configurations so please give input!
>>
>> Some (mainly hadoop/kafka) I left blank, since I'm not sure - do we keep 
>> them provided because it depends on the user's version?
>>
>> [1] 
>> https://docs.google.com/spreadsheets/d/1UpeQtx1PoAgeSmpKxZC9lv3B9G1c7cryW3iICfRtG1o/edit?usp=sharing
>>
>> On Tue, Jan 11, 2022 at 1:17 PM Luke Cwik  wrote:
>>>
>>> I'm not following what you're trying to say Kenn since provided in maven 
>>> requires the user to explicitly add the dependency themselves to have it 
>>> part of their runtime.
>>>
>>> As per 
>>> https://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html#dependency-scope
>>> "
>>> * provided
>>> This is much like compile, but indicates you expect the JDK or a container 
>>> to provide the dependency at runtime. For example, when building a web 
>>> application for the Java Enterprise Edition, you would set the dependency 
>>> on the Servlet API and related Java EE APIs to scope provided because the 
>>> web container provides those classes. A dependency with this scope is added 
>>> to the classpath used for compilation and test, but not the runtime 
>>> classpath. It is not transitive."
>>>
>>> On Tue, Jan 11, 2022 at 11:54 AM Kenneth Knowles  wrote:

 To clarify: "provided" should have been in the test runtime configuration, 
 but not in the shipped runtime configuration (otherwise dep resolution for 
 users would pull in provided deps, which should not happen)

 On Thu, Dec 30, 2021 at 10:05 AM Luke Cwik  wrote:
>
> During the migration to Gradle 7[1] the propdeps plugin was removed[2] 
> since there wasn't a newer version that was compatible with Gradle 7 and 
> a replacement couldn't be found. All existing usages of "provided" were 
> moved to "compileOnly" and "compileOnly" is being mapped to the 
> "provided" maven scope in the generated pom files. This has lead to two 
> issues:
> 1) provided was also part of the runtime configuration, so we are getting 
> a few class not found exceptions when running tests [3]
> 2) the generated pom.xml will have a bunch of compile time only 
> annotations added as a provided dependency in the generated pom files[4]
>
> #1 can be fixed by adding the dependency to both the "compileOnly" and 
> "runtimeOnly" configurations or by adding dependency to the 
> "implementation" configuration
> #2 will make the pom files messier which can lead to confusion for users 
> but shouldn't impact existing uses.
>
> There was a suggestion[4] to completely remove the usage of provided from 
> the generated pom.xml and have all our previously "provided" dependencies 
> declared as "implementation" allowing us to solve both #1 and #2 above.
>
> The largest usage of "provided" in the past was to packages related to 
> the hadoop ecosystem and afterwards it was for packages such as 
> junit/hamcrest/aircompressor in sdks/java/core which aren't required to 
> use the module but can provide additional features if the dependency 
> exists.
>
> What should we migrate if anything to the "implementation" configuration 
> or should we try to recreate what we were doing with the "provided" 
> configuration in the past?
>
> 1: https://issues.apache.org/jira/browse/BEAM-13430
> 2: https://github.com/apache/beam/pull/16308
> 3: 

Re: [DISCUSS] propdeps removal and what to do going forward

2022-01-11 Thread Kenneth Knowles
To clarify: "provided" should have been in the test runtime configuration,
but not in the shipped runtime configuration (otherwise dep resolution for
users would pull in provided deps, which should not happen)

On Thu, Dec 30, 2021 at 10:05 AM Luke Cwik  wrote:

> During the migration to Gradle 7[1] the propdeps plugin was removed[2]
> since there wasn't a newer version that was compatible with Gradle 7 and a
> replacement couldn't be found. All existing usages of "provided" were moved
> to "compileOnly" and "compileOnly" is being mapped to the "provided" maven
> scope in the generated pom files. This has lead to two issues:
> 1) provided was also part of the runtime configuration, so we are getting
> a few class not found exceptions when running tests [3]
> 2) the generated pom.xml will have a bunch of compile time only
> annotations added as a provided dependency in the generated pom files[4]
>
> #1 can be fixed by adding the dependency to both the "compileOnly" and
> "runtimeOnly" configurations or by adding dependency to the
> "implementation" configuration
> #2 will make the pom files messier which can lead to confusion for users
> but shouldn't impact existing uses.
>
> There was a suggestion[4] to completely remove the usage of provided from
> the generated pom.xml and have all our previously "provided" dependencies
> declared as "implementation" allowing us to solve both #1 and #2 above.
>
> The largest usage of "provided" in the past was to packages related to the
> hadoop ecosystem and afterwards it was for packages such as
> junit/hamcrest/aircompressor in sdks/java/core which aren't required to use
> the module but can provide additional features if the dependency exists.
>
> What should we migrate if anything to the "implementation" configuration
> or should we try to recreate what we were doing with the "provided"
> configuration in the past?
>
> 1: https://issues.apache.org/jira/browse/BEAM-13430
> 2: https://github.com/apache/beam/pull/16308
> 3: https://issues.apache.org/jira/browse/BEAM-13569
> 4:
> https://github.com/apache/beam/blob/fe456b79419d1a67ebf13d7d4b6695fa1aa6204d/buildSrc/src/main/groovy/org/apache/beam/gradle/BeamModulePlugin.groovy#L964
> 5: https://issues.apache.org/jira/browse/BEAM-13504
>
>