Indeed, the assumption would be that Hadoop does not expose its transitive
libraries on its public API surface.

From vague memory, I think that's pretty much true so far. I only remember
Kinesis and Calcite as counterexamples, which exposed Guava classes as part
of their public API.
But that is definitely the "weak spot" of this approach. Plus, as with all
custom class loaders, there is the fact that the Thread Context ClassLoader
no longer really works well.
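
For illustration, a very rough sketch of what such a prefix-based fallback (as
described further below) could look like. The class and constructor names here
are made up, not actual Flink APIs, and the real logic would of course still
have to respect the parent-first patterns:

    import java.net.URL;
    import java.net.URLClassLoader;

    // Hypothetical sketch: a user-code classloader that falls back to a dedicated
    // Hadoop classloader for classes under the "org.apache.hadoop." prefix.
    public class HadoopAwareUserCodeClassLoader extends URLClassLoader {

        private static final String HADOOP_PREFIX = "org.apache.hadoop.";

        // Separate loader that holds the jars from the HADOOP_CLASSPATH.
        private final ClassLoader hadoopClassLoader;

        public HadoopAwareUserCodeClassLoader(
                URL[] userJars, ClassLoader parent, ClassLoader hadoopClassLoader) {
            super(userJars, parent);
            this.hadoopClassLoader = hadoopClassLoader;
        }

        @Override
        protected Class<?> findClass(String name) throws ClassNotFoundException {
            try {
                // First look into the user-code jars.
                return super.findClass(name);
            } catch (ClassNotFoundException e) {
                // Fall back to the Hadoop classpath for org.apache.hadoop.* classes.
                if (name.startsWith(HADOOP_PREFIX)) {
                    return hadoopClassLoader.loadClass(name);
                }
                throw e;
            }
        }
    }

The Thread Context ClassLoader issue mentioned above would remain, though: code
that only consults the TCCL would not see the Hadoop classes unless the TCCL is
set to such a loader.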

On Thu, Apr 23, 2020 at 11:50 AM Chesnay Schepler <ches...@apache.org>
wrote:

> This would only work so long as all Hadoop APIs do not directly expose
> any transitive non-hadoop dependency.
> Otherwise the user code classloader might search for this transitive
> dependency in lib instead of the hadoop classpath (and possibly not find
> it).
>
> On 23/04/2020 11:34, Stephan Ewen wrote:
> > True, connectors built on Hadoop make this a bit more complex. That is also
> > the reason why Hadoop is on the "parent first" patterns.
> >
> > Maybe this is a bit of a wild thought, but what would happen if we had a
> > "first class" notion of a Hadoop Classloader in the system, and the user
> > code classloader would explicitly fall back to that one whenever a class
> > whose name starts with "org.apache.hadoop" is not found? We could also
> > generalize this by associating plugin loaders with class name prefixes.
> >
> > Then it would try to load from the user code jar, and if the class was not
> > found, load it from the hadoop classpath.
> >
> > On Thu, Apr 23, 2020 at 10:56 AM Chesnay Schepler <ches...@apache.org>
> > wrote:
> >
> >> although, if you can load the HADOOP_CLASSPATH as a plugin, then you can
> >> also load it in the user-code classloader.
> >>
> >> On 23/04/2020 10:50, Chesnay Schepler wrote:
> >>> @Stephan I'm not aware of anyone having tried that; possibly since we
> >>> have various connectors that require hadoop (hadoop-compat, hive,
> >>> orc/parquet/hbase, hadoop inputformats). This would require connectors
> >>> to be loaded as plugins (or having access to the plugin classloader)
> >>> to be feasible.
> >>>
> >>> On 23/04/2020 09:59, Stephan Ewen wrote:
> >>>> Hi all!
> >>>>
> >>>> +1 for the simplification of dropping hadoop-shaded
> >>>>
> >>>>
> >>>> Have we ever investigated how much work it would be to load the
> >>>> HADOOP_CLASSPATH through the plugin loader? Then Hadoop's crazy
> >>>> dependency
> >>>> footprint would not spoil the main classpath.
> >>>>
> >>>>     - HDFS might be very simple, because file systems are already
> >>>> Plugin aware
> >>>>     - Yarn would need some extra work. In essence, we would need to
> >>>> discover
> >>>> executors also through plugins
> >>>>     - Kerberos is the other remaining bit. We would need to switch
> >>>> security
> >>>> modules to ServiceLoaders (which we should do anyways) and also pull
> >>>> them
> >>>> from plugins.
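
As a rough illustration of the ServiceLoader part, a hypothetical sketch; the
SecurityModuleFactory interface shown here is made up for the example and not
necessarily what Flink's actual security SPI would look like:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.ServiceLoader;

    public final class SecurityModuleDiscovery {

        /** Hypothetical SPI; implementations register via META-INF/services. */
        public interface SecurityModuleFactory {
            Runnable createModule(); // stand-in for a real SecurityModule type
        }

        /** Discovers all factories visible to the given (plugin) classloader. */
        public static List<Runnable> discoverModules(ClassLoader pluginClassLoader) {
            List<Runnable> modules = new ArrayList<>();
            for (SecurityModuleFactory factory :
                    ServiceLoader.load(SecurityModuleFactory.class, pluginClassLoader)) {
                modules.add(factory.createModule());
            }
            return modules;
        }
    }

A Kerberos module shipped as a plugin would then simply be picked up by passing
the plugin classloader to the discovery call.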
> >>>>
> >>>> Best,
> >>>> Stephan
> >>>>
> >>>>
> >>>>
> >>>> On Thu, Apr 23, 2020 at 4:05 AM Xintong Song <tonysong...@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> +1 for supporting Hadoop 3.
> >>>>>
> >>>>> I'm not familiar with the shading efforts, thus no comment on
> >>>>> dropping the
> >>>>> flink-shaded-hadoop.
> >>>>>
> >>>>>
> >>>>> Correct me if I'm wrong. Although the default Hadoop version for
> >>>>> compiling in Flink is currently 2.4.1, I think this does not mean
> >>>>> Flink should support only Hadoop 2.4+. So no matter which Hadoop
> >>>>> version we use for compiling by default, we need to use reflection
> >>>>> for the Hadoop features/APIs that are not supported in all versions
> >>>>> anyway.
> >>>>>
> >>>>>
> >>>>> There are already many such reflections in `YarnClusterDescriptor` and
> >>>>> `YarnResourceManager`, and there might be more in the future. I'm
> >>>>> wondering whether we should have a unified mechanism (an interface /
> >>>>> abstract class or so) that handles all these kinds of Hadoop API
> >>>>> reflections in one place. Not necessarily in the scope of this
> >>>>> discussion though.
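
To make that a bit more concrete, a minimal sketch of the kind of helper such a
unified mechanism could provide. The helper class itself is hypothetical;
setNodeLabelExpression is just one example of a YARN method that only exists on
newer Hadoop versions:

    import java.lang.reflect.Method;

    // Hypothetical helper: call a Hadoop/YARN method that only exists in newer
    // versions, and degrade gracefully when running against an older Hadoop.
    public final class HadoopReflection {

        /** Invokes target.methodName(arg) if the method exists; returns false otherwise. */
        public static boolean invokeIfPresent(Object target, String methodName, String arg) {
            try {
                Method method = target.getClass().getMethod(methodName, String.class);
                method.invoke(target, arg);
                return true;
            } catch (NoSuchMethodException e) {
                return false; // older Hadoop version, feature not available
            } catch (ReflectiveOperationException e) {
                throw new RuntimeException("Failed to invoke " + methodName, e);
            }
        }
    }

    // e.g. HadoopReflection.invokeIfPresent(appContext, "setNodeLabelExpression", label);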
> >>>>>
> >>>>>
> >>>>> Thank you~
> >>>>>
> >>>>> Xintong Song
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Wed, Apr 22, 2020 at 8:32 PM Chesnay Schepler <ches...@apache.org>
> >>>>> wrote:
> >>>>>
> >>>>>> 1) Likely not, as this again introduces a hard-dependency on
> >>>>>> flink-shaded-hadoop.
> >>>>>> 2) Indeed; this will be something the user/cloud providers have to
> >>>>>> deal
> >>>>>> with now.
> >>>>>> 3) Yes.
> >>>>>>
> >>>>>> As a small note, we can still keep the hadoop-2 version of
> >>>>>> flink-shaded
> >>>>>> around for existing users.
> >>>>>> What I suggested was to just not release hadoop-3 versions.
> >>>>>>
> >>>>>> On 22/04/2020 14:19, Yang Wang wrote:
> >>>>>>> Thanks Robert for starting this significant discussion.
> >>>>>>>
> >>>>>>> Since Hadoop 3 has been released for a long time and many companies
> >>>>>>> have already put it into production, Flink can already run on YARN 3
> >>>>>>> today (not sure about HDFS), whether you are using flink-shaded-hadoop2
> >>>>>>> or not, since the YARN API is always backward compatible. The
> >>>>>>> difference is that we cannot benefit from the new features because we
> >>>>>>> are using hadoop-2.4 as the compile dependency, so we need to use
> >>>>>>> reflection for new features (node labels, tags, etc.).
> >>>>>>>
> >>>>>>> All in all, I am in favour of dropping flink-shaded-hadoop. I just have
> >>>>>>> some questions.
> >>>>>>> 1. Do we still support the "-include-hadoop" profile? If yes, what will
> >>>>>>> we get in the lib dir?
> >>>>>>> 2. I am not sure whether dropping flink-shaded-hadoop will cause some
> >>>>>>> class conflict problems. If we use "export HADOOP_CLASSPATH=`hadoop
> >>>>>>> classpath`" for the Hadoop env setup, then many jars will be appended
> >>>>>>> to the Flink client classpath.
> >>>>>>> 3. The compile Hadoop version is still 2.4.1. Right?
> >>>>>>>
> >>>>>>>
> >>>>>>> Best,
> >>>>>>> Yang
> >>>>>>>
> >>>>>>>
> >>>>>>> Sivaprasanna <sivaprasanna...@gmail.com> wrote on Wed, Apr 22, 2020
> >>>>>>> at 4:18 PM:
> >>>>>>>
> >>>>>>>> I agree with Aljoscha. Otherwise I can see a lot of tickets getting
> >>>>>>>> created saying the application is not running on YARN.
> >>>>>>>>
> >>>>>>>> Cheers,
> >>>>>>>> Sivaprasanna
> >>>>>>>>
> >>>>>>>> On Wed, Apr 22, 2020 at 1:00 PM Aljoscha Krettek
> >>>>>>>> <aljos...@apache.org> wrote:
> >>>>>>>>
> >>>>>>>>> +1 to getting rid of flink-shaded-hadoop. But we need to document
> >>>>>>>>> how people can now get a Flink dist that works with Hadoop.
> >>>>>>>>> Currently, when you download the single shaded jar you immediately
> >>>>>>>>> get support for submitting to YARN via bin/flink run.
> >>>>>>>>>
> >>>>>>>>> Aljoscha
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On 22.04.20 09:08, Till Rohrmann wrote:
> >>>>>>>>>> Hi Robert,
> >>>>>>>>>>
> >>>>>>>>>> I think it would be a helpful simplification of Flink's build setup
> >>>>>>>>>> if we can get rid of flink-shaded-hadoop. Moreover, relying only on
> >>>>>>>>>> the vanilla Hadoop dependencies for the modules which interact with
> >>>>>>>>>> Hadoop/Yarn sounds like a good idea to me.
> >>>>>>>>>>
> >>>>>>>>>> Adding support for Hadoop 3 would also be nice. I'm not sure,
> >>>>>>>>>> though, how Hadoop's APIs have changed between 2 and 3. It might be
> >>>>>>>>>> necessary to introduce some bridges in order to make it work.
> >>>>>>>>>>
> >>>>>>>>>> Cheers,
> >>>>>>>>>> Till
> >>>>>>>>>>
> >>>>>>>>>> On Tue, Apr 21, 2020 at 4:37 PM Robert Metzger
> >>>>>>>>>> <rmetz...@apache.org> wrote:
> >>>>>>>>>>> Hi all,
> >>>>>>>>>>>
> >>>>>>>>>>> for the upcoming 1.11 release, I started looking into adding
> >>>>>>>>>>> support for Hadoop 3 [1] for Flink. I have explored a little bit
> >>>>>>>>>>> already into adding a shaded Hadoop 3 into “flink-shaded”, and some
> >>>>>>>>>>> mechanisms for switching between Hadoop 2 and 3 dependencies in the
> >>>>>>>>>>> Flink build.
> >>>>>>>>>>>
> >>>>>>>>>>> However, Chesnay made me aware that we could also go a different
> >>>>>>>>>>> route: We let Flink depend on vanilla Hadoop dependencies and stop
> >>>>>>>>>>> providing shaded fat jars for Hadoop through “flink-shaded”.
> >>>>>>>>>>>
> >>>>>>>>>>> Why?
> >>>>>>>>>>> - Maintaining properly shaded Hadoop fat jars is a lot of work
> >>>>>>>>>>> (we have insufficient test coverage for all kinds of Hadoop features)
> >>>>>>>>>>> - For Hadoop 2, there are already some known and unresolved issues
> >>>>>>>>>>> with our shaded jars that we didn’t manage to fix
> >>>>>>>>>>>
> >>>>>>>>>>> Users will have to use Flink with Hadoop by relying on vanilla or
> >>>>>>>>>>> vendor-provided Hadoop dependencies.
> >>>>>>>>>>>
> >>>>>>>>>>> What do you think?
> >>>>>>>>>>>
> >>>>>>>>>>> Best,
> >>>>>>>>>>> Robert
> >>>>>>>>>>>
> >>>>>>>>>>> [1] https://issues.apache.org/jira/browse/FLINK-11086
> >>>>>>>>>>>
> >>>
> >>
>
>
