We seem to be in agreement in principle; however, the devil is in the
details.

The first step should be moving the diagnostic tools out of the test jars.
Are there any tools we don't want to move out?
Do the diagnostic tools pull in extra dependencies compared to the current
runtime JARs, and if they do, what are they?
I haven't thought about the chaosmonkey tests yet. Do those have specific
additional dependencies or scripts?

Should we simply move the tools to the normal jars, or should we move them
to a new module (it could be called hbase-diagnostics)?

Istvan

On Tue, Mar 5, 2024 at 7:10 PM Bryan Beaudreault <bbeaudrea...@apache.org>
wrote:

> I'm +0 on hbase-examples, but +1000000 on any improvements we can make to
> ltt/pe/chaos/minicluster/etc. It's extremely frustrating how much reliance
> we have on test jars both generally but also specifically around these core
> test executables. Unfortunately I haven't had time to dedicate to these
> frustrations myself, but happy to help with review, etc.
>
> On Tue, Mar 5, 2024 at 1:03 PM Nihal Jain <nihaljain...@gmail.com> wrote:
>
> > Thank you for bringing this up.
> >
> > +1 for this change.
> >
> > In fact, some time back, we faced a similar problem. Security scans found
> > that we were bundling a vulnerable Hadoop test jar. To deal with that, we
> > had to make a change in our internal HBase fork to exclude all HBase and
> > Hadoop test jars from the assembly. This helped us get rid of the
> > vulnerable jar. (Although I hadn't dealt with test scope dependencies
> > there.)
> >
> > I have been thinking of pushing this change to Apache HBase, but just
> > wasn't sure if it was even acceptable. It's great to see the same has been
> > brought up here today.
> >
> > We hadn't dealt with the ltt, pe, etc. tools, and wrote a script to
> > download them on demand to avoid a massive code change in our internal
> > fork. But I am +1 on the idea of identifying and moving all such tools to
> > a new module. This would be great and make things easier for us as well.
> >
> > Also, one way we could help new users get started easily, in case we
> > completely stop bundling Hadoop jars, is by providing a script which
> > starts an HBase cluster in a single-node setup. In fact, I wrote a simple
> > script some time back that automates this process given a release link
> > for both. It first downloads the Hadoop and HBase binaries and then starts
> > both with the HBase root directory set to be on HDFS. We could provide
> > something similar to help new users get started easily.
> >
> > Although I am also +1 on the idea to provide both variants as mentioned
> > by Nick, which might not even need any such script.
> >
> > Also, I am willing to volunteer to help with this effort. Please let me
> > know if anything is needed.
> >
> > Thanks,
> > Nihal
> >
> >
> > On Tue, 5 Mar 2024, 15:35 Nick Dimiduk, <ndimi...@apache.org> wrote:
> >
> > > This would be great cleanup, big +1 from me for all three of these
> > > adjustments, including the promotion of pe, ltt, and friends out of the
> > > test scope.
> > >
> > > I believe that we included hbase test jars because we used to freely
> > > mix classes needed for minicluster between runtime and test jars, which
> > > in turn relied on Hadoop minicluster capabilities. The big cleanup around
> > > HBaseTestingUtil/it addressed much (or all) of these issues on branch-3.
> > >
> > > I believe that we include a Hadoop distribution in our assembly because
> > > that makes it easy for a new user to download our release bin.tgz and
> > > get started immediately with learning. I guess it’s high time that we
> > > work out the with- and without-Hadoop variants.
> > >
> > > Thanks,
> > > Nick
> > >
> > > On Tue, 5 Mar 2024 at 09:14, Istvan Toth <st...@apache.org> wrote:
> > >
> > > > DISCLAIMER: I don't have a patch ready, or even an elegant way mapped
> > > > out to achieve this; this is about discussing whether we even want to
> > > > make these changes.
> > > > These are also substantial changes, but they could be targeted for
> > > > HBase 3.0.
> > > >
> > > > One issue I have noticed is that we ship test jars and test
> > > > dependencies in the assembly.
> > > > I can't see anyone using those, but it bloats the assembly and
> > > > classpath, and adds unnecessary JARs with possible CVE issues (for
> > > > example Kerby, which is a Hadoop minicluster dependency).
> > > >
> > > > My proposal is to exclude the test jars and the test scope
> > > > dependencies from the assembly.
> > > >
> > > > The advantages would be:
> > > > * Smaller distro size
> > > > * Faster startup (this is marginal)
> > > > * Fewer CVE-prone JARs in the binary assemblies
> > > >
> > > > The other issue is that the assembly includes much of the Hadoop
> > > > distribution.
> > > > The basic assumption in all scripts and instructions is that the node
> > > > has a fully configured Hadoop installation, and we include it in the
> > > > classpath of HBase.
> > > >
> > > > If that is true, then there is no reason to include Hadoop in the
> > > > assembly; HBase and its direct dependencies should be enough.
> > > >
> > > > One could argue that it would simplify the client side, which is true
> > > > to some extent (though 95% of the client distro use cases are served
> > > > better by simply using hbase-shaded-client).
> > > >
> > > > We could either remove the Hadoop libraries from either or both of
> > > > the assemblies unconditionally, or provide two variants for either or
> > > > both assemblies, one with Hadoop included and one without it.
> > > > Spark already does this; it has binary distributions both with and
> > > > without Hadoop.
> > > >
> > > > The advantages would be:
> > > > * Smaller distro size
> > > > * Faster startup (this is marginal)
> > > > * Less chance of conflicts with the Hadoop jars
> > > > * Fewer CVE-prone JARs in the binary assemblies
> > > >
> > > >
> > > > Thirdly, we could consider excluding the full-fat
> > > > org.apache.hbase:hbase-shaded-client JAR from the Hadoop-less binary
> > > > assemblies. It is not used by the assembly, and AFAIK it is not
> > > > included in any of the 'hbase classpath' command variants.
> > > >
> > > > This would make sure that no Hadoop libraries are included (even in
> > > > shaded form) and would make the HBase distribution fully insulated
> > > > from Hadoop's CVE issues.
> > > >
> > > > (The full-fat hbase-shaded-client works best as a direct build-time
> > > > dependency anyway)
> > > >
> > > > best regards
> > > > Istvan
> > > >
> > >
> >
>


-- 
*István Tóth* | Sr. Staff Software Engineer
*Email*: st...@cloudera.com
cloudera.com <https://www.cloudera.com>