I'm +0 on hbase-examples, but +1000000 on any improvements we can make to
ltt/pe/chaos/minicluster/etc. It's extremely frustrating how much reliance
we have on test jars, both generally and specifically around these core
test executables. Unfortunately I haven't had time to dedicate to these
frustrations myself, but I'm happy to help with review, etc.

On Tue, Mar 5, 2024 at 1:03 PM Nihal Jain <nihaljain...@gmail.com> wrote:

> Thank you for bringing this up.
>
> +1 for this change.
>
> In fact, some time back, we faced a similar problem. Security scans found
> that we were bundling a vulnerable Hadoop test jar. To deal with that, we
> had to make a change in our internal HBase fork to exclude all HBase and
> Hadoop test jars from the assembly. This helped us get rid of the vulnerable
> jar. (Although I hadn't dealt with test scope dependencies there.)
>
> But, I have been thinking of pushing this change to Apache HBase; I just
> wasn't sure if this was even acceptable. It's great to see the same has
> been brought up here today.
>
> We hadn't dealt with the ltt, pe, etc. tools and instead wrote a script to
> download them on demand, to avoid a massive code change in our internal
> fork. But I am +1 on the idea of identifying and moving all such tools to
> a new module. This would be great and make things easier for us as well.
>
> Also, a way we could help new users easily get started, in case we
> completely stop bundling Hadoop jars, is by providing a script which starts
> an HBase cluster in a single node setup. In fact, I had written a simple
> script some time back that automates this process given a release link for
> both. It first downloads the Hadoop and HBase binaries and then starts both
> with the HBase root directory set to be on HDFS. We could provide something
> similar to help new users get started easily.
>
> Although I am also +1 on the idea to provide both variants as mentioned by
> Nick, which might not even need any such script.
>
> Also, I am willing to volunteer to help with this effort. Please let me
> know if anything is needed.
>
> Thanks,
> Nihal
>
>
> On Tue, 5 Mar 2024, 15:35 Nick Dimiduk, <ndimi...@apache.org> wrote:
>
> > This would be a great cleanup, big +1 from me for all three of these
> > adjustments, including the promotion of pe, ltt, and friends out of the
> > test scope.
> >
> > I believe that we included hbase test jars because we used to freely mix
> > classes needed for minicluster between runtime and test jars, which in
> > turn relied on Hadoop minicluster capabilities. The big cleanup around
> > HBaseTestingUtil/it addressed much (or all) of these issues on branch-3.
> >
> > I believe that we include a Hadoop distribution in our assembly because
> > that makes it easy for a new user to download our release bin.tgz and get
> > started immediately with learning. I guess it’s high time that we work
> > out the with- and without-Hadoop variants.
> >
> > Thanks,
> > Nick
> >
> > On Tue, 5 Mar 2024 at 09:14, Istvan Toth <st...@apache.org> wrote:
> >
> > > DISCLAIMER: I don't have a patch ready, or even an elegant way mapped
> > > out to achieve this; this is about discussing whether we even want to
> > > make these changes.
> > > These are also substantial changes, but they could be targeted for
> > > HBase 3.0.
> > >
> > > One issue I have noticed is that we ship test jars and test
> > > dependencies in the assembly.
> > > I can't see anyone using those, but it bloats the assembly and
> > > classpath, and adds unnecessary JARs with possible CVE issues (for
> > > example Kerby, which is a Hadoop minicluster dependency).
> > >
> > > My proposal is to exclude the test jars and the test scope dependencies
> > > from the assembly.
> > >
> > > The advantages would be:
> > > * Smaller distro size
> > > * Faster startup (this is marginal)
> > > * Fewer CVE-prone JARs in the binary assemblies
> > >
> > > The other issue is that the assembly includes much of the Hadoop
> > > distribution.
> > > The basic assumption in all scripts and instructions is that the node
> > > has a fully configured Hadoop installation, and we include it in the
> > > classpath of HBase.
> > >
> > > If that is true, then there is no reason to include Hadoop in the
> > > assembly; HBase and its direct dependencies should be enough.
> > >
> > > One could argue that it would simplify the client side, which is true
> > > to some extent (though 95% of the client distro use cases are served
> > > better by simply using hbase-shaded-client).
> > >
> > > We could either remove the Hadoop libraries from either or both of the
> > > assemblies unconditionally, or provide two variants for either or both
> > > assemblies, one with Hadoop included, and one without it.
> > > Spark already does this; it has binary distributions both with and
> > > without Hadoop.
> > >
> > > The advantages would be:
> > > * Smaller distro size
> > > * Faster startup (this is marginal)
> > > * Less chance of conflicts with the Hadoop jars
> > > * Fewer CVE-prone JARs in the binary assemblies
> > >
> > >
> > > Thirdly, we could consider excluding the
> > > full-fat org.apache.hbase:hbase-shaded-client JAR from the Hadoop-less
> > > binary assemblies. It is not used by the assembly, and AFAIK it is not
> > > included in any of the 'hbase classpath' command variants.
> > >
> > > This would make sure that no Hadoop libraries are included (even in
> > > shaded form) and would make the HBase distribution fully insulated from
> > > Hadoop's CVE issues.
> > >
> > > (The full-fat hbase-shaded-client works best as a direct build-time
> > > dependency anyway.)
> > >
> > > best regards
> > > Istvan
> > >
> >
>
