I'm +0 on hbase-examples, but +1000000 on any improvements we can make to ltt/pe/chaos/minicluster/etc. It's extremely frustrating how heavily we rely on test jars, both in general and specifically for these core test executables. Unfortunately I haven't had time to dedicate to these frustrations myself, but I'm happy to help with review, etc.

On Tue, Mar 5, 2024 at 1:03 PM Nihal Jain <nihaljain...@gmail.com> wrote:

> Thank you for bringing this up.
>
> +1 for this change.
>
> In fact, some time back, we had faced similar problem. Security scans
> found that we were bundling some vulnerable hadoop test jar. To deal with
> that we had to make a change in our internal HBase fork to exclude all
> HBase and Hadoop test jars from assembly. This helped us get rid of
> vulnerable jar. (Although I hadn't dealt with test scope dependencies
> there.)
>
> But, I have been thinking of pushing this change in Apache HBase, just
> wasn't sure if this was even acceptable. It's great to see same has been
> brought up here today.
>
> We hadn't dealt with the ltt, pe etc. tools and wrote a script to download
> them on demand to avoid massive code change in internal fork. But I have a
> +1 on the idea of identifying and moving all such tools to a new module.
> This would be great and make things easier for us as well.
>
> Also, a way we could help new users easily get started, in case we
> completely stop bundling hadoop jars, is by providing a script which
> starts a hbase cluster in a single node setup. In fact I had written a
> simple script sometime back that automates this process given a release
> link for both. It first downloads Hadoop and HBase binaries and then
> starts both with the hbase root directory set to be on hdfs. We could
> provide something similar to help new users to get started easily.
>
> Although I am also +1 on the idea to provide both variants as mentioned by
> Nick, which might not even need any such script.
>
> Also, I am willing to volunteer for help towards this effort. Please let
> me know if anything is needed.
>
> Thanks,
> Nihal
>
>
> On Tue, 5 Mar 2024, 15:35 Nick Dimiduk, <ndimi...@apache.org> wrote:
>
> > This would be great cleanup, big +1 from me for all three of these
> > adjustments, including the promotion of pe, ltt, and friends out of the
> > test scope.
> >
> > I believe that we included hbase test jars because we used to freely mix
> > classes needed for minicluster between runtime and test jars, which in
> > turn relied on Hadoop minicluster capabilities. The big cleanup around
> > HBaseTestingUtil/it addressed much (or all) of these issues on branch-3.
> >
> > I believe that we include a Hadoop distribution in our assembly because
> > that makes it easy for a new user to download our release bin.tgz and
> > get started immediately with learning. I guess it’s high time that we
> > work out the with- and without-Hadoop variants.
> >
> > Thanks,
> > Nick
> >
> > On Tue, 5 Mar 2024 at 09:14, Istvan Toth <st...@apache.org> wrote:
> >
> > > DISCLAIMER: I don't have a patch ready, or even an elegant way mapped
> > > out to achieve this, this is about discussing whether we even want to
> > > make these changes.
> > > These are also substantial changes, but they could be targeted for
> > > HBase 3.0.
> > >
> > > One issue I have noticed is that we ship test jars and test
> > > dependencies in the assembly.
> > > I can't see anyone using those, but it bloats the assembly and
> > > classpath, and adds unnecessary JARs with possible CVE issues. (for
> > > example Kerby which is a Hadoop minicluster dependency)
> > >
> > > My proposal is to exclude the test jars and the test scope
> > > dependencies from the assembly.
> > >
> > > The advantages would be:
> > > * Smaller distro size
> > > * Faster startup (this is marginal)
> > > * Less CVE-prone JARs in the binary assemblies
> > >
> > > The other issue is that the assembly includes much of the Hadoop
> > > distribution.
> > > The basic assumption in all scripts and instructions is that the node
> > > has a fully configured Hadoop installation, and we include it in the
> > > classpath of HBase.
> > >
> > > If that is true, then there is no reason to include Hadoop in the
> > > assembly, HBase and its direct dependencies should be enough.
> > >
> > > One could argue that it would simplify the client side, which is true
> > > to some extent (though 95% of the client distro use cases are served
> > > better by simply using hbase-shaded-client).
> > >
> > > We could either remove the Hadoop libraries from either or both of the
> > > assemblies unconditionally, or provide two variants for either or both
> > > assemblies, one with Hadoop included, and one without it.
> > > Spark already does this, it has binary distributions both with and
> > > without Hadoop.
> > >
> > > The advantages would be:
> > > * Smaller distro size
> > > * Faster startup (this is marginal)
> > > * Less chance of conflicts with the Hadoop jars
> > > * Less CVE-prone JARs in the binary assemblies
> > >
> > >
> > > Thirdly, we could consider excluding the full-fat
> > > org.apache.hbase:hbase-shaded-client JAR from the Hadoop-less binary
> > > assemblies. It is not used by the assembly, and AFAIK it is not
> > > included in any of the 'hbase classpath' command variants.
> > >
> > > This would make sure that no Hadoop libraries are included (even in
> > > shaded form) and would make the HBase distribution fully insulated
> > > from Hadoop's CVE issues.
> > >
> > > (The full-fat hbase-shaded-client works best as direct build-time
> > > dependency anyway)
> > >
> > > best regards
> > > Istvan