Re: [DISCUSS] Removing tests and/or Hadoop from the binary assemblies

2024-03-08 Thread Nihal Jain
I have created sub-tasks with the necessary details under the umbrella Jira.
I will take them up in the coming days, and will add more sub-tasks later if
needed.

Regards
Nihal

Re: [DISCUSS] Removing tests and/or Hadoop from the binary assemblies

2024-03-08 Thread Istvan Toth
Thank you Nihal.
I'm not very familiar with the tools in the test code, so you can probably
plan that work better.
I just have some generic steps in mind:
* Identify all the tools / scripts in the test jars
* Identify and analyze their dependencies (compared to the current runtime
deps)
* Decide which ones to move to the runtime JARs.
* Move them to the runtime code (or perhaps a separate module)
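
As a rough illustration of the first step, here is a minimal Python sketch
that lists the test jars in an unpacked assembly together with the classes
inside them. It assumes the Maven convention that test jars end in
"-tests.jar"; the function names and the directory passed in are placeholders
for illustration, not an existing HBase script.

```python
import zipfile
from pathlib import Path


def is_test_jar(name: str) -> bool:
    """Assumption: test jars follow Maven's '<artifact>-<version>-tests.jar' naming."""
    return name.endswith("-tests.jar")


def classes_in_jar(jar_path: Path) -> list[str]:
    """Return dotted names for the .class entries in a jar (inner classes skipped)."""
    with zipfile.ZipFile(jar_path) as jar:
        return sorted(
            entry[: -len(".class")].replace("/", ".")
            for entry in jar.namelist()
            if entry.endswith(".class") and "$" not in entry
        )


def inventory_test_jars(lib_dir: Path) -> dict[str, list[str]]:
    """Map each test jar under lib_dir (searched recursively) to the classes it contains."""
    return {
        jar.name: classes_in_jar(jar)
        for jar in sorted(lib_dir.rglob("*.jar"))
        if is_test_jar(jar.name)
    }
```

Running inventory_test_jars on the lib directory of an unpacked bin tarball
would only give a starting list; deciding which of those classes are really
tools still needs a manual review.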

I have created https://issues.apache.org/jira/browse/HBASE-28431 as an
umbrella ticket to organize the sub-tasks.

Istvan

Re: [DISCUSS] Removing tests and/or Hadoop from the binary assemblies

2024-03-08 Thread Nihal Jain
Sure, I will be able to take this up. Please create tasks with the necessary
details, or let me know if you want me to create them.

Re: [DISCUSS] Removing tests and/or Hadoop from the binary assemblies

2024-03-07 Thread Istvan Toth
Thanks for volunteering, Nihal.

I could work on the Hadoop-less variant and the assemblies, and you could
work on cleaning up the test jars.
Would that work for you?
I know that I'm picking the smaller part, but it turns out that I won't
have as much time to work on this as I hoped.

(Unless there are other volunteers, of course)

Istvan

Re: [DISCUSS] Removing tests and/or Hadoop from the binary assemblies

2024-03-06 Thread Istvan Toth
We seem to be in agreement in principle; however, the devil is in the
details.

The first step should be moving the diagnostic tools out of the test jars.
Are there any tools we don't want to move out?
Do the diagnostic tools pull in extra dependencies compared to the current
runtime JARs, and if they do, what are those?
I haven't thought about the chaosmonkey tests yet. Do those have specific
additional dependencies / scripts?
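
One hedged way to approach the dependency question is to diff two saved
dependency listings. The Python sketch below assumes the listings are text in
the usual line shape printed by `mvn dependency:list`; the function names are
made up for illustration, and which modules to list is left open.

```python
import re

# Assumption: dependency lines look like
# "[INFO]    group:artifact:type[:classifier]:version:scope",
# the usual shape of `mvn dependency:list` output.
COORD = re.compile(
    r"^\[INFO\]\s+([\w.\-]+):([\w.\-]+):([\w.\-]+):(.+?):(\w+)(?:\s.*)?$"
)


def parse_dependencies(listing: str) -> set[str]:
    """Extract 'group:artifact' keys from dependency listing text."""
    deps = set()
    for line in listing.splitlines():
        match = COORD.match(line.strip())
        if match:
            deps.add(f"{match.group(1)}:{match.group(2)}")
    return deps


def extra_dependencies(tools_listing: str, runtime_listing: str) -> list[str]:
    """Dependencies in the tools listing that the runtime listing lacks."""
    return sorted(
        parse_dependencies(tools_listing) - parse_dependencies(runtime_listing)
    )
```

The comparison deliberately ignores versions and scopes, so it only flags
artifacts that are entirely missing from the runtime side.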

Should we simply move the tools to the normal jars, or should we move them
to a new module (it could be called hbase-diagnostics)?

Istvan

Re: [DISCUSS] Removing tests and/or Hadoop from the binary assemblies

2024-03-05 Thread Bryan Beaudreault
I'm +0 on hbase-examples, but +100 on any improvements we can make to
ltt/pe/chaos/minicluster/etc. It's extremely frustrating how much reliance
we have on test jars both generally but also specifically around these core
test executables. Unfortunately I haven't had time to dedicate to these
frustrations myself, but happy to help with review, etc.

> > On Tue, 5 Mar 2024 at 09:14, Istvan Toth wrote:
> >
> > > DISCLAIMER: I don't have a patch ready, or even an elegant way mapped out
> > > to achieve this, this is about discussing whether we even want to make
> > > these changes.
> > > These are also substantial changes, but they could be targeted for HBase
> > > 3.0.
> > >
> > > One issue I have noticed is that we ship test jars and test dependencies
> > > in the assembly.
> > > I can't see anyone using those, but it bloats the assembly and classpath,
> > > and adds unnecessary JARs with possible CVE issues. (for example Kerby
> > > which is a Hadoop minicluster dependency)
> > >
> > > My proposal is to exclude the test jars and the test scope dependencies
> > > from the assembly.
> > >
> > > The advantages would be:
> > > * Smaller distro size
> > > * Faster startup (this is marginal)
> > > * Less CVE-prone JARs in the binary assemblies
> > >
> > > The other issue is that the assembly includes much of the Hadoop
> > > distribution.
> > > The basic assumption in all scripts and instructions is that the node has
> > > a fully configured Hadoop installation, and we include it in the classpath
> > > of HBase.
> > >
> > > If that is true, then there is no reason to include Hadoop in the
> > > assembly, HBase and its direct dependencies should be enough.
> > >
> > > One could argue that it would simplify the client side, which is true to
> > > some extent (though 95% of the client distro use cases are served better
> > > by simply using hbase-shaded-client).
> > >
> > > We could either remove the Hadoop libraries from either or both of the
> > > assemblies unconditionally, or provide two variants for either or both
> > > assemblies, one with Hadoop included, and one without it.
> > > Spark already does this, it has binary distributions both with and
> > > without Hadoop.
> > >
> > > The advantages would be:
> > > * Smaller distro size
> > > * Faster startup (this is marginal)
> > > * Less chance of conflicts with the Hadoop jars
> > > * Less CVE-prone JARs in the

Re: [DISCUSS] Removing tests and/or Hadoop from the binary assemblies

2024-03-05 Thread Nihal Jain
Thank you for bringing this up.

+1 for this change.

In fact, some time back, we faced a similar problem. Security scans found
that we were bundling a vulnerable Hadoop test jar. To deal with that, we
had to make a change in our internal HBase fork to exclude all HBase and
Hadoop test jars from the assembly. This helped us get rid of the vulnerable
jar. (Although I hadn't dealt with test scope dependencies there.)

I have been thinking of pushing this change to Apache HBase, but wasn't sure
if it would even be acceptable. It's great to see the same has been brought
up here today.

We hadn't dealt with the ltt, pe, etc. tools, and wrote a script to download
them on demand to avoid a massive code change in our internal fork. But I am
+1 on the idea of identifying and moving all such tools to a new module.
This would be great and would make things easier for us as well.

Also, a way we could help new users easily get started, in case we
completely stop bundling Hadoop jars, is by providing a script which starts
an HBase cluster in a single-node setup. In fact, I had written a simple
script some time back that automates this process given a release link for
both. It first downloads the Hadoop and HBase binaries and then starts both,
with the HBase root directory set to be on HDFS. We could provide something
similar to help new users get started easily.
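
To make the last step concrete, here is a minimal Python sketch (not the
script mentioned above) of generating an hbase-site.xml that points the root
directory at HDFS. The property names hbase.rootdir and
hbase.cluster.distributed are standard HBase settings; the function name and
the HDFS address passed to it are placeholders, and setting the distributed
flag assumes a pseudo-distributed single-node setup.

```python
import xml.etree.ElementTree as ET


def build_hbase_site(hdfs_uri: str) -> str:
    """Render a minimal hbase-site.xml that puts the HBase root directory on HDFS."""
    properties = {
        # The values are placeholders chosen by the caller, not recommended defaults.
        "hbase.rootdir": f"{hdfs_uri.rstrip('/')}/hbase",
        # Assumption: pseudo-distributed mode, with HBase daemons in separate JVMs.
        "hbase.cluster.distributed": "true",
    }
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    ET.indent(root)  # pretty-print; available from Python 3.9
    return ET.tostring(root, encoding="unicode")
```

A wrapper script could write the result of build_hbase_site, called with
whatever address the NameNode listens on, into the conf directory before
starting HBase; downloading and starting the two distributions is left out
here.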

Although I am also +1 on the idea to provide both variants as mentioned by
Nick, which might not even need any such script.

I am also willing to volunteer to help with this effort. Please let me
know if anything is needed.

Thanks,
Nihal


On Tue, 5 Mar 2024, 15:35 Nick Dimiduk wrote:

> This would be great cleanup, big +1 from me for all three of these
> adjustments, including the promotion of pe, ltt, and friends out of the
> test scope.
>
> I believe that we included hbase test jars because we used to freely mix
> classes needed for minicluster between runtime and test jars, which in turn
> relied on Hadoop minicluster capabilities. The big cleanup around
> HBaseTestingUtil/it addressed much (or all) of these issues on branch-3.
>
> I believe that we include a Hadoop distribution in our assembly because
> that makes it easy for a new user to download our release bin.tgz and get
> started immediately with learning. I guess it’s high time that we work out
> the with- and without-Hadoop variants.
>
> Thanks,
> Nick
>
> On Tue, 5 Mar 2024 at 09:14, Istvan Toth  wrote:
>
> > DISCLAIMER: I don't have a patch ready, or even an elegant way mapped out
> > to achieve this, this is about discussing whether we even want to make
> > these changes.
> > These are also substantial changes, but they could be targeted for HBase
> > 3.0.
> >
> > One issue I have noticed is that we ship test jars and test dependencies
> > in the assembly.
> > I can't see anyone using those, but it bloats the assembly and classpath,
> > and adds unnecessary JARs with possible CVE issues. (for example Kerby
> > which is a Hadoop minicluster dependency)
> >
> > My proposal is to exclude the test jars and the test scope dependencies
> > from the assembly.
> >
> > The advantages would be:
> > * Smaller distro size
> > * Faster startup (this is marginal)
> > * Less CVE-prone JARs in the binary assemblies
> >
> > The other issue is that the assembly includes much of the Hadoop
> > distribution.
> > The basic assumption in all scripts and instructions is that the node
> > has a fully configured Hadoop installation, and we include it in the
> > classpath of HBase.
> >
> > If that is true, then there is no reason to include Hadoop in the
> > assembly, HBase and its direct dependencies should be enough.
> >
> > One could argue that it would simplify the client side, which is true to
> > some extent (though 95% of the client distro use cases are served better
> > by simply using hbase-shaded-client).
> >
> > We could either remove the Hadoop libraries from either or both of the
> > assemblies unconditionally, or provide two variants for either or both
> > assemblies, one with Hadoop included, and one without it.
> > Spark already does this, it has binary distributions both with and
> > without Hadoop.
> >
> > The advantages would be:
> > * Smaller distro size
> > * Faster startup (this is marginal)
> > * Less chance of conflicts with the Hadoop jars
> > * Less CVE-prone JARs in the binary assemblies
> >
> >
> > Thirdly, we could consider excluding the
> > full-fat org.apache.hbase:hbase-shaded-client JAR from the Hadoop-less
> > binary assemblies. It is not used by the assembly, and AFAIK it is not
> > included in any of the 'hbase classpath' command variants.
> >
> > This would make sure that no Hadoop libraries are included (even in
> > shaded form) and would make the HBase distribution fully insulated from
> > Hadoop's CVE issues.
> >
> > (The full-fat hbase-shaded-client works best as direct build-time
> > dependency anyway)
> >
> > best regards
> > Istvan
> >
>


Re: [DISCUSS] Removing tests and/or Hadoop from the binary assemblies

2024-03-05 Thread Nick Dimiduk
This would be great cleanup, big +1 from me for all three of these
adjustments, including the promotion of pe, ltt, and friends out of the
test scope.

I believe that we included hbase test jars because we used to freely mix
classes needed for minicluster between runtime and test jars, which in turn
relied on Hadoop minicluster capabilities. The big cleanup around
HBaseTestingUtil/it addressed much (or all) of these issues on branch-3.

I believe that we include a Hadoop distribution in our assembly because
that makes it easy for a new user to download our release bin.tgz and get
started immediately with learning. I guess it’s high time that we work out
the with- and without-Hadoop variants.

Thanks,
Nick

On Tue, 5 Mar 2024 at 09:14, Istvan Toth  wrote:

> DISCLAIMER: I don't have a patch ready, or even an elegant way mapped out
> to achieve this, this is about discussing whether we even want to make
> these changes.
> These are also substantial changes, but they could be targeted for HBase
> 3.0.
>
> One issue I have noticed is that we ship test jars and test dependencies in
> the assembly.
> I can't see anyone using those, but it bloats the assembly and classpath,
> and adds unnecessary JARs with possible CVE issues. (for example Kerby
> which is a Hadoop minicluster dependency)
>
> My proposal is to exclude the test jars and the test scope dependencies
> from the assembly.
>
> The advantages would be:
> * Smaller distro size
> * Faster startup (this is marginal)
> * Less CVE-prone JARs in the binary assemblies
>
> The other issue is that the assembly includes much of the Hadoop
> distribution.
> The basic assumption in all scripts and instructions is that the node has a
> fully configured Hadoop installation, and we include it in the classpath of
> HBase.
>
> If that is true, then there is no reason to include Hadoop in the assembly,
> HBase and its direct dependencies should be enough.
>
> One could argue that it would simplify the client side, which is true to
> some extent (though 95% of the client distro use cases are served better by
> simply using hbase-shaded-client).
>
> We could either remove the Hadoop libraries from either or both of the
> assemblies unconditionally, or provide two variants for either or both
> assemblies, one with Hadoop included, and one without it.
> Spark already does this, it has binary distributions both with and without
> Hadoop.
>
> The advantages would be:
> * Smaller distro size
> * Faster startup (this is marginal)
> * Less chance of conflicts with the Hadoop jars
> * Less CVE-prone JARs in the binary assemblies
>
>
> Thirdly, we could consider excluding the
> full-fat org.apache.hbase:hbase-shaded-client JAR from the Hadoop-less
> binary assemblies. It is not used by the assembly, and AFAIK it is not
> included in any of the 'hbase classpath' command variants.
>
> This would make sure that no Hadoop libraries are included (even in shaded
> form) and would make the HBase distribution fully insulated from Hadoop's
> CVE issues.
>
> (The full-fat hbase-shaded-client works best as direct build-time
> dependency anyway)
>
> best regards
> Istvan
>


Re: [DISCUSS] Removing tests and/or Hadoop from the binary assemblies

2024-03-05 Thread Istvan Toth
I agree, we don't want to omit those from the binary distro.
We should identify what those tools are. (Should be easy based on the
presence of main() or the Tool interface)
Such tools could either be moved into a new module, like hbase-tools, or
simply moved to the runtime JARs.
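
The identification step could be sketched roughly as below. This is only an illustration under stated assumptions: the class name ToolFinder and its helper methods are hypothetical, and Hadoop's Tool interface is matched by its fully qualified name (org.apache.hadoop.util.Tool) so that the sketch compiles without Hadoop on the classpath. A real audit would load every class from the test jars and run a check like this on each one.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

public class ToolFinder {
    // Hadoop's Tool interface, compared by name so this sketch needs no Hadoop dependency.
    static final String TOOL_INTERFACE = "org.apache.hadoop.util.Tool";

    // True if the class itself declares "public static void main(String[])".
    static boolean hasMain(Class<?> c) {
        try {
            Method m = c.getDeclaredMethod("main", String[].class);
            int mod = m.getModifiers();
            return Modifier.isPublic(mod) && Modifier.isStatic(mod)
                && m.getReturnType() == void.class;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    // True if the class is, extends, or implements a type named TOOL_INTERFACE.
    static boolean implementsTool(Class<?> c) {
        if (c == null) {
            return false;
        }
        if (c.getName().equals(TOOL_INTERFACE)) {
            return true;
        }
        for (Class<?> itf : c.getInterfaces()) {
            if (implementsTool(itf)) {
                return true;
            }
        }
        return implementsTool(c.getSuperclass());
    }

    // A class is a candidate tool if either heuristic matches.
    static boolean looksLikeTool(Class<?> c) {
        return hasMain(c) || implementsTool(c);
    }

    public static void main(String[] args) {
        // Demonstration on two classes that are always available.
        System.out.println("ToolFinder: " + looksLikeTool(ToolFinder.class));
        System.out.println("String: " + looksLikeTool(String.class));
    }
}
```

Both heuristics can over-report (for example, test helpers that happen to have a main()), so the resulting list would still need a manual review.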

Istvan

On Tue, Mar 5, 2024 at 10:34 AM 张铎(Duo Zhang)  wrote:

> There are some tools in the tests jar, such as PerformanceEvaluation.
>
> But anyway, maybe they should be moved to main...
>
> Istvan Toth wrote on Tue, Mar 5, 2024 at 16:14:
> >
> > DISCLAIMER: I don't have a patch ready, or even an elegant way mapped out
> > to achieve this, this is about discussing whether we even want to make
> > these changes.
> > These are also substantial changes, but they could be targeted for HBase
> > 3.0.
> >
> > One issue I have noticed is that we ship test jars and test dependencies
> > in the assembly.
> > I can't see anyone using those, but it bloats the assembly and classpath,
> > and adds unnecessary JARs with possible CVE issues. (for example Kerby
> > which is a Hadoop minicluster dependency)
> >
> > My proposal is to exclude the test jars and the test scope dependencies
> > from the assembly.
> >
> > The advantages would be:
> > * Smaller distro size
> > * Faster startup (this is marginal)
> > * Less CVE-prone JARs in the binary assemblies
> >
> > The other issue is that the assembly includes much of the Hadoop
> > distribution.
> > The basic assumption in all scripts and instructions is that the node
> > has a fully configured Hadoop installation, and we include it in the
> > classpath of HBase.
> >
> > If that is true, then there is no reason to include Hadoop in the
> > assembly, HBase and its direct dependencies should be enough.
> >
> > One could argue that it would simplify the client side, which is true to
> > some extent (though 95% of the client distro use cases are served better
> > by simply using hbase-shaded-client).
> >
> > We could either remove the Hadoop libraries from either or both of the
> > assemblies unconditionally, or provide two variants for either or both
> > assemblies, one with Hadoop included, and one without it.
> > Spark already does this, it has binary distributions both with and
> > without Hadoop.
> >
> > The advantages would be:
> > * Smaller distro size
> > * Faster startup (this is marginal)
> > * Less chance of conflicts with the Hadoop jars
> > * Less CVE-prone JARs in the binary assemblies
> >
> >
> > Thirdly, we could consider excluding the
> > full-fat org.apache.hbase:hbase-shaded-client JAR from the Hadoop-less
> > binary assemblies. It is not used by the assembly, and AFAIK it is not
> > included in any of the 'hbase classpath' command variants.
> >
> > This would make sure that no Hadoop libraries are included (even in
> > shaded form) and would make the HBase distribution fully insulated from
> > Hadoop's CVE issues.
> >
> > (The full-fat hbase-shaded-client works best as direct build-time
> > dependency anyway)
> >
> > best regards
> > Istvan
>


Re: [DISCUSS] Removing tests and/or Hadoop from the binary assemblies

2024-03-05 Thread Duo Zhang
There are some tools in the tests jar, such as PerformanceEvaluation.

But anyway, maybe they should be moved to main...

Istvan Toth wrote on Tue, Mar 5, 2024 at 16:14:
>
> DISCLAIMER: I don't have a patch ready, or even an elegant way mapped out
> to achieve this, this is about discussing whether we even want to make
> these changes.
> These are also substantial changes, but they could be targeted for HBase
> 3.0.
>
> One issue I have noticed is that we ship test jars and test dependencies in
> the assembly.
> I can't see anyone using those, but it bloats the assembly and classpath,
> and adds unnecessary JARs with possible CVE issues. (for example Kerby
> which is a Hadoop minicluster dependency)
>
> My proposal is to exclude the test jars and the test scope dependencies
> from the assembly.
>
> The advantages would be:
> * Smaller distro size
> * Faster startup (this is marginal)
> * Less CVE-prone JARs in the binary assemblies
>
> The other issue is that the assembly includes much of the Hadoop
> distribution.
> The basic assumption in all scripts and instructions is that the node has a
> fully configured Hadoop installation, and we include it in the classpath of
> HBase.
>
> If that is true, then there is no reason to include Hadoop in the assembly,
> HBase and its direct dependencies should be enough.
>
> One could argue that it would simplify the client side, which is true to
> some extent (though 95% of the client distro use cases are served better by
> simply using hbase-shaded-client).
>
> We could either remove the Hadoop libraries from either or both of the
> assemblies unconditionally, or provide two variants for either or both
> assemblies, one with Hadoop included, and one without it.
> Spark already does this, it has binary distributions both with and without
> Hadoop.
>
> The advantages would be:
> * Smaller distro size
> * Faster startup (this is marginal)
> * Less chance of conflicts with the Hadoop jars
> * Less CVE-prone JARs in the binary assemblies
>
>
> Thirdly, we could consider excluding the
> full-fat org.apache.hbase:hbase-shaded-client JAR from the Hadoop-less
> binary assemblies. It is not used by the assembly, and AFAIK it is not
> included in any of the 'hbase classpath' command variants.
>
> This would make sure that no Hadoop libraries are included (even in shaded
> form) and would make the HBase distribution fully insulated from Hadoop's
> CVE issues.
>
> (The full-fat hbase-shaded-client works best as direct build-time
> dependency anyway)
>
> best regards
> Istvan


[DISCUSS] Removing tests and/or Hadoop from the binary assemblies

2024-03-05 Thread Istvan Toth
DISCLAIMER: I don't have a patch ready, or even an elegant way mapped out
to achieve this, this is about discussing whether we even want to make
these changes.
These are also substantial changes, but they could be targeted for HBase
3.0.

One issue I have noticed is that we ship test jars and test dependencies in
the assembly.
I can't see anyone using those, but it bloats the assembly and classpath,
and adds unnecessary JARs with possible CVE issues. (for example Kerby
which is a Hadoop minicluster dependency)

My proposal is to exclude the test jars and the test scope dependencies
from the assembly.

The advantages would be:
* Smaller distro size
* Faster startup (this is marginal)
* Less CVE-prone JARs in the binary assemblies
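
To get a rough idea of what such an exclusion would remove, one could audit the lib directory of an unpacked assembly. The sketch below is only an illustration: the class TestJarAudit and the file names in it are hypothetical, and it relies on the Maven convention that test jars carry the "tests" classifier, so their file names end in "-tests.jar". It does not detect test scope dependencies such as Kerby; those would have to be found from the build's dependency scopes.

```java
import java.util.List;
import java.util.stream.Collectors;

public class TestJarAudit {
    // Maven test jars are named "<artifactId>-<version>-tests.jar".
    static boolean isTestJar(String fileName) {
        return fileName.endsWith("-tests.jar");
    }

    // Returns the subset of file names that look like test jars, in input order.
    static List<String> findTestJars(List<String> libFileNames) {
        return libFileNames.stream()
            .filter(TestJarAudit::isTestJar)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical lib/ listing, for illustration only; versions are made up.
        List<String> lib = List.of(
            "hbase-server-x.y.z.jar",
            "hbase-server-x.y.z-tests.jar",
            "hadoop-hdfs-x.y.z-tests.jar",
            "kerby-util-x.y.z.jar");
        findTestJars(lib).forEach(System.out::println);
    }
}
```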

The other issue is that the assembly includes much of the Hadoop
distribution.
The basic assumption in all scripts and instructions is that the node has a
fully configured Hadoop installation, and we include it in the classpath of
HBase.

If that is true, then there is no reason to include Hadoop in the assembly,
HBase and its direct dependencies should be enough.

One could argue that it would simplify the client side, which is true to
some extent (though 95% of the client distro use cases are served better by
simply using hbase-shaded-client).

We could either remove the Hadoop libraries from either or both of the
assemblies unconditionally, or provide two variants for either or both
assemblies, one with Hadoop included, and one without it.
Spark already does this, it has binary distributions both with and without
Hadoop.

The advantages would be:
* Smaller distro size
* Faster startup (this is marginal)
* Less chance of conflicts with the Hadoop jars
* Less CVE-prone JARs in the binary assemblies


Thirdly, we could consider excluding the
full-fat org.apache.hbase:hbase-shaded-client JAR from the Hadoop-less
binary assemblies. It is not used by the assembly, and AFAIK it is not
included in any of the 'hbase classpath' command variants.

This would make sure that no Hadoop libraries are included (even in shaded
form) and would make the HBase distribution fully insulated from Hadoop's
CVE issues.

(The full-fat hbase-shaded-client works best as direct build-time
dependency anyway)

best regards
Istvan