[
https://issues.apache.org/jira/browse/HBASE-26909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Bryan Beaudreault updated HBASE-26909:
--------------------------------------
Assignee: Bryan Beaudreault
Labels: patch-available (was: )
Status: Patch Available (was: Open)
I did this a little backwards – attached PR is for branch-2. I have a virtually
identical PR ready for master as well, but don't want to confuse things by
submitting early. I can submit once we're ready.
In the end I decided not to tackle the hadoop dependency issue noted above.
I'll potentially file a separate Jira for that later, as I think it will be
really tricky.
> hbase-shaded-mapreduce and hbase-shaded-client expose some of the same classes
> ------------------------------------------------------------------------------
>
> Key: HBASE-26909
> URL: https://issues.apache.org/jira/browse/HBASE-26909
> Project: HBase
> Issue Type: Improvement
> Reporter: Bryan Beaudreault
> Assignee: Bryan Beaudreault
> Priority: Major
> Labels: patch-available
>
> We supply 2 primary artifacts for end-users to consume:
> * hbase-shaded-client, which is for general use
> * hbase-shaded-mapreduce, which is for use when you need to connect to hbase
> via mapreduce. For example, TableInputFormat
> The problem is that these artifacts expose tons of duplicate classes. One
> example (among many) is org.apache.hadoop.hbase.Cell, which appears in both
> jars.
> This may not be a problem if your projects are always very isolated – either
> doing mapreduce, or not. In that case you just depend on the one you need.
> Many users might exist in much more complicated environments where
> dependencies tend to bleed between projects. Here's an
> illustration:
> * Imagine a project FooService, which includes two modules FooServiceRestWeb
> (for the REST HTTP resources) and FooServiceData (which includes DAOs for
> accessing data). FooServiceRestWeb depends on FooServiceData to access hbase.
> In this case, FooServiceData should depend on hbase-shaded-client.
> * Now imagine another project FooPipeline, which has modules
> FooPipelineHadoop (with M/R jobs for processing data) and FooPipelineData
> (which has some DAOs for accessing data). In this case, FooPipelineData might
> depend on hbase-shaded-mapreduce since the context is intended for M/R.
> * The problem arises when suddenly we want to include some data from
> FooService into our pipeline. The most straightforward way to achieve this is
> by depending on FooServiceData, which has all of the DAOs for that data but
> also depends on hbase-shaded-client. At this point you have a problem,
> because FooPipelineHadoop now depends on both hbase-shaded-mapreduce and
> hbase-shaded-client.
> (Note, this obviously skirts around potential microservice solutions like
> only accessing FooService's data through the API... it's just for
> illustration, and it does come up.)
> From a plain java perspective, having these 2 jars on the classpath is
> somewhat wasteful but not a huge issue since the implementations are all the
> same.
> From a maven perspective, it's problematic because the maven dependency
> plugin will complain about the conflicting classes.
> One potential fix is to add exclusions to the FooServiceData dependency, to
> avoid pulling in hbase-shaded-client. This works on a one-off basis but is
> much more painful in a large and complicated environment where this may come
> up hundreds of times.
> A better fix in my opinion is to make hbase-shaded-mapreduce depend on
> hbase-shaded-client and then only expose the classes that aren't already
> exposed by the shaded client.
> [~busbey] also mentioned a BOM being a potential solution, but I don't have
> experience with that.
>
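
To see how much the two artifacts overlap, one option is to compare the entry names of the two jars directly. The following is only a rough sketch, not part of the attached PR: the class name DuplicateClassFinder and its methods are hypothetical, and the jar paths are supplied by the caller. It uses only java.util.jar from the JDK.

```java
import java.io.IOException;
import java.util.Collection;
import java.util.Set;
import java.util.TreeSet;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;
import java.util.stream.Collectors;

// Hypothetical helper (not from HBase): reports .class entries present in two jars.
public class DuplicateClassFinder {

    /** Returns the ".class" entry names that appear in both collections, sorted. */
    public static Set<String> commonClassEntries(Collection<String> first,
                                                 Collection<String> second) {
        Set<String> common = new TreeSet<>();
        for (String name : first) {
            // Ignore resources such as META-INF/MANIFEST.MF; only classes matter here.
            if (name.endsWith(".class")) {
                common.add(name);
            }
        }
        common.retainAll(second);
        return common;
    }

    /** Lists every entry name in the jar at the given path. */
    public static Set<String> entryNames(String jarPath) throws IOException {
        try (JarFile jar = new JarFile(jarPath)) {
            return jar.stream().map(JarEntry::getName).collect(Collectors.toSet());
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: DuplicateClassFinder <first.jar> <second.jar>");
            return;
        }
        Set<String> dupes = commonClassEntries(entryNames(args[0]), entryNames(args[1]));
        dupes.forEach(System.out::println);
        System.out.println(dupes.size() + " duplicate class entries");
    }
}
```

Run against locally built hbase-shaded-client and hbase-shaded-mapreduce jars, this should list entries such as org/apache/hadoop/hbase/Cell.class if the overlap described in the issue is present. It compares entry names only, so it does not confirm that the class bytes are identical.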