[
https://issues.apache.org/jira/browse/HBASE-20218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16404350#comment-16404350
]
Saad Mufti commented on HBASE-20218:
------------------------------------
I deleted the original patch and submitted a new one against the master branch,
with what I hope is the right naming convention. One of the flags I added no
longer applies to master, as the master branch already has a flag
"hbase.TableSnapshotInputFormat.locality.enabled" which does the exact same
thing. That flag does not exist in HBase 1.4.0, which I was working with, so I
am hoping it gets backported along with the rest of my changes, if possible.
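For anyone skimming, here is a rough, self-contained sketch of how boolean flags
like these can gate behavior. It is not the code from the patch. A plain Map
stands in for the Hadoop Configuration object, and the class name, the key
"example.snapshot.restore.use.uuid.dir", and the default value of true for both
flags are assumptions made for illustration. Only the locality key is taken
from the master branch as quoted above.

```java
import java.util.Map;
import java.util.UUID;

/** Illustrative only: a Map replaces the Hadoop Configuration object. */
final class SnapshotFlagsSketch {
    // Key name as quoted in this comment (exists on master).
    static final String LOCALITY_ENABLED_KEY =
        "hbase.TableSnapshotInputFormat.locality.enabled";
    // Hypothetical key name, invented for this sketch.
    static final String RESTORE_UUID_DIR_KEY =
        "example.snapshot.restore.use.uuid.dir";

    /** Reads a boolean flag, falling back to the default when the key is unset. */
    static boolean getBoolean(Map<String, String> conf, String key, boolean dflt) {
        String value = conf.get(key);
        return value == null ? dflt : Boolean.parseBoolean(value.trim());
    }

    /** Appends a random UUID sub-directory unless the flag turns it off. */
    static String restoreDir(Map<String, String> conf, String baseDir) {
        if (getBoolean(conf, RESTORE_UUID_DIR_KEY, true)) {
            return baseDir + "/" + UUID.randomUUID();
        }
        return baseDir; // fixed prefix, no random component
    }

    /** True when split locality should be computed (assumed default: true). */
    static boolean shouldComputeLocality(Map<String, String> conf) {
        return getBoolean(conf, LOCALITY_ENABLED_KEY, true);
    }
}
```

With both flags set to "false", restoreDir returns the base directory unchanged
and shouldComputeLocality returns false, which is the combination that makes
sense when the snapshot lives in S3.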
Also, I had great trouble just compiling the master branch. I got all sorts of
dependency errors from Maven, and also compilation errors about various methods
that had the @Override annotation but did not actually override anything. In my
IDE I tried recompiling just the files I changed, and
TableSnapshotInputFormatImpl.java had compilation errors, but none of them
looked related to my patch in any way. I haven't had time yet to figure out
what was going on, so for now I am submitting my patch as is.
Cheers.
> Proposed Performance Enhancements For TableSnapshotInputFormat
> --------------------------------------------------------------
>
> Key: HBASE-20218
> URL: https://issues.apache.org/jira/browse/HBASE-20218
> Project: HBase
> Issue Type: Improvement
> Components: mapreduce
> Affects Versions: 1.4.0
> Environment: HBase 1.4.0 running in AWS EMR 5.12.0 with the HBase
> rootdir set to a folder in S3
>
> Reporter: Saad Mufti
> Priority: Minor
> Attachments: HBASE-20218.patch
>
>
> I have been testing a few Spark jobs we have at my company which work off of
> TableSnapshotInputFormat to read directly from the filesystem snapshots
> created on another EMR/HBase cluster and stored in S3. During performance
> testing I found various small changes which would greatly enhance performance.
> Right now we are running our jobs linked with a patched version of HBase
> 1.4.0 in which I made these changes, and I am hoping to submit my patch for
> review and eventual acceptance into the main codebase.
>
> The changes are:
>
> 1. A flag to control whether the snapshot restore uses a UUID-based random
> temp dir in the specified restore directory. We use the flag to turn this off
> so that we can benefit from an AWS S3-specific bucket partitioning scheme we
> have provisioned. The way S3 partitioning works, you have to give a fixed
> path prefix and a pattern of files after that, and AWS can then partition on
> the paths after the fixed prefix into different resources to get more
> parallelization. We were advised by AWS that we could only get this good
> partitioning behavior if we didn't have that random directory in there.
>
> 2. A flag to turn off the code that tries to compute locality information
> for the splits. This is useless when dealing with S3, since the files are not
> on the cluster, so there is no use in computing locality. Worse yet, it
> uses a single thread in the driver to iterate over all the files in the
> restored snapshot. For a very large table this was taking hours and hours
> iterating through S3 objects just to list them (about 360,000 of them for
> our specific table).
>
> 3. A flag to override the column family schema setting to prefetch regions on
> open. Prefetching was causing the main executor thread on which a Spark task
> was running, and which was trying to read through HFiles for its scan, to
> compete for a lock on the underlying EMRFS stream object with prefetch threads
> trying to read the same file. Most tasks in the Spark stage would finish, but
> the last few would linger half an hour or more, alternating with the prefetch
> threads for that lock. This is the only change that had to be outside the
> mapreduce package, as it directly affects the prefetch behavior in
> CacheConfig.java.
>
> 4. A flag to turn off maintenance of Scan metrics. This was also causing a
> major slowdown; getting rid of it sped things up 4-5 times. What I observed
> in the thread dumps was that every call to update scan metrics was trying to
> get some HBase counter object and, deep underneath, was trying to access some
> Java resource bundle and throwing an exception that it wasn't found. The
> exception was never visible at the application level and was swallowed
> underneath, but whatever it was doing was causing a major slowdown. So we use
> this flag to avoid collecting those metrics, because we never used them.
>
> I am polishing my patch a bit more and hopefully will attach it tomorrow. One
> caveat: I tried but struggled to write any useful unit/component tests for
> these, as they are invisible behaviors that do not affect the final result at
> all. I am also not that familiar with the HBase testing standards, so for now
> I am looking for guidance on what to test.
>
> Would appreciate any feedback plus guidance on writing tests, provided of
> course there is interest in incorporating my patch into the main codebase.
>
> Cheers.
>
> ----Saad
>