[
https://issues.apache.org/jira/browse/HBASE-20218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16744707#comment-16744707
]
Hadoop QA commented on HBASE-20218:
-----------------------------------
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} patch {color} | {color:red} 0m 6s{color} | {color:red} HBASE-20218 does not apply to master. Rebase required? Wrong Branch? See https://yetus.apache.org/documentation/0.8.0/precommit-patchnames for help. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | HBASE-20218 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12915171/HBASE-20218.01.patch |
| Console output | https://builds.apache.org/job/PreCommit-HBASE-Build/15611/console |
| Powered by | Apache Yetus 0.8.0 http://yetus.apache.org |
This message was automatically generated.
> Proposed Performance Enhancements For TableSnapshotInputFormat
> --------------------------------------------------------------
>
> Key: HBASE-20218
> URL: https://issues.apache.org/jira/browse/HBASE-20218
> Project: HBase
> Issue Type: Improvement
> Components: mapreduce
> Affects Versions: 1.4.0
> Environment: HBase 1.4.0 running in AWS EMR 5.12.0 with the HBase
> rootdir set to a folder in S3
>
> Reporter: Saad Mufti
> Priority: Minor
> Labels: s3
> Attachments: HBASE-20218.01.patch, HBASE-20218.patch
>
>
> I have been testing a few Spark jobs at my company that use
> TableSnapshotInputFormat to read directly from filesystem snapshots created
> on another EMR/HBase cluster and stored in S3. During performance testing I
> found several small changes that would greatly enhance performance. Right now
> we are running our jobs linked against a patched version of HBase 1.4.0 that
> contains these changes, and I am hoping to submit my patch for review and
> eventual acceptance into the main codebase.
>
> The list of changes is:
>
> 1. a flag to control whether the snapshot restore uses a UUID-based random
> temp dir under the specified restore directory. We use the flag to turn this
> off so that we can benefit from an AWS S3-specific bucket partitioning scheme
> we have provisioned. The way S3 partitioning works, you give AWS a fixed path
> prefix and a pattern for the files after it, and AWS then spreads the paths
> after that fixed prefix across different resources for more parallelism. We
> were advised by AWS that we could only get this partitioning behavior if we
> did not have that random directory in the path.
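>
> As an illustration only, here is a minimal sketch of how such a flag could
> gate the random directory, assuming Hadoop's Configuration; the property name
> and helper class are hypothetical, not the actual names in the patch:
> {code:java}
> import java.util.UUID;
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.Path;
>
> public class RestoreDirSketch {
>   // Hypothetical property name; the key used by the patch may differ.
>   public static final String USE_RANDOM_RESTORE_DIR =
>       "hbase.mapreduce.snapshot.restore.use.random.dir";
>
>   // With the flag on (current behavior) a random UUID subdirectory is
>   // appended; with it off the caller-supplied directory is used directly,
>   // so the S3 path keeps a fixed, partition-friendly prefix.
>   static Path resolveRestoreDir(Configuration conf, Path restoreDir) {
>     boolean useRandomDir = conf.getBoolean(USE_RANDOM_RESTORE_DIR, true);
>     return useRandomDir
>         ? new Path(restoreDir, UUID.randomUUID().toString())
>         : restoreDir;
>   }
>
>   public static void main(String[] args) {
>     Configuration conf = new Configuration();
>     conf.setBoolean(USE_RANDOM_RESTORE_DIR, false);
>     System.out.println(resolveRestoreDir(conf, new Path("s3://bucket/restore")));
>   }
> }
> {code}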
>
> 2. a flag to turn off the code that computes locality information for the
> splits. This is pointless when the data is in S3: the files are not on the
> cluster, so there is no locality to exploit, and worse, the computation uses
> a single thread in the driver to iterate over all the files in the restored
> snapshot. For a very large table this took hours, iterating through S3
> objects just to list them (about 360,000 of them for our specific table).
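>
> A rough sketch of the guard, again with a hypothetical property name and a
> simplified method standing in for the real split-building code:
> {code:java}
> import java.util.Collections;
> import java.util.List;
>
> import org.apache.hadoop.conf.Configuration;
>
> public class SplitLocalitySketch {
>   // Hypothetical property name; the key used by the patch may differ.
>   public static final String LOCALITY_ENABLED =
>       "hbase.mapreduce.snapshot.locality.enabled";
>
>   // When locality is disabled (the HFiles live in S3, not on the cluster),
>   // return no preferred hosts instead of walking every restored file to
>   // build a block distribution, which is where the hours of S3 listing went.
>   static List<String> getBestLocations(Configuration conf) {
>     if (!conf.getBoolean(LOCALITY_ENABLED, true)) {
>       return Collections.emptyList();
>     }
>     // ... the existing block-distribution based calculation would run here ...
>     return Collections.emptyList();
>   }
> }
> {code}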
>
> 3. a flag to override the column family schema setting that prefetches
> regions on open. The main executor thread running a Spark task, while reading
> through HFiles for its scan, had to compete with prefetch threads reading the
> same file for a lock on the underlying EMRFS stream object. Most tasks in a
> Spark stage would finish, but the last few would linger for half an hour or
> more, alternately contending with the prefetch threads for the lock on an
> EMRFS stream object. This is the only change that had to go outside the
> mapreduce package, as it directly affects the prefetch behavior in
> CacheConfig.java.
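>
> A minimal sketch of the override, assuming the decision is centralized in one
> helper; the property name is hypothetical and the real change lives in
> CacheConfig:
> {code:java}
> import org.apache.hadoop.conf.Configuration;
>
> public class PrefetchOverrideSketch {
>   // Hypothetical property name; the key used by the patch may differ.
>   public static final String PREFETCH_ON_OPEN_OVERRIDE =
>       "hbase.block.prefetch.on.open.override";
>
>   // Effective prefetch-on-open decision: normally taken from the column
>   // family schema, but a read-only snapshot job can force it off so task
>   // threads do not contend with prefetch threads for the same stream lock.
>   static boolean shouldPrefetchOnOpen(Configuration conf, boolean familySetting) {
>     if (conf.get(PREFETCH_ON_OPEN_OVERRIDE) != null) {
>       return conf.getBoolean(PREFETCH_ON_OPEN_OVERRIDE, familySetting);
>     }
>     return familySetting;
>   }
> }
> {code}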
>
> 4. a flag to turn off maintenance of Scan metrics. This was also causing a
> major slowdown; getting rid of it sped things up 4-5 times. What I observed
> in the thread dumps was that every call to update scan metrics tried to get
> some HBase counter object and, deep underneath, tried to access a Java
> resource bundle, throwing an exception that it wasn't found. The exception
> was never visible at the application level and was swallowed underneath, but
> whatever it was doing caused a major slowdown. So we use this flag to avoid
> collecting those metrics, because we never used them.
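>
> A sketch of the guard, using a hypothetical property name and a plain counter
> in place of the real ScanMetrics plumbing:
> {code:java}
> import java.util.concurrent.atomic.AtomicLong;
>
> import org.apache.hadoop.conf.Configuration;
>
> public class ScanMetricsSketch {
>   // Hypothetical property name; the key used by the patch may differ.
>   public static final String SCAN_METRICS_ENABLED =
>       "hbase.mapreduce.snapshot.scanmetrics.enabled";
>
>   private final boolean metricsEnabled;
>   private final AtomicLong rowsScanned = new AtomicLong();
>
>   ScanMetricsSketch(Configuration conf) {
>     this.metricsEnabled = conf.getBoolean(SCAN_METRICS_ENABLED, true);
>   }
>
>   // Called per row by the record reader; with the flag off, the counter
>   // update (and the expensive metrics machinery behind it) is skipped.
>   void onRow() {
>     if (metricsEnabled) {
>       rowsScanned.incrementAndGet();
>     }
>   }
> }
> {code}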
>
> I am polishing my patch a bit more and hopefully will attach it tomorrow. One
> caveat: I tried, but struggled, to write any useful unit/component tests for
> these changes, as they are invisible behaviors that do not affect the final
> result at all. I am also not very familiar with HBase testing standards, so
> for now I am looking for guidance on what to test.
>
> Would appreciate any feedback plus guidance on writing tests, provided of
> course there is interest in incorporating my patch into the main codebase.
>
> Cheers.
>
> ----Saad
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)