Coming full circle on the "makes me worry" comment I left:
I asked about my concern in work channels, and SteveL confirmed that the
"S3 strong consistency" feature does apply generally to CRUD operations.
I believe this means, if we assume there is exactly one RegionServer
which is hosting a Region at one time, that one RegionServer is capable
of ensuring that the gaps which do exist in S3 are a non-issue (without
the need for an HBOSS-like solution).
Taking the suggestion of a file-per-store which enumerates the committed
files: the RegionServer can make sure that operations which concurrently
want to update that file are exclusive, e.g. a bulk load, a memstore
flush, a compaction commit.
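Roughly, something like the following (the class and method names here are illustrative only, not the actual HBase code):

// Minimal sketch (not the actual HBase implementation): a per-store manifest
// whose committed-file list is only rewritten while holding a store-level lock,
// so a bulk load, a memstore flush, and a compaction commit cannot interleave.
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

public class StoreFileManifest {
  private final ReentrantLock commitLock = new ReentrantLock();
  private final List<String> committedFiles = new ArrayList<>();

  /** Exclusive (per this RegionServer) swap of old files for new ones, e.g. a compaction commit. */
  public void commit(Collection<String> filesToRemove, Collection<String> filesToAdd) {
    commitLock.lock();
    try {
      committedFiles.removeAll(filesToRemove);
      committedFiles.addAll(filesToAdd);
      persist(); // rewrite the store's manifest object while holding the lock
    } finally {
      commitLock.unlock();
    }
  }

  private void persist() {
    // Write the current committedFiles list to the store's manifest file.
    // With exactly one RegionServer hosting the region, and S3's strong
    // consistency, readers see either the old or the new complete list.
  }
}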
On my plate today is to incorporate this into a design doc specifically
for storefile metadata (from the other message in this broader thread).
On 5/24/21 1:39 PM, Josh Elser wrote:
I got pulled into a call with some folks from S3 at the last minute last
week.
There was a comment made in passing about reading the latest written
version of a file. At the moment, I didn't want to digress into that
because of immutable HFiles. However, if we're tracking files-per-store
in a file, that makes me worry.
To the nice digging both Duo and Andrew have shared here already and
Nick's point about design, I definitely think stating what we expect and
mapping that to the "platforms" which provide that "today" (as we know
each will change) is the only way to insulate ourselves. The Hadoop FS
contract tests are also a great thing we can adopt.
On 5/21/21 9:53 PM, 张铎(Duo Zhang) wrote:
So maybe we could introduce a .hfilelist directory, and put the hfilelist
files under this directory, so we do not need to list all the files under
the region directory.
And considering the likely implementations in typical object storages,
listing only this last directory on the path will be less expensive.
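For illustration, with the general FileSystem API that would look something like this (the path layout is hypothetical):

// Sketch: list only the store's .hfilelist directory instead of the whole
// store or region directory. The path layout here is illustrative.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HFileListLookup {
  public static FileStatus[] listManifests(Configuration conf, Path storeDir) throws IOException {
    Path hfileListDir = new Path(storeDir, ".hfilelist");
    FileSystem fs = hfileListDir.getFileSystem(conf);
    // A single shallow LIST against the ".hfilelist/" prefix, rather than
    // enumerating every hfile under the store directory.
    return fs.listStatus(hfileListDir);
  }
}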
Andrew Purtell <[email protected]> wrote on Saturday, May 22, 2021 at 9:35 AM:
On May 21, 2021, at 6:07 PM, 张铎 <[email protected]> wrote:
Since we just make use of the general FileSystem API to do listing, is it
possible to make use of 'bucket index listing'?
Yes, those words mean the same thing.
Andrew Purtell <[email protected]> wrote on Saturday, May 22, 2021 at 6:34 AM:
On May 20, 2021, at 4:00 AM, Wellington Chevreuil <[email protected]> wrote:
IMO it should be a file per store.
Per region is not suitable here as compaction is per store.
Per file means we still need to list all the files. And usually, after
compaction, we need to do an atomic operation to remove several old files
and add a new file, or even several files for stripe compaction. It will
be easy if we just write one file to commit these changes.
Fine for me if it's simpler. Mentioned the per-file approach because I
thought it could be easier/faster to do that, rather than having to
update the store file list on every flush. AFAIK, append is off the
table, so updating this file would mean read it, write the original
content plus the new hfile to a temp file, delete the original file, and
rename it.
That sounds right to me.
A minor potential optimization is that the filename could have a
timestamp component, so a bucket index listing at that path would pick up
a list including the latest, and the latest would be used as the manifest
of valid store files. The cloud object store is expected to provide an
atomic listing semantic where the file is written and closed and only
then is it visible, and it is visible at once to everyone. (I think this
is available on most.) Old manifest file versions could be lazily deleted.
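As a sketch of how that could look (the manifest naming and the helper class are made up for illustration, not an actual implementation):

// Sketch only: manifests are written under <store>/.manifest/ with a
// timestamp component in the name; the lexicographically greatest name wins.
import java.io.IOException;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TimestampedManifest {
  private final FileSystem fs;
  private final Path manifestDir;

  public TimestampedManifest(Configuration conf, Path storeDir) throws IOException {
    this.manifestDir = new Path(storeDir, ".manifest");
    this.fs = manifestDir.getFileSystem(conf);
  }

  /** Write a brand new manifest object; no append, no rename of an existing object. */
  public void writeManifest(List<String> committedHFiles) throws IOException {
    String name = String.format("manifest.%020d", System.currentTimeMillis());
    try (FSDataOutputStream out = fs.create(new Path(manifestDir, name), false)) {
      for (String hfile : committedHFiles) {
        out.writeBytes(hfile + "\n");
      }
    } // the object becomes visible only after close, all at once, on a strongly consistent store
  }

  /** Pick the newest manifest from a single directory listing; older ones can be lazily deleted. */
  public Optional<Path> latestManifest() throws IOException {
    FileStatus[] statuses = fs.listStatus(manifestDir);
    return Arrays.stream(statuses)
        .map(FileStatus::getPath)
        .max(Comparator.comparing(Path::getName));
  }
}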
On Thursday, May 20, 2021 at 02:57, 张铎(Duo Zhang) <[email protected]> wrote:
IIRC, S3 was the only object storage which did not guarantee
read-after-write consistency in the past...
This is the quick result after googling:
AWS [1]
Amazon S3 delivers strong read-after-write consistency automatically for
all applications.

Azure [2]
Azure Storage was designed to embrace a strong consistency model that
guarantees that after the service performs an insert or update operation,
subsequent read operations return the latest update.

Aliyun [3]
A feature requires that object operations in OSS be atomic, which
indicates that operations can only either succeed or fail without
intermediate states. To ensure that users can access only complete data,
OSS does not return corrupted or partial data. Object operations in OSS
are highly consistent. For example, when a user receives an upload (PUT)
success response, the uploaded object can be read immediately, and copies
of the object are written to multiple devices for redundancy. Therefore,
the situations where data is not obtained when you perform the
read-after-write operation do not exist. The same is true for delete
operations. After you delete an object, the object and its copies no
longer exist.

GCP [4]
Cloud Storage provides strong global consistency for the following
operations, including both data and metadata:
Read-after-write
Read-after-metadata-update
Read-after-delete
Bucket listing
Object listing
I think these vendors could cover most end users in the world?
1. https://aws.amazon.com/cn/s3/consistency/
2. https://docs.microsoft.com/en-us/azure/storage/blobs/concurrency-manage?tabs=dotnet
3. https://www.alibabacloud.com/help/doc-detail/31827.htm
4. https://cloud.google.com/storage/docs/consistency
Nick Dimiduk <[email protected]> wrote on Wednesday, May 19, 2021 at 11:40 PM:
On Wed, May 19, 2021 at 8:19 AM 张铎(Duo Zhang) <[email protected]> wrote:
What about just storing the hfile list in a file? Since S3 now has strong
consistency, we could safely overwrite a file then, I think?
My concern is about portability. S3 isn't the only blob store in town,
and consistent read-what-you-wrote semantics are not a standard feature,
as far as I know. If we want something that can work on 3 or 5 major
public cloud blobstore products as well as a smattering of on-prem
technologies, we should be selective about what features we choose to
rely on as foundational to our implementation.

Or we are explicitly saying this will only work on S3 and we'll only
support other services when they can achieve this level of compatibility.
Either way, we should be clear and up-front about what semantics we
demand.
Implementing some kind of a test harness that can check compatibility
would help here, a similar effort to that of defining standard behaviors
of HDFS implementations.
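As a sketch of the kind of probe such a harness could run (my own illustration, not the actual Hadoop FS contract test code):

// Sketch of a read-after-write probe, in the spirit of the Hadoop FS contract
// tests: write and close an object, then require that it is immediately
// visible to both getFileStatus()/exists() and a directory listing.
import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadAfterWriteProbe {
  public static void assertReadAfterWrite(FileSystem fs, Path dir) throws IOException {
    Path probe = new Path(dir, "probe-" + System.nanoTime());
    try (FSDataOutputStream out = fs.create(probe, false)) {
      out.writeBytes("probe\n");
    }
    if (!fs.exists(probe)) {
      throw new AssertionError("read-after-write violated: " + probe + " not visible");
    }
    FileStatus[] listing = fs.listStatus(dir);
    boolean listed = Arrays.stream(listing).anyMatch(s -> s.getPath().equals(probe));
    if (!listed) {
      throw new AssertionError("list-after-write violated: " + probe + " missing from listing");
    }
  }
}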
I love this discussion :)
And since the hfile list file will be very small, renaming will not be a
big problem.
Would this be a file per store? A file per region? Ah. Below you imply
it's per store.
Wellington Chevreuil <[email protected]> wrote on Wednesday, May 19, 2021 at 10:43 PM:
Thank you, Andrew and Duo,
Talking internally with Josh Elser, the initial idea was to rebase the
feature branch onto master (in order to catch up with the latest
commits), then focus on work to have a minimal functioning hbase; in
other words, together with the already committed work from HBASE-25391,
make sure flush, compactions, splits and merges can all take advantage of
the persistent store file manager and complete with no need to rely on
renames. These all map to the subtasks HBASE-25391, HBASE-25392 and
HBASE-25393. Once we can test and validate this works well for our goals,
we can then focus on snapshots, bulkloading and tooling.
S3 now supports strong consistency, and I heard that they are also
implementing atomic renaming currently, so maybe that's one of the
reasons why the development is silent now...
Interesting, I had no idea this was being implemented. I know, however,
that a version of this feature is already available on the latest EMR
releases (at least from 6.2.0), and the AWS team has published their own
blog post with their results:
https://aws.amazon.com/blogs/big-data/amazon-emr-6-2-0-adds-persistent-hfile-tracking-to-improve-performance-with-hbase-on-amazon-s3/
But I do not think storing the hfile list in meta is the only solution.
It will cause cyclic dependencies for hbase:meta, and then force us to
have a fallback solution, which makes the code a bit ugly. We should try
to see if this could be done with only the FileSystem.
This is indeed a relevant concern. One idea I had mentioned in the
original design doc was to track committed/non-committed files through
xattr (or tags), which may have its own performance issues as explained
by Stephen Wu, but is something that could be attempted.
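For reference, a minimal sketch of the xattr idea (the attribute name is made up, and not every FileSystem implementation supports setting xattrs, which is part of the performance/portability concern):

// Sketch of tracking commit state with an extended attribute. The attribute
// name is illustrative; setXAttr support varies by FileSystem implementation.
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CommitStateXAttr {
  private static final String COMMITTED_ATTR = "user.hbase.committed";

  public static void markCommitted(FileSystem fs, Path hfile) throws IOException {
    fs.setXAttr(hfile, COMMITTED_ATTR, "true".getBytes(StandardCharsets.UTF_8));
  }

  public static boolean isCommitted(FileSystem fs, Path hfile) throws IOException {
    try {
      byte[] value = fs.getXAttr(hfile, COMMITTED_ATTR);
      return value != null && "true".equals(new String(value, StandardCharsets.UTF_8));
    } catch (IOException e) {
      // Many object store connectors do not support xattrs at all.
      return false;
    }
  }
}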
On Wednesday, May 19, 2021 at 04:56, 张铎(Duo Zhang) <[email protected]> wrote:
S3 now supports strong consistency, and I heard that they are also
implementing atomic renaming currently, so maybe that's one of the
reasons why the development is silent now...
For me, I also think deploying hbase on cloud storage is the future, so I
would also like to participate here.
But I do not think storing the hfile list in meta is the only solution.
It will cause cyclic dependencies for hbase:meta, and then force us to
have a fallback solution, which makes the code a bit ugly. We should try
to see if this could be done with only the FileSystem.
Thanks.
Andrew Purtell <[email protected]> wrote on Wednesday, May 19, 2021 at 8:04 AM:
Wellington (et al.),
S3 is also an important piece of our future production plans.
Unfortunately, we were unable to assist much with last year's work, on
account of being sidetracked by more immediate concerns. Fortunately,
this renewed interest is timely in that we have an HBase 2 project where,
if this can land in a 2.5 or a 2.6, it could be an important
cost-to-serve optimization, and one we could and would make use of.
Therefore I would like to restate my employer's interest in this work
too. It may just be Viraj and myself in the early days.

I'm not sure how best to collaborate. We could review changes from the
original authors, new changes, and/or divide up the development tasks. We
can certainly offer our time for testing, and can afford the costs of
testing against the S3 service.
On Tue, May 18, 2021 at 12:16 PM Wellington Chevreuil <[email protected]> wrote:
Greetings everyone,
HBASE-24749 was proposed almost a year ago, introducing a new StoreFile
tracker as a way to allow any hbase hfile modifications to be safely
completed without needing a file system rename. This seems pretty
relevant for deployments over S3 file systems, where rename operations
are not atomic and can suffer performance degradation when multiple
requests are concurrently submitted to the same bucket. We had done
superficial tests and ycsb runs, where individual renames of files larger
than 5GB can take a few hundred seconds to complete. We also observed
impacts on write load throughput, the bottleneck potentially being the
renames.

With S3 being an important piece of my employer's cloud solution, we
would like to help it move forward. We plan to contribute new patches per
the original design/Jira, but we'd also be happy to review changes from
the original authors, too. Please let us know if anyone has any concerns,
otherwise we'll start to self-assign issues on HBASE-24749.
Wellington
--
Best regards,
Andrew
Words like orphans lost among the crosstalk, meaning torn from truth's
decrepit hands
   - A23, Crosstalk