Eric Hanson created HADOOP-11188:
------------------------------------
Summary: hadoop-azure: automatically expand page blobs when they become full
Key: HADOOP-11188
URL: https://issues.apache.org/jira/browse/HADOOP-11188
Project: Hadoop Common
Issue Type: Improvement
Components: fs
Reporter: Eric Hanson
Right now, page blobs are initialized to a fixed size (fs.azure.page.blob.size)
and cannot be expanded. This task is to make them expand automatically when
they become nearly full.
Design: if a write occurs that does not have enough room in the file to finish,
flush all preceding operations, extend the file, and then complete the write.
Access to PageBlobOutputStream will be synchronized (exclusive access) so there
are no race conditions.
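A minimal sketch of what that synchronized write path could look like; the
helper names (flushPendingWrites, extendBlob, computeExtension, enqueueWrite)
and fields are hypothetical, not the actual PageBlobOutputStream members:
{code:java}
// Hypothetical sketch of the expanding write path; names are illustrative,
// not the real PageBlobOutputStream code.
@Override
public synchronized void write(byte[] data, int offset, int length)
    throws IOException {
  if (currentOffset + length > blobSize) {
    // Not enough room to finish this write: flush everything queued so
    // far, then grow the blob before completing the write.
    flushPendingWrites();
    long extension = computeExtension(blobSize, extensionSize);
    extendBlob(extension);
    blobSize += extension;
  }
  enqueueWrite(data, offset, length);
  currentOffset += length;
}
{code}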
The file will be extended by fs.azure.page.blob.extension.size bytes, which
must be a multiple of 512. The internal default for
fs.azure.page.blob.extension.size will be 128 * 1024 * 1024. The minimum
extension size will be 4 * 1024 * 1024, which is the maximum write size, so
the new write is guaranteed to finish.
Extension will stop when the file size reaches 1TB. The final extension may be
less than fs.azure.page.blob.extension.size if the remainder (1TB -
current_file_size) is smaller than fs.azure.page.blob.extension.size.
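The cap itself is simple arithmetic; a sketch (the constant name is
illustrative):
{code:java}
// Sketch of the 1TB cap; a result of 0 means the blob is full and cannot
// be extended further.
static final long MAX_PAGE_BLOB_SIZE = 1024L * 1024 * 1024 * 1024; // 1 TB

static long computeExtension(long currentFileSize, long extensionSize) {
  long remaining = MAX_PAGE_BLOB_SIZE - currentFileSize;
  return Math.min(extensionSize, Math.max(remaining, 0));
}
{code}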
An alternative is to make the default size 1TB. This is much simpler to
implement (a one-line change). Or, simpler still, leave it unchanged, since
the current default is adequate for HBase.
Rationale for this file size extension feature:
1) Users can download files to local disk easily with CloudXplorer and similar
tools. Downloading a 1TB page blob is not practical without 1TB of local disk
space, since on the local side the file expands to its full size, with zeros
filling the regions that hold no valid data.
2) Customers are not made uncomfortable by seeing large 1TB files. They often
ask whether they have to pay for the full size, even though they only pay for
the space actually used in the page blob.
I think rationale 2 is a relatively minor issue, because 98% of HBase customers
will never notice. They will just use it and not look at what kind of files are
used for the logs. They don't pay for the unused space, so it is not a problem
for them. We can document this. Also, if they use hadoop fs -ls, they will see
the actual size of the files, since I put in a fix for that.
Rationale 1 is a minor issue because you cannot interpret the data on your
local file system anyway due to the data format. So really, the only reason to
copy data locally in its binary format would be if you are moving it around or
archiving it. Copying a 1TB page blob from one location in the cloud to another
is pretty fast with smart copy utilities that don't actually move the 0-filled
parts of the file.
Nevertheless, this is a convenience feature for users. They won't have to worry
about setting fs.azure.page.blob.size under normal circumstances and can let
files grow as large as they need.
If we extend the file size on the fly, that introduces possible new error and
failure modes for HBase. We should include retry logic.
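For example, the extension call could be wrapped in a simple bounded retry;
the attempt count and backoff values here are illustrative only, and
extendBlob is the assumed helper from the sketch above:
{code:java}
// Hypothetical bounded-retry wrapper around the (assumed) extendBlob call.
private void extendBlobWithRetry(long extensionBytes) throws IOException {
  final int maxAttempts = 3;
  IOException lastFailure = null;
  for (int attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      extendBlob(extensionBytes);
      return;
    } catch (IOException e) {
      lastFailure = e;
      try {
        Thread.sleep(1000L * attempt); // linear backoff between attempts
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
        throw new IOException("Interrupted while retrying extension", ie);
      }
    }
  }
  throw lastFailure;
}
{code}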