[
https://issues.apache.org/jira/browse/HADOOP-11188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14167635#comment-14167635
]
Chris Nauroth commented on HADOOP-11188:
----------------------------------------
I committed this to trunk.
> hadoop-azure: automatically expand page blobs when they become full
> -------------------------------------------------------------------
>
> Key: HADOOP-11188
> URL: https://issues.apache.org/jira/browse/HADOOP-11188
> Project: Hadoop Common
> Issue Type: Improvement
> Components: fs
> Reporter: Eric Hanson
> Assignee: Eric Hanson
> Attachments: hadoop-11188.01.patch
>
>
> Right now, page blobs are initialized to a fixed size
> (fs.azure.page.blob.size) and cannot be expanded. This task is to make them
> automatically expand when they get to be nearly full.
> Design: if a write occurs that does not have enough room left in the file to
> finish, then flush all preceding operations, extend the file, and complete
> the write. This will be synchronized (for exclusive access) inside
> PageBlobOutputStream so there won't be race conditions.
> The file will be extended by fs.azure.page.blob.extension.size bytes, which
> must be a multiple of 512. The internal default for
> fs.azure.page.blob.extension.size will be 128 * 1024 * 1024. The minimum
> extension size will be 4 * 1024 * 1024, which is the maximum write size, so
> the new write will always be able to finish.
> Extension will stop when the file size reaches 1TB. The final extension may
> be less than fs.azure.page.blob.extension.size if the remainder (1TB -
> current_file_size) is smaller than fs.azure.page.blob.extension.size.
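>
> Roughly, the write path could follow the sketch below. This is only a sketch
> of the design above; the class, field, and helper names (resizeBlob,
> queueWrite, etc.) are illustrative placeholders, not the actual hadoop-azure
> or Azure SDK API:
>
>   // Sketch only: conditional extension inside a page blob output stream.
>   abstract class PageBlobExtensionSketch extends java.io.OutputStream {
>     private static final long PAGE_SIZE = 512;
>     private static final long MAX_SIZE = 1024L * 1024 * 1024 * 1024;   // 1TB cap
>     private static final long MIN_EXTENSION = 4L * 1024 * 1024;        // max single write
>
>     private long capacity;                            // currently provisioned blob size
>     private long position;                            // next write offset in the stream
>     private long extensionSize = 128L * 1024 * 1024;  // fs.azure.page.blob.extension.size
>
>     @Override
>     public synchronized void write(byte[] data, int off, int len)
>         throws java.io.IOException {
>       long writeEnd = position + len;
>       if (writeEnd > capacity) {
>         flush();                                      // flush all preceding operations first
>         long extension = Math.max(extensionSize, MIN_EXTENSION);
>         extension = ((extension + PAGE_SIZE - 1) / PAGE_SIZE) * PAGE_SIZE; // 512 multiple
>         long newCapacity = Math.min(capacity + extension, MAX_SIZE);       // stop at 1TB
>         if (writeEnd > newCapacity) {
>           throw new java.io.IOException("page blob cannot grow past 1TB");
>         }
>         resizeBlob(newCapacity);                      // placeholder for the real resize call
>         capacity = newCapacity;
>       }
>       queueWrite(data, off, len);                     // hand off to the existing write path
>       position = writeEnd;
>     }
>
>     @Override
>     public void write(int b) throws java.io.IOException {
>       write(new byte[] { (byte) b }, 0, 1);
>     }
>
>     protected abstract void resizeBlob(long newSize) throws java.io.IOException;
>     protected abstract void queueWrite(byte[] d, int off, int len) throws java.io.IOException;
>   }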
> An alternative to this is to make the default size 1TB, which is much simpler
> to implement (a one-line change). Or, even simpler, don't change anything at
> all, since the current default is adequate for HBase.
> Rationale for this file size extension feature:
> 1) To be able to download files to local disk easily with CloudXplorer and
> similar tools. Downloading a 1TB page blob is not practical if you don't have
> 1TB of disk space, since on the local side it expands to the full file size,
> filled with zeros wherever there is no valid data.
> 2) To avoid making customers uncomfortable when they see large 1TB files.
> They often ask whether they have to pay for that size, even though they only
> pay for the space actually used in the page blob.
> I think rationale 2 is a relatively minor issue, because 98% of HBase
> customers will never notice. They will just use it and not look at what kind
> of files are used for the logs. They don't pay for the unused space, so it is
> not a problem for them. We can document this. Also, if they use hadoop fs
> -ls, they will see the actual size of the files, since I put in a fix for that.
> Rationale 1 is a minor issue because you cannot interpret the data on your
> local file system anyway due to the data format. So really, the only reason
> to copy data locally in its binary format would be if you are moving it
> around or archiving it. Copying a 1TB page blob from one location in the
> cloud to another is pretty fast with smart copy utilities that don't actually
> move the 0-filled parts of the file.
> Nevertheless, this is a convenience feature for users. They won't have to
> worry about setting fs.azure.page.blob.size under normal circumstances, and
> they can let the files grow as big as they want.
> If we make the change to extend the file size on the fly, that introduces new
> possible error or failure modes for HBase, so we should include retry logic
> (see the sketch below).
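>
> As a rough sketch of what that retry could look like (bounded retries with a
> simple linear backoff; the interface and parameter names here are
> illustrative, not an existing hadoop-azure or Azure SDK API):
>
>   // Sketch only: bounded retry with linear backoff around the resize call.
>   final class ResizeRetrySketch {
>     interface Resize {                     // placeholder for the actual resize operation
>       void run() throws java.io.IOException;
>     }
>
>     static void resizeWithRetry(Resize resize, int maxAttempts, long backoffMillis)
>         throws java.io.IOException {
>       java.io.IOException last = null;
>       for (int attempt = 1; attempt <= maxAttempts; attempt++) {
>         try {
>           resize.run();
>           return;                          // success
>         } catch (java.io.IOException e) {
>           last = e;                        // assume transient; back off and try again
>           try {
>             Thread.sleep(backoffMillis * attempt);
>           } catch (InterruptedException ie) {
>             Thread.currentThread().interrupt();
>             throw new java.io.IOException("interrupted while retrying resize", ie);
>           }
>         }
>       }
>       throw last;                          // retries exhausted; surface the last error
>     }
>   }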
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)