Eric Hanson created HADOOP-11188:
------------------------------------

             Summary: hadoop-azure: automatically expand page blobs when they become full
                 Key: HADOOP-11188
                 URL: https://issues.apache.org/jira/browse/HADOOP-11188
             Project: Hadoop Common
          Issue Type: Improvement
          Components: fs
            Reporter: Eric Hanson


Right now, page blobs are initialized to a fixed size (fs.azure.page.blob.size) 
and cannot be expanded. This task is to make them automatically expand when 
they get to be nearly full.

Design: if a write occurs that does not have enough room left in the file to 
finish, then flush all preceding operations, extend the file, and complete the 
write. Access to PageBlobOutputStream will be synchronized (exclusive access) 
during this sequence so there are no race conditions.
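
Below is a minimal sketch of that write path, assuming hypothetical 
flushPendingOperations(), extendCapacity() and uploadPages() helpers inside the 
output stream; the class, fields and helper names are illustrative only, not the 
actual hadoop-azure code.

public class PageBlobOutputStreamSketch extends java.io.OutputStream {
  private long currentCapacity;  // bytes currently allocated to the page blob
  private long bytesWritten;     // bytes written so far

  @Override
  public synchronized void write(byte[] data, int off, int len)
      throws java.io.IOException {
    if (bytesWritten + len > currentCapacity) {
      // Not enough room left: flush everything queued ahead of this write,
      // then grow the blob before completing the write.
      flushPendingOperations();
      extendCapacity();
    }
    uploadPages(data, off, len);
    bytesWritten += len;
  }

  @Override
  public synchronized void write(int b) throws java.io.IOException {
    write(new byte[] { (byte) b }, 0, 1);
  }

  // Hypothetical stand-ins for the real upload and extend machinery.
  private void flushPendingOperations() throws java.io.IOException { }
  private void extendCapacity() throws java.io.IOException {
    currentCapacity += 128L * 1024 * 1024;  // see the sizing rules below
  }
  private void uploadPages(byte[] data, int off, int len)
      throws java.io.IOException { }
}

Synchronizing both write overloads gives the exclusive access mentioned above, 
so the flush-extend-write sequence cannot interleave with another writer on the 
same stream.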

The file will be extended by fs.azure.page.blob.extension.size bytes, which 
must be a multiple of 512. The internal default for 
fs.azure.page.blob.extension.size will be 128 * 1024 * 1024. The minimum 
extension size will be 4 * 1024 * 1024, which is the maximum write size, so the 
new write is guaranteed to finish.

Extension will stop when the file size reaches 1TB. The final extension may be 
less than fs.azure.page.blob.extension.size if the remainder (1TB - 
current_file_size) is smaller than fs.azure.page.blob.extension.size.
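
Under those constraints, the next extension amount could be computed roughly as 
in the sketch below; the class and constant names are illustrative, and only the 
512-byte page granularity and the 1TB page blob cap come from Azure Storage 
itself.

final class PageBlobExtensionSizeSketch {
  static final long PAGE_SIZE = 512L;
  static final long ONE_TB = 1024L * 1024 * 1024 * 1024;     // page blob cap
  static final long MIN_EXTENSION = 4L * 1024 * 1024;        // maximum single write
  static final long DEFAULT_EXTENSION = 128L * 1024 * 1024;  // proposed default

  // Returns how many bytes to extend the blob by, or 0 if it is already at 1TB.
  // configuredExtension corresponds to fs.azure.page.blob.extension.size.
  static long nextExtension(long currentSize, long configuredExtension) {
    long extension = Math.max(configuredExtension, MIN_EXTENSION);
    // Round up to a multiple of the 512-byte page size.
    extension = ((extension + PAGE_SIZE - 1) / PAGE_SIZE) * PAGE_SIZE;
    // The final extension may be smaller than configured so the blob never
    // grows past 1TB.
    return Math.min(extension, ONE_TB - currentSize);
  }
}

For example, with 2 MB of room left before the 1TB cap, nextExtension returns 
2 * 1024 * 1024 rather than the configured 128 * 1024 * 1024.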

An alternative is to make the default size 1TB, which is much simpler to 
implement (a one-line change). Or, even simpler, don't change anything, since 
the current default is adequate for HBase.

Rationale for this file size extension feature:

1) be able to download files to local disk easily with CloudXplorer and similar 
tools. Downloading a 1TB page blob is not practical if you don't have 1TB of 
disk space, since on the local side it expands to the full file size, filled 
with zeros where there is no valid data.

2) don't make customers uncomfortable when they see large 1TB files. They often 
ask whether they have to pay for the full size, even though they only pay for 
the space actually used in the page blob.

I think rationale 2 is a relatively minor issue, because 98% of customers for 
HBase will never notice. They will just use it and not look at what kind of 
files are used for the logs. They don't pay for the unused space, so it is not 
a problem for them. We can document this. Also, if they use hadoop fs -ls, they 
will see the actual size of the files since I put in a fix for that.

Rationale 1 is a minor issue because you cannot interpret the data on your 
local file system anyway due to the data format. So really, the only reason to 
copy data locally in its binary format would be if you are moving it around or 
archiving it. Copying a 1TB page blob from one location in the cloud to another 
is pretty fast with smart copy utilities that don't actually move the 0-filled 
parts of the file.

Nevertheless, this is a convenience feature for users. They won't have to worry 
about setting fs.azure.page.blob.size under normal circumstances and can make 
the files grow as big as they want.

If we make the change to extend the file size on the fly, that introduces new 
possible error or failure modes for HBase. We should include retry logic.
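
For illustration only, retry around the resize call could look roughly like the 
sketch below; the attempt count, backoff, and Resizer hook are placeholders, not 
a proposed policy or the real storage client API.

final class BlobExtensionRetrySketch {
  // Hypothetical hook standing in for the real resize call to the storage service.
  interface Resizer {
    void resize(long newSize) throws java.io.IOException;
  }

  static void extendWithRetry(Resizer resizer, long newSize)
      throws java.io.IOException {
    final int maxAttempts = 3;            // placeholder
    java.io.IOException lastError = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        resizer.resize(newSize);
        return;                           // extension succeeded
      } catch (java.io.IOException e) {
        lastError = e;                    // remember the failure and retry
        try {
          Thread.sleep(1000L * attempt);  // simple linear backoff
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
          throw new java.io.IOException("Interrupted while retrying extension", ie);
        }
      }
    }
    throw lastError;                      // all attempts failed
  }
}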



