[ 
https://issues.apache.org/jira/browse/HDFS-5698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13883580#comment-13883580
 ] 

Haohui Mai commented on HDFS-5698:
----------------------------------

I took a fsimage from a production cluster, and scaled it down to different 
sizes to evaluate the performance and the size impact.

I ran the test on a machine that has an 8-core Xeon E5530 CPU @ 2.4GHz, 24G 
memory, 2TB SATA 3 drive @ 7200 rpm. The machine is running RHEL 6.2, Java 1.6. 
The JVM has a maximum heap size of 20G, and it runs the concurrent mark and 
sweep GC.

Here are the numbers:

||Size in Old||512M||1G||2G||4G||8G||
|Size in PB|469M|950M|1.9G|3.7G|7.0G|
|Saving in Old (ms)|14678|28991|60520|96894|160878|
|Saving in PB (ms)|14709|16746|32623|83645|168617|
|Loading in Old (ms)|12819|24664|48240|114090|307689|
|Loading in PB (ms)|28268|43205|87060|266681|491605|

The first two rows show the size of the fsimage in the old and the new 
(protobuf) format respectively. The third and fourth rows show the time to 
save the fsimage in the two formats, and the last two rows show the time to 
load it in the two formats.

The new fsimage format is slightly more compact, and in most cases the code 
writes it slightly faster. Loading the new format is currently slower. 
However, in the new format most of the loading process can be parallelized. I 
plan to introduce this feature after the branch is merged.
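To make the parallelization idea concrete, here is a minimal sketch of loading independent image sections concurrently. All names here (`Section`, `loadSection`, `loadAll`) are hypothetical illustrations, not the actual HDFS loader; it only assumes the new format records sections that can be deserialized independently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: if each section of the image (e.g. inodes,
// snapshots) is self-contained, a thread pool can deserialize them
// concurrently instead of one after another.
public class ParallelSectionLoader {
    // Stand-in for one serialized section of the fsimage.
    record Section(String name, byte[] data) {}

    static String loadSection(Section s) {
        // A real loader would parse the protobuf payload here;
        // this stub just reports the section name and size.
        return s.name() + ":" + s.data().length;
    }

    public static List<String> loadAll(List<Section> sections) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (Section s : sections) {
                futures.add(pool.submit(() -> loadSection(s)));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                results.add(f.get()); // preserve section order
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<Section> sections = List.of(
            new Section("INODE", new byte[16]),
            new Section("SNAPSHOT", new byte[8]));
        System.out.println(loadAll(sections)); // prints [INODE:16, SNAPSHOT:8]
    }
}
```

The old format interleaves records in a single stream, so this kind of section-level concurrency is not possible there.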

> Use protobuf to serialize / deserialize FSImage
> -----------------------------------------------
>
>                 Key: HDFS-5698
>                 URL: https://issues.apache.org/jira/browse/HDFS-5698
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Haohui Mai
>            Assignee: Haohui Mai
>         Attachments: HDFS-5698.000.patch, HDFS-5698.001.patch
>
>
> Currently, the code serializes the FSImage using an in-house serialization 
> mechanism. There are a couple of disadvantages to the current approach:
> # It mixes the responsibilities of reconstruction and serialization / 
> deserialization. The current serialization / deserialization code paths 
> spend a lot of effort on maintaining compatibility. Worse, they are 
> intertwined with the complex logic of reconstructing the namespace, 
> making the code difficult to follow.
> # The current FSImage format is poorly documented. The format is 
> practically defined by the implementation, so a bug in the implementation 
> is effectively a bug in the specification. This also makes writing 
> third-party tools quite difficult.
> # Changing the schema is non-trivial. Adding a field to the FSImage 
> requires bumping the layout version every time. Bumping the layout version 
> requires (1) users to explicitly upgrade their clusters, and (2) new code 
> to maintain backward compatibility.
> This jira proposes to use protobuf to serialize the FSImage. Protobuf is 
> already used to serialize / deserialize RPC messages in Hadoop.
> Protobuf addresses all of the above problems. It cleanly separates the 
> responsibility of serialization from reconstructing the namespace. The 
> protobuf files document the current format of the FSImage. Developers can 
> now add optional fields with ease, since old code can always read the new 
> FSImage.
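
The last point in the description can be illustrated with a schema fragment. This is a hypothetical sketch, not the actual HDFS-5698 schema; the message and field names here are invented:

```proto
// Hypothetical sketch of one fsimage record; not the real HDFS schema.
message INodeFile {
  optional uint64 id = 1;
  optional string name = 2;
  optional uint64 modification_time = 3;
  // A new optional field is appended with a fresh tag number.
  // Old readers skip unknown fields, so no layout-version bump is
  // required, and new readers of old images see the field as unset.
  optional uint32 storage_policy = 4;
}
```

This "skip unknown fields" behavior is what lets old code read new images, in contrast to the in-house format where any layout change forces a version bump.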



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
