[ https://issues.apache.org/jira/browse/HADOOP-3315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12631905#action_12631905 ]
Hong Tang commented on HADOOP-3315:
-----------------------------------
A preliminary study comparing the performance of TFile and SequenceFile.
Settings:
- OS: RHEL AS4 Nahant Update 2. Linux 2.6.9-55.ELsmp.
- Hardware: dual 2GHz CPU, 4GB main memory, WD Caviar 400GB (with 60GB free
space).
- Key length: uniform random 50-100B.
- Value length: uniform random 5K-10K.
- Both keys and values are composed using a "dictionary" of 1000 "words";
each word's length is uniformly distributed between 5-20B (see the generator
sketch after the finer details below).
- Compression schemes: none, lzo, and gz. The # of <key, value> pairs is 600K,
3M, and 3M for the three cases, and the output file sizes are 4.4G, 10G, and
6G respectively. The files are large enough to eliminate any file caching
effect.
- Used SequenceFile and TFile to implement two common interfaces, one for
writing and one for reading, as follows:
{code}
private interface KVAppendable {
  public void append(BytesWritable key, BytesWritable value)
      throws IOException;
  public void close() throws IOException;
}

private interface KVReadable {
  public byte[] getKey();
  public byte[] getValue();
  public int getKeyLength();
  public int getValueLength();
  public boolean next() throws IOException;
  public void close() throws IOException;
}
{code}
Both interfaces allow efficient implementations by either TFile or
SequenceFile, avoiding any object creation or buffer copying just to conform
to the interface.
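For illustration, the SequenceFile side of KVAppendable can be as thin as the
following sketch (not the actual benchmark code; the writer construction shown
is just one plausible setup, and it assumes the KVAppendable interface above
is visible):
{code}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;

// Sketch only: adapts SequenceFile.Writer to the KVAppendable interface above.
class SeqFileAppendable implements KVAppendable {
  private final SequenceFile.Writer writer;

  public SeqFileAppendable(FileSystem fs, Configuration conf, Path path,
      SequenceFile.CompressionType compressionType) throws IOException {
    writer = SequenceFile.createWriter(fs, conf, path,
        BytesWritable.class, BytesWritable.class, compressionType);
  }

  // The BytesWritable pair is handed straight to the writer, so the adapter
  // itself adds no buffer copies and creates no objects per append.
  public void append(BytesWritable key, BytesWritable value)
      throws IOException {
    writer.append(key, value);
  }

  public void close() throws IOException {
    writer.close();
  }
}
{code}
Passing the BytesWritable pair straight through is what keeps adapter overhead
out of the timings.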
Some finer details:
- For writing, the timing includes the append() calls (so TFile meta data
writing is not included); it also includes the time used to compose keys and
values. The same seed is used to construct the Random object that does the
key/value composition.
- For reading, the timing includes the next() call followed by getKeyLength()
and getValueLength(). (getKey() and getValue() simply return internally cached
key/value buffers, and are O(1) operations.)
- For each compression scheme, I run and time the following tasks: create
SeqFile, read SeqFile, create TFile, read TFile, create TFile, read TFile,
create SeqFile, read SeqFile (i.e., each task twice). Then I pick the better
performance of the two runs for each task. This removes possible effects from
JVM hotspot compilation, garbage collection, and/or occasional host-related
activities (I have seen tar and yum show up in top from time to time).
- Memory footprint: SeqFile needs to cache a full block of uncompressed data
and a full block of compressed data, each on the order of 1MB, so its total
buffering is about 2MB. For TFile, the block size is set to 10MB, but the
amount of buffering is 4KB (buffering for small writes before the
compression/decompression stream) + 256KB (FS read/write buffering) + 1MB (for
writes when the value length is not known up front; this is also tunable), or
roughly 1.3MB in total. The two are comparable, but TFile's buffering is more
tunable.
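For reference, the key/value composition described above can be sketched as
follows (a minimal sketch under the stated settings; the class and method
names are hypothetical, not the benchmark code):
{code}
import java.util.Random;
import org.apache.hadoop.io.BytesWritable;

// Hypothetical sketch of the key/value composition: a fixed dictionary of
// 1000 random "words" of 5-20B each, concatenated up to the target length.
class KVGenerator {
  private final Random rng;                       // same seed for every run
  private final byte[][] dict = new byte[1000][];

  public KVGenerator(long seed) {
    rng = new Random(seed);
    for (int i = 0; i < dict.length; ++i) {
      dict[i] = new byte[5 + rng.nextInt(16)];    // word length in [5, 20]
      rng.nextBytes(dict[i]);
    }
  }

  // Fills buf with dictionary words up to a length drawn from [min, max];
  // keys use [50, 100], values use [5K, 10K].
  public void fill(BytesWritable buf, int min, int max) {
    int len = min + rng.nextInt(max - min + 1);
    buf.setSize(0);
    while (buf.getLength() < len) {
      byte[] word = dict[rng.nextInt(dict.length)];
      int n = Math.min(word.length, len - buf.getLength());
      int off = buf.getLength();
      buf.setSize(off + n);
      System.arraycopy(word, 0, buf.getBytes(), off, n);
    }
  }
}
{code}
Because the Random is seeded identically, the SeqFile and TFile runs compose
an identical key/value stream, so composition cost is the same constant in
both timings.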
Finally, the results (Eff BW is measured against the uncompressed data size;
I/O BW against the on-disk file size):
|| || SeqFile-none || TFile-none || SeqFile-lzo || TFile-lzo || SeqFile-gz || TFile-gz ||
| Data sizes (MB) | 4435.98 | 4435.98 | 22179.13 | 22179.13 | 22179.13 | 22179.13 |
| File sizes (MB) | 4456.58 | 4438.31 | 10080.23 | 10063.48 | 6236.91 | 5943.07 |
| Write Eff BW (MB/s) | 36.86 | 35.53 | 38.46 | 39.97 | 13.59 | 13.54 |
| Write I/O BW (MB/s) | 37.03 | 35.54 | 17.48 | 18.14 | 3.82 | 3.63 |
| Read Eff BW (MB/s) | 41.13 | 40.16 | 86.77 | 91.04 | 52.73 | 75.15 |
| Read I/O BW (MB/s) | 41.32 | 40.18 | 39.44 | 41.31 | 14.83 | 20.14 |
Things to notice:
- In most cases, SeqFile and TFile performance are similar.
- TFile sizes are usually smaller than SeqFile sizes: SeqFile encodes each
length using 4B, while TFile uses VInt (see the size sketch at the end of this
comment). The setup of the benchmark favors TFile in this regard; fixed
key/value sizes may make SeqFile smaller.
- For no compression, SeqFile outperforms TFile, because both need only one
layer of buffering and the TFile interface requires the creation of more small
objects to set up.
- For lzo and gz compression, TFile outperforms SeqFile due to the reduction
of extra buffer copying.
- Particularly for gz compression, the read performance of TFile is 42%
faster. The reason is that DecompressorStream always reads as much data as
possible from the downstream using its internal buffer size, and I took
advantage of that by skipping my own FS buffering, so bulk reads (reading a
value) incur the least amount of buffer copying (see the sketch after this
list). This is not true for lzo, which uses block compression; its
BlockDecompressorStream always reads small blocks on the order of 20KB.
Reduced buffer copying saves CPU cycles, and in the gz case CPU is the
bottleneck.
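To illustrate the FS-buffering point, the gz read path can layer the
decompressor directly over the raw file stream, roughly as below (a sketch of
the idea only, not the actual TFile code):
{code}
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

class DirectGzRead {
  // DecompressorStream issues large reads against whatever stream sits below
  // it, so inserting a BufferedInputStream in between only adds one more copy
  // on the path of a bulk (multi-KB) value read.
  static InputStream open(FileSystem fs, Path path, Configuration conf)
      throws IOException {
    CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);
    return codec.createInputStream(fs.open(path));  // no extra buffer layer
  }

  // Bulk value read: pulls exactly len bytes straight out of the
  // decompressor's internal buffer into the caller's buffer.
  static void readFully(InputStream in, byte[] buf, int len)
      throws IOException {
    for (int off = 0; off < len; ) {
      int n = in.read(buf, off, len - off);
      if (n < 0) throw new EOFException();
      off += n;
    }
  }
}
{code}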
The above results are still preliminary. No YourKit profiling has been done on
the TFile side, and results could vary under different settings: different
key/value lengths, compression ratios, underlying I/O speeds, etc.
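As a footnote to the file-size comparison above, WritableUtils shows the VInt
cost for the lengths used in this benchmark (a quick sketch; the two in-range
lengths are chosen arbitrarily):
{code}
import org.apache.hadoop.io.WritableUtils;

public class VIntSizes {
  public static void main(String[] args) {
    // A key length in [50, 100] fits in a single VInt byte; a value length
    // in [5K, 10K] takes three bytes; a fixed encoding takes 4B either way.
    System.out.println(WritableUtils.getVIntSize(75));    // prints 1
    System.out.println(WritableUtils.getVIntSize(7500));  // prints 3
  }
}
{code}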
> New binary file format
> ----------------------
>
> Key: HADOOP-3315
> URL: https://issues.apache.org/jira/browse/HADOOP-3315
> Project: Hadoop Core
> Issue Type: New Feature
> Components: io
> Reporter: Owen O'Malley
> Assignee: Amir Youssefi
> Attachments: HADOOP-3315_20080908_TFILE_PREVIEW_WITH_LZO_TESTS.patch,
> HADOOP-3315_20080915_TFILE.patch, TFile Specification Final.pdf
>
>
> SequenceFile's block compression format is too complex and requires 4 codecs
> to compress or decompress. It would be good to have a file format that only
> needs