[ https://issues.apache.org/jira/browse/HADOOP-3315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668640#action_12668640 ]

Hong Tang commented on HADOOP-3315:
-----------------------------------

bq. Any advantage to our making a scanner around a start and end key random accessing or, if I read things properly, there is none since we only fetch actual blocks when seekTo is called.

There is no performance advantage, but there is a semantic difference. If you create a range scanner and call seekTo with a key outside the scan range, it will return false even if the key exists in the TFile (see the first sketch at the end of this comment).

bq. And on concurrent access, if we have, say, random accesses concurrent with a couple of whole-file scans, my reading has it that scanners fetch a block just as they need it and then work against this fetched copy. The fetch is 'synchronized', which means lots of seeking around in the file, but otherwise it looks like there is no need for the application to synchronize access to tfile.

Yes, there is no need to synchronize threads accessing the same TFile, as long as each has its own scanner (second sketch below). However, concurrent access is not as performant as it could be, due to the current design of HDFS: if multiple threads scan different regions of the same TFile, the actual IO calls to FSDataInputStream are synchronized. I tried positioned reads (which would avoid synchronizing on reads), but the overhead of frequent connection establishment made the single-threaded case much worse. Connection caching for positioned reads may help (HADOOP-3672); the third sketch below contrasts the two read paths.

bq. Hmm. Looking at doing random accesses, it seems like a bunch of time is spent in inBlockAdvance advancing sequentially through blocks rather than doing something like a binary search to find the desired block location. Also, as we advance, we create and destroy a bunch of objects such as the stream to hold the value. Can you comment on why this is (compression should be on tfile block boundaries, right, so nothing to stop hopping into the midst of a tfile)? Thanks.

inBlockAdvance() sequentially goes through key-value pairs INSIDE one compressed block; the binary search for the desired block is done by Reader.getBlockContainsKey() (last sketch below). The code also handles the case where you want to seek to a key in the later part of the current block. No objects are created when advancing; the value stream is closed to force the code to skip the remaining bytes of the value, in case the application consumed only part of the value bytes.

bq. Are you going to upload another patch? If so, I'll keep my +1 for that.

Will do shortly.
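First, a minimal sketch of the range-scanner semantics. It assumes the Reader/Scanner API roughly as in the current patch (createScanner and seekTo); the file path, keys, and setup are made up for illustration:

{code:java}
// Sketch only: assumes the patch's TFile.Reader/Scanner API; fs, path,
// and conf are hypothetical and presumed in scope.
FSDataInputStream in = fs.open(path);
long length = fs.getFileStatus(path).getLen();
TFile.Reader reader = new TFile.Reader(in, length, conf);

// Whole-file scanner: seekTo succeeds for any key present in the file.
TFile.Reader.Scanner whole = reader.createScanner();
boolean found = whole.seekTo("row-0099".getBytes()); // true if the key exists

// Range scanner over [row-0010, row-0020): seekTo with a key outside
// the range returns false even though the key exists in the TFile.
TFile.Reader.Scanner range =
    reader.createScanner("row-0010".getBytes(), "row-0020".getBytes());
boolean inRange = range.seekTo("row-0099".getBytes()); // false: out of range

whole.close();
range.close();
reader.close();
{code}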
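Second, a sketch of the concurrency point: one shared Reader, one private Scanner per thread, and no application-level locking. The thread-pool scaffolding is illustrative, not from the patch:

{code:java}
// Sketch only: one shared TFile.Reader, a private Scanner per thread.
// No application-level synchronization is needed, but every block fetch
// goes through the shared FSDataInputStream, whose seek+read path is
// synchronized, so the IO itself ends up serialized.
final TFile.Reader reader = new TFile.Reader(in, length, conf);
ExecutorService pool = Executors.newFixedThreadPool(4);
for (int i = 0; i < 4; i++) {
  final byte[] probe = ("row-" + (i * 1000)).getBytes();
  pool.submit(new Runnable() {
    public void run() {
      try {
        TFile.Reader.Scanner scanner = reader.createScanner(); // per thread
        if (scanner.seekTo(probe)) {
          // read the entry from this thread's own scanner
        }
        scanner.close();
      } catch (IOException e) {
        // handle or log
      }
    }
  });
}
pool.shutdown();
{code}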
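Third, the two read paths on FSDataInputStream that the HDFS remark refers to. The stateful seek+read path must be serialized because concurrent scanners share the stream position; the positioned readFully is stateless, but each pread currently pays a connection-setup cost, which is what HADOOP-3672 would address. blockOffset and blockLen below are made-up variables:

{code:java}
byte[] buf = new byte[blockLen]; // blockOffset/blockLen: hypothetical

// Stateful path: the stream position is shared state, so concurrent
// scanners must synchronize around seek+read.
synchronized (in) {
  in.seek(blockOffset);
  in.readFully(buf, 0, blockLen);
}

// Stateless alternative: positioned read, no shared position to protect,
// but today it sets up a fresh connection per call in HDFS.
in.readFully(blockOffset, buf, 0, blockLen);
{code}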
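Finally, an illustrative sketch (not the patch's actual code) of the shape of a seek: a binary search over the first keys of the compressed blocks picks the candidate block, after which inBlockAdvance() walks the entries sequentially inside that single decompressed block:

{code:java}
// Illustrative only: the role Reader.getBlockContainsKey() plays.
// `firstKeys` stands in for the per-block index of first keys; all
// names here are made up.
static int findBlock(List<byte[]> firstKeys, byte[] target,
                     Comparator<byte[]> cmp) {
  int lo = 0, hi = firstKeys.size();
  while (lo < hi) {
    int mid = (lo + hi) >>> 1;
    if (cmp.compare(firstKeys.get(mid), target) <= 0) {
      lo = mid + 1; // block mid starts at or before target
    } else {
      hi = mid;
    }
  }
  return lo - 1; // last block whose first key <= target, or -1 if none
}
// The scanner then decompresses that one block and compares keys
// sequentially until it reaches or passes `target` (inBlockAdvance).
{code}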
> New binary file format
> ----------------------
>
>                 Key: HADOOP-3315
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3315
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: io
>            Reporter: Owen O'Malley
>            Assignee: Amir Youssefi
>             Fix For: 0.21.0
>
>         Attachments: HADOOP-3315_20080908_TFILE_PREVIEW_WITH_LZO_TESTS.patch, HADOOP-3315_20080915_TFILE.patch, hadoop-trunk-tfile.patch, hadoop-trunk-tfile.patch, TFile Specification 20081217.pdf
>
>
> SequenceFile's block compression format is too complex and requires 4 codecs to compress or decompress. It would be good to have a file format that only needs

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.