[jira] [Commented] (FLUME-2128) HDFS Sink rollSize is calculated based off of uncompressed size of cumulative events.

Ted Malaska (JIRA) Sun, 18 Aug 2013 14:13:30 -0700

    [ 
https://issues.apache.org/jira/browse/FLUME-2128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13743382#comment-13743382
 ]


Ted Malaska commented on FLUME-2128:
------------------------------------

OK, I reviewed the code.  First I will list what I found, then I will give my 
recommendation for a fix to see whether it gets approved.

What I found:
1) rollSize is passed to BucketWriter.
2) It is compared with Process Size, which is the pre-compression byte count.
3) The HDFSWriter interface does not support getting the length.
4) There are only three direct implementations of the HDFSWriter interface, and 
two of them are test implementations.
5) The one non-test implementation is an abstract class, with HDFSSequenceFile 
and HDFSDataStream extending it.
6) Of the two child implementations, only HDFSSequenceFile even supports 
compression.
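
As a rough illustration of findings 1 and 2, here is a minimal standalone 
sketch. It is not the Flume code: RollSizeSketch, shouldRoll and writeEvents 
are hypothetical names, and gzip from the JDK stands in for whatever codec the 
sink uses. It only shows how a counter of pre-compression bytes can trip the 
roll check long before the compressed output reaches rollSize.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration only; none of these names come from Flume.
class RollSizeSketch {

    // Stand-in for the roll check: rollSize is compared with the
    // pre-compression byte count (0 disables size-based rolling).
    static boolean shouldRoll(long rollSize, long processSize) {
        return rollSize > 0 && rollSize <= processSize;
    }

    // Writes the same event n times through gzip and returns
    // { bytes handed to the compressor, bytes actually written out }.
    static long[] writeEvents(byte[] event, int n) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long processSize = 0;
        try (GZIPOutputStream gz = new GZIPOutputStream(sink)) {
            for (int i = 0; i < n; i++) {
                gz.write(event);
                processSize += event.length; // counted before compression
            }
        }
        // The gzip stream is closed here, so sink.size() is the final size.
        return new long[] { processSize, sink.size() };
    }

    public static void main(String[] args) throws IOException {
        byte[] event = "2013-08-18 14:13:30 INFO sample log line\n"
                .getBytes(StandardCharsets.UTF_8);
        long rollSize = 10_000; // assumed value, for illustration
        long[] sizes = writeEvents(event, 1000);
        System.out.println("uncompressed bytes: " + sizes[0]);
        System.out.println("compressed bytes:   " + sizes[1]);
        System.out.println("roll on uncompressed count: "
                + shouldRoll(rollSize, sizes[0]));
        System.out.println("roll on compressed count:   "
                + shouldRoll(rollSize, sizes[1]));
    }
}
```

With repetitive log-style input, the two counts diverge widely, which is the 
behavior the issue describes.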

What I'm thinking in terms of a fix:
1) Add a new config called rollCompressedSize.
2) Add getLength to HDFSWriter.
3) Implement the getLength method in AbstractHDFSWriter to return the 
uncompressed number of bytes written.
4) Override getLength in the HDFSSequenceFile implementation to return the 
true (compressed) length.
5) Update the test implementations and add unit tests for the new rollover 
logic.
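
To make the shape of this proposal concrete, here is a minimal sketch under 
stated assumptions. All class names (SketchWriter, AbstractSketchWriter, 
PlainSketchWriter, CompressedSketchWriter, CountingOutputStream, 
GetLengthSketch) are hypothetical stand-ins, not the Flume classes, and gzip 
from the JDK stands in for the sequence file codec. Only getLength and 
rollCompressedSize are names taken from the list above.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Step 2: hypothetical writer contract with a length accessor.
interface SketchWriter {
    void append(byte[] event) throws IOException;
    void close() throws IOException;
    long getLength();
}

// Counts every byte that reaches the wrapped stream.
class CountingOutputStream extends FilterOutputStream {
    private long count;
    CountingOutputStream(OutputStream out) { super(out); }
    @Override public void write(int b) throws IOException {
        out.write(b);
        count++;
    }
    @Override public void write(byte[] b, int off, int len) throws IOException {
        out.write(b, off, len);
        count += len;
    }
    long getCount() { return count; }
}

// Step 3: the base class reports the uncompressed bytes written.
abstract class AbstractSketchWriter implements SketchWriter {
    protected long uncompressedBytes;
    @Override public long getLength() { return uncompressedBytes; }
}

class PlainSketchWriter extends AbstractSketchWriter {
    private final OutputStream out;
    PlainSketchWriter(OutputStream out) { this.out = out; }
    @Override public void append(byte[] event) throws IOException {
        out.write(event);
        uncompressedBytes += event.length;
    }
    @Override public void close() throws IOException { out.close(); }
}

// Step 4: the compressing writer overrides getLength to report the bytes
// that reached the underlying stream. The compressor buffers, so this
// value lags behind the input until the stream is flushed or closed.
class CompressedSketchWriter extends AbstractSketchWriter {
    private final CountingOutputStream counted;
    private final GZIPOutputStream gz;
    CompressedSketchWriter(OutputStream out) throws IOException {
        counted = new CountingOutputStream(out);
        gz = new GZIPOutputStream(counted);
    }
    @Override public void append(byte[] event) throws IOException {
        gz.write(event);
        uncompressedBytes += event.length;
    }
    @Override public void close() throws IOException { gz.close(); }
    @Override public long getLength() { return counted.getCount(); }
}

class GetLengthSketch {
    // Step 1: roll check driven by the writer's own length (0 disables it).
    static boolean shouldRoll(long rollCompressedSize, SketchWriter writer) {
        return rollCompressedSize > 0
                && rollCompressedSize <= writer.getLength();
    }

    public static void main(String[] args) throws IOException {
        byte[] event = "sample log line\n".getBytes(StandardCharsets.UTF_8);
        SketchWriter plain = new PlainSketchWriter(new ByteArrayOutputStream());
        SketchWriter packed =
                new CompressedSketchWriter(new ByteArrayOutputStream());
        for (int i = 0; i < 1000; i++) {
            plain.append(event);
            packed.append(event);
        }
        plain.close();
        packed.close();
        System.out.println("plain length:      " + plain.getLength());
        System.out.println("compressed length: " + packed.getLength());
    }
}
```

One design point this sketch surfaces: because a compressing stream buffers 
its input, a length read between flushes under-reports the final file size, so 
a roll check based on it can fire somewhat late.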

Let me know what you think.
                
> HDFS Sink rollSize is calculated based off of uncompressed size of cumulative 
> events.
> -------------------------------------------------------------------------------------
>
>                 Key: FLUME-2128
>                 URL: https://issues.apache.org/jira/browse/FLUME-2128
>             Project: Flume
>          Issue Type: Bug
>          Components: Sinks+Sources
>    Affects Versions: v1.4.0, v1.3.1
>            Reporter: Jeff Lord
>
> The hdfs sink rollSize parameter is compared against uncompressed event sizes.
> The net of this is that if you are using compression and expect your files 
> on HDFS to be rolled/sized based on the value set for rollSize, then your 
> files will be much smaller due to compression.
> We should take into account when compression is set and roll based on the 
> compressed size on HDFS.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
