[
https://issues.apache.org/jira/browse/FLUME-2128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13743382#comment-13743382
]
Ted Malaska commented on FLUME-2128:
------------------------------------
Ok I reviewed the code. First I will list what I found, then I will give my
recommendation for a fix to see if it gets approved.
What I found:
1) rollSize is passed to BucketWriter
2) It is compared with processSize, which counts the pre-compression bytes
3) The HDFSWriter interface doesn't support a length query
4) There are only three direct implementations of the HDFSWriter interface, and
two of them are test implementations
5) The one non-test implementation is an abstract class, with HDFSSequenceFile
and HDFSDataStream extending it
6) Of the two child implementations, only HDFSSequenceFile supports
compression
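To illustrate findings 1 and 2, here is a minimal standalone sketch, not Flume code. It counts event bytes before they reach a compressing stream, the way a pre-compression counter would, and compares that count with what actually lands in the sink. The class name RollSizeDemo, the method writeEvents, and the use of gzip over an in-memory buffer are all assumptions made for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class; it only mimics the counting pattern described above.
public class RollSizeDemo {

    /** Returns {preCompressionBytes, bytesInSink} after writing `events` copies of `event`. */
    static long[] writeEvents(String event, int events) throws IOException {
        byte[] body = event.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream sink = new ByteArrayOutputStream(); // stands in for the HDFS file
        long processSize = 0;                                     // counts bytes before compression
        try (GZIPOutputStream gzip = new GZIPOutputStream(sink)) {
            for (int i = 0; i < events; i++) {
                gzip.write(body);
                processSize += body.length;
            }
        } // closing the stream flushes the remaining compressed data into the sink
        return new long[] { processSize, sink.size() };
    }

    public static void main(String[] args) throws IOException {
        long rollSize = 64 * 1024; // assumed threshold for the example
        long[] sizes = writeEvents("2013-08-18 12:00:00 INFO request handled in 12 ms\n", 2000);
        System.out.println("pre-compression bytes: " + sizes[0]);
        System.out.println("bytes in the sink:     " + sizes[1]);
        // The roll decision uses the first number, so the file rolls while the
        // compressed output is still far below rollSize.
        System.out.println("roll triggered:        " + (sizes[0] >= rollSize));
    }
}
```

With repetitive log lines like these, the sink holds far fewer bytes than the counter reports, which is the mismatch the issue describes.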
What I'm thinking in terms of a fix:
1) Add a new config called rollCompressedSize
2) Add getLength to HDFSWriter
3) Implement the getLength method in AbstractHDFSWriter to return the
uncompressed number of bytes written
4) Override getLength in the HDFSSequenceFile implementation to return the true
(compressed) length
5) Update the test implementations and add unit tests for the new rollover
logic.
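A rough sketch of how steps 2 through 4 could fit together, written as a standalone example and not against the real Flume classes. LengthAwareWriter, CountingWriter, PlainWriter, GzipWriter and shouldRoll are hypothetical stand-ins for HDFSWriter, AbstractHDFSWriter, HDFSDataStream, HDFSSequenceFile and the BucketWriter check; gzip over an in-memory buffer is assumed in place of a compressed sequence file.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch; none of these types are part of Flume.
public class RollCompressedSizeSketch {

    /** Stand-in for the writer interface with the proposed getLength method. */
    interface LengthAwareWriter {
        void append(byte[] event) throws IOException;
        long getLength();
    }

    /** Stand-in for the abstract base: by default reports uncompressed bytes written. */
    abstract static class CountingWriter implements LengthAwareWriter {
        protected long uncompressed = 0;

        @Override
        public void append(byte[] event) throws IOException {
            uncompressed += event.length;
            write(event);
        }

        protected abstract void write(byte[] event) throws IOException;

        @Override
        public long getLength() {
            return uncompressed;
        }
    }

    /** Uncompressed writer: inherits the default length. */
    static class PlainWriter extends CountingWriter {
        private final ByteArrayOutputStream sink = new ByteArrayOutputStream();

        @Override
        protected void write(byte[] event) {
            sink.write(event, 0, event.length);
        }
    }

    /** Compressing writer: overrides getLength to report bytes actually emitted. */
    static class GzipWriter extends CountingWriter {
        private final ByteArrayOutputStream sink = new ByteArrayOutputStream();
        private final GZIPOutputStream gzip;

        GzipWriter() throws IOException {
            // syncFlush=true makes flush() push pending compressed data to the sink
            gzip = new GZIPOutputStream(sink, true);
        }

        @Override
        protected void write(byte[] event) throws IOException {
            gzip.write(event);
            gzip.flush(); // flushing per event keeps the length current but costs compression ratio
        }

        @Override
        public long getLength() {
            return sink.size();
        }
    }

    /** Roll check against the assumed rollCompressedSize setting; 0 disables it. */
    static boolean shouldRoll(LengthAwareWriter writer, long rollCompressedSize) {
        return rollCompressedSize > 0 && writer.getLength() >= rollCompressedSize;
    }

    public static void main(String[] args) throws IOException {
        byte[] event = "2013-08-18 12:00:00 INFO request handled in 12 ms\n"
                .getBytes(StandardCharsets.UTF_8);
        LengthAwareWriter plain = new PlainWriter();
        LengthAwareWriter gzip = new GzipWriter();
        for (int i = 0; i < 1000; i++) {
            plain.append(event);
            gzip.append(event);
        }
        System.out.println("plain length: " + plain.getLength());
        System.out.println("gzip length:  " + gzip.getLength());
        System.out.println("plain rolls:  " + shouldRoll(plain, 32 * 1024));
        System.out.println("gzip rolls:   " + shouldRoll(gzip, 32 * 1024));
    }
}
```

The point of putting getLength on the interface is that the roll check no longer needs to know whether a writer compresses; each implementation reports whatever it really wrote.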
Let me know what you think.
> HDFS Sink rollSize is calculated based off of uncompressed size of cumulative
> events.
> -------------------------------------------------------------------------------------
>
> Key: FLUME-2128
> URL: https://issues.apache.org/jira/browse/FLUME-2128
> Project: Flume
> Issue Type: Bug
> Components: Sinks+Sources
> Affects Versions: v1.4.0, v1.3.1
> Reporter: Jeff Lord
>
> The hdfs sink rollSize parameter is compared against uncompressed event sizes.
> The net of this is that if you are using compression and expect the size of
> your files on HDFS to be rolled/sized based on the value set for rollSize
> then your files will be much smaller due to compression.
> We should take into account when compression is set and roll based on the
> compressed size on hdfs.