[
https://issues.apache.org/jira/browse/FLUME-2128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13804918#comment-13804918
]
Mike Percy edited comment on FLUME-2128 at 10/25/13 1:47 AM:
-------------------------------------------------------------
A few thoughts on this (just posted a couple comments inline on review board as
well):
1. Can we keep the behavior standard for all types of writers? I.e., no new
config options if at all possible, and if one HDFSWriter implementation uses
the file system in some way, then all the other implementations should too.
2. There is a flaw in this implementation: most of the data written so far in
the batch will still be held in the encoder's buffer, so this approach won't
work in many real-world cases. GZip is especially guilty of keeping large
amounts of data in memory and rarely flushing it to disk (see the first sketch
after this list).
3. Even if this approach did work, it requires some back-of-the-envelope
calculation on the part of the operator, because the setting acts as a
lower bound instead of an upper bound. Nothing new, it's always been like
that... But it would be awesome for this to be an upper bound, so you could
just set it to your HDFS block size and forget it, or even have a config
option to roll at the detected block size. If we want an upper bound, it
would need to be done probabilistically, so that most of the time we don't go
over, which I believe should be fine for most use cases if it holds close to
99% of the time. We could keep some in-memory statistics about how much a
given HDFS Sink configuration tends to write to disk relative to the number
of input bytes in the batch (some kind of histogram). Then use a confidence
interval to determine whether we are reasonably sure the next write won't put
us over the top: if so, go ahead and write the event (presumably within the
current batch); otherwise, roll. A rough sketch of such an estimator is the
second sketch after this list.
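To illustrate point 2, here is a minimal, standalone sketch (plain JDK, not
Flume code; the class name is made up) showing how a gzip stream holds most of
a batch in its deflater buffer, so counting the bytes that have reached the
underlying stream badly underestimates the eventual file size until the stream
is finished:
{code:java}
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Standalone demo: how much of a "written" batch is still sitting in the
// deflater rather than on disk?
public class CompressedSizeLagDemo {
  public static void main(String[] args) throws IOException {
    ByteArrayOutputStream sink = new ByteArrayOutputStream(); // stands in for the HDFS stream
    GZIPOutputStream gzip = new GZIPOutputStream(sink);

    byte[] event = "a fairly repetitive flume event body\n".getBytes(StandardCharsets.UTF_8);
    long uncompressedBytes = 0;
    for (int i = 0; i < 10000; i++) {
      gzip.write(event);              // this is the number rollSize currently tracks
      uncompressedBytes += event.length;
    }

    // The bytes that have actually reached the underlying stream so far are a
    // small fraction of the input; most of it is buffered inside the deflater.
    System.out.println("uncompressed bytes written   : " + uncompressedBytes);
    System.out.println("bytes on 'disk' before finish: " + sink.size());

    gzip.finish();                    // forces the remaining compressed data out
    System.out.println("bytes on 'disk' after finish : " + sink.size());
  }
}
{code}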
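And a rough sketch of the point 3 idea, assuming we track the observed
compressed-to-uncompressed ratio per batch with a running mean and variance
(all names here are hypothetical, not an existing Flume API):
{code:java}
// Hypothetical estimator for the probabilistic upper-bound roll: track the
// observed compressed/uncompressed ratio per batch and roll before a write
// that is likely to push the file past rollSize.
public class CompressionRatioEstimator {
  private long count = 0;
  private double mean = 0.0;  // running mean of the ratio (Welford's algorithm)
  private double m2 = 0.0;    // running sum of squared deviations from the mean

  /** Record the observed ratio for one completed, synced batch. */
  public void observeBatch(long inputBytes, long compressedBytesWritten) {
    if (inputBytes <= 0) {
      return;
    }
    double ratio = (double) compressedBytesWritten / inputBytes;
    count++;
    double delta = ratio - mean;
    mean += delta / count;
    m2 += delta * (ratio - mean);
  }

  /**
   * Upper-confidence estimate of the compressed bytes the next batch will add.
   * z = 2.33 is roughly the 99th percentile of a normal distribution, matching
   * the "close to 99% of the time" goal above.
   */
  public long estimateCompressedUpperBound(long inputBytes, double z) {
    if (count < 2) {
      return inputBytes;  // no history yet: conservatively assume no compression
    }
    double stddev = Math.sqrt(m2 / (count - 1));
    return (long) Math.ceil(inputBytes * (mean + z * stddev));
  }

  /** True if appending the next batch would probably push the file past rollSize. */
  public boolean shouldRollBeforeWrite(long compressedBytesSoFar,
                                       long nextBatchInputBytes,
                                       long rollSize) {
    long predicted = estimateCompressedUpperBound(nextBatchInputBytes, 2.33);
    return compressedBytesSoFar + predicted > rollSize;
  }
}
{code}
The sink would call observeBatch() after each flush/sync and
shouldRollBeforeWrite() before appending the next batch; a histogram of ratios
would work just as well as a mean/stddev if the distribution turns out not to
be normal.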
Please let me know what you guys think of the above idea.
was (Author: mpercy):
A few thoughts on this (just posted a couple comments inline on review board as
well):
1. Can we make this the standard behavior for all types of writers?
2. There is a flaw in this implementation: most of the data written so far in
the batch will still be held in the encoder's buffer, so this approach won't
work in many real-world cases. GZip is especially guilty of keeping large
amounts of data in memory and rarely flushing it to disk.
3. Even if this approach did work, it requires some back-of-the-envelope
calculation on the part of the operator, because the setting acts as a
lower bound instead of an upper bound. Nothing new, it's always been like
that... But it would be awesome for this to be an upper bound, so you could
just set it to your HDFS block size and forget it, or even have a config
option to roll at the detected block size. If we want an upper bound, it
would need to be done probabilistically, so that most of the time we don't go
over, which I believe should be fine for most use cases if it holds close to
99% of the time. We could keep some in-memory statistics about how much a
given HDFS Sink configuration tends to write to disk relative to the number
of input bytes in the batch (some kind of histogram). Then use a confidence
interval to determine whether we are reasonably sure the next write won't put
us over the top: if so, go ahead and write the event (presumably within the
current batch); otherwise, roll.
Please let me know what you guys think of the above idea.
> HDFS Sink rollSize is calculated based off of uncompressed size of cumulative
> events.
> -------------------------------------------------------------------------------------
>
> Key: FLUME-2128
> URL: https://issues.apache.org/jira/browse/FLUME-2128
> Project: Flume
> Issue Type: Bug
> Components: Sinks+Sources
> Affects Versions: v1.4.0, v1.3.1
> Reporter: Jeff Lord
> Assignee: Ted Malaska
> Labels: features
> Attachments: FLUME-2128-0.patch, FLUME-2128-1.patch
>
>
> The hdfs sink rollSize parameter is compared against uncompressed event sizes.
> The net effect is that if you are using compression and expect your files on
> HDFS to be rolled/sized based on the value set for rollSize, then your files
> will be much smaller than expected due to compression.
> We should take compression into account when it is enabled and roll based on
> the compressed size on HDFS.
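For reference, a configuration like the following (hedged example; the agent
and sink names are placeholders) is where the mismatch shows up: rollSize is
compared against uncompressed event bytes, so the gzip files that land on HDFS
come out well under the intended 128 MB.
{code}
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = CompressedStream
a1.sinks.k1.hdfs.codeC = gzip
# intended file size: 128 MB, but compared against uncompressed bytes
a1.sinks.k1.hdfs.rollSize = 134217728
# disable count- and time-based rolling so only rollSize applies
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.rollInterval = 0
{code}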
--
This message was sent by Atlassian JIRA
(v6.1#6144)