[
https://issues.apache.org/jira/browse/FLUME-2128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13804918#comment-13804918
]
Mike Percy edited comment on FLUME-2128 at 10/25/13 1:47 AM:
-------------------------------------------------------------
A few thoughts on this (just posted a couple comments inline on review board as
well):
1. Can we keep the behavior standard for all types of writers? I.e., no new
config options if at all possible, and if one HDFSWriter implementation uses
the file system in some way, then all the other implementations should too.
2. There is a flaw in this implementation: most of the data written so far in
the batch will still be held in the encoder's buffer, so this approach won't
work in many real-world cases. GZip is especially guilty of keeping large
amounts of data in memory and rarely flushing it to disk (see the first sketch
after this list).
3. Even if this approach did work, it requires some back-of-the-envelope
calculation on the part of the operator, because the setting acts as a
lower bound instead of an upper bound. Nothing new, it's always been like
that... But it would be awesome for this to be an upper bound, so you could
just set it to your HDFS block size and forget it, or even have a config
option to roll at the detected block size. If we want an upper bound, it
would need to be done probabilistically, so that most of the time we don't go
over, which I believe should be fine for most use cases if it holds close to
99% of the time. We could keep some in-memory statistics about how much a
given HDFS Sink configuration tends to write to disk relative to the number
of input bytes in the batch (some kind of histogram). Then use a confidence
interval to determine whether we are reasonably sure the next write won't put
us over the top: if so, go ahead and write the event (presumably within the
current batch); otherwise, roll. A rough sketch of such an estimator is the
second sketch after this list.
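To illustrate point 2, here is a minimal, standalone sketch (plain JDK, not
Flume code; the class name is made up) showing how a gzip stream holds most of
a batch in its deflater buffer, so counting the bytes that have reached the
underlying stream badly underestimates the eventual file size until the stream
is finished:
{code:java}
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Standalone demo: how much of a "written" batch is still sitting in the
// deflater rather than on disk?
public class CompressedSizeLagDemo {
  public static void main(String[] args) throws IOException {
    ByteArrayOutputStream sink = new ByteArrayOutputStream(); // stands in for the HDFS stream
    GZIPOutputStream gzip = new GZIPOutputStream(sink);

    byte[] event = "a fairly repetitive flume event body\n".getBytes(StandardCharsets.UTF_8);
    long uncompressedBytes = 0;
    for (int i = 0; i < 10000; i++) {
      gzip.write(event);              // this is the number rollSize currently tracks
      uncompressedBytes += event.length;
    }

    // The bytes that have actually reached the underlying stream so far are a
    // small fraction of the input; most of it is buffered inside the deflater.
    System.out.println("uncompressed bytes written   : " + uncompressedBytes);
    System.out.println("bytes on 'disk' before finish: " + sink.size());

    gzip.finish();                    // forces the remaining compressed data out
    System.out.println("bytes on 'disk' after finish : " + sink.size());
  }
}
{code}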
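And a rough sketch of the point 3 idea, assuming we track the observed
compressed-to-uncompressed ratio per batch with a running mean and variance
(all names here are hypothetical, not an existing Flume API):
{code:java}
// Hypothetical estimator for the probabilistic upper-bound roll: track the
// observed compressed/uncompressed ratio per batch and roll before a write
// that is likely to push the file past rollSize.
public class CompressionRatioEstimator {
  private long count = 0;
  private double mean = 0.0;  // running mean of the ratio (Welford's algorithm)
  private double m2 = 0.0;    // running sum of squared deviations from the mean

  /** Record the observed ratio for one completed, synced batch. */
  public void observeBatch(long inputBytes, long compressedBytesWritten) {
    if (inputBytes <= 0) {
      return;
    }
    double ratio = (double) compressedBytesWritten / inputBytes;
    count++;
    double delta = ratio - mean;
    mean += delta / count;
    m2 += delta * (ratio - mean);
  }

  /**
   * Upper-confidence estimate of the compressed bytes the next batch will add.
   * z = 2.33 is roughly the 99th percentile of a normal distribution, matching
   * the "close to 99% of the time" goal above.
   */
  public long estimateCompressedUpperBound(long inputBytes, double z) {
    if (count < 2) {
      return inputBytes;  // no history yet: conservatively assume no compression
    }
    double stddev = Math.sqrt(m2 / (count - 1));
    return (long) Math.ceil(inputBytes * (mean + z * stddev));
  }

  /** True if appending the next batch would probably push the file past rollSize. */
  public boolean shouldRollBeforeWrite(long compressedBytesSoFar,
                                       long nextBatchInputBytes,
                                       long rollSize) {
    long predicted = estimateCompressedUpperBound(nextBatchInputBytes, 2.33);
    return compressedBytesSoFar + predicted > rollSize;
  }
}
{code}
The sink would call observeBatch() after each flush/sync and
shouldRollBeforeWrite() before appending the next batch; a histogram of ratios
would work just as well as a mean/stddev if the distribution turns out not to
be normal.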
Please let me know what you guys think of the above idea.
was (Author: mpercy):
A few thoughts on this (just posted a couple comments inline on review board as
well):
1. Can we make this the standard behavior for all types of writers?
2. There is a flaw in this implementation: most of the data written so far in
the batch will still be held in the encoder's buffer, so this approach won't
work in many real-world cases. GZip is especially guilty of keeping large
amounts of data in memory and rarely flushing it to disk.
3. Even if this approach did work, it requires some back-of-the-envelope
calculation on the part of the operator, because the setting acts as a
lower bound instead of an upper bound. Nothing new, it's always been like
that... But it would be awesome for this to be an upper bound, so you could
just set it to your HDFS block size and forget it, or even have a config
option to roll at the detected block size. If we want an upper bound, it
would need to be done probabilistically, so that most of the time we don't go
over, which I believe should be fine for most use cases if it holds close to
99% of the time. We could keep some in-memory statistics about how much a
given HDFS Sink configuration tends to write to disk relative to the number
of input bytes in the batch (some kind of histogram). Then use a confidence
interval to determine whether we are reasonably sure the next write won't put
us over the top: if so, go ahead and write the event (presumably within the
current batch); otherwise, roll.
Please let me know what you guys think of the above idea.
> HDFS Sink rollSize is calculated based off of uncompressed size of cumulative
> events.
> -------------------------------------------------------------------------------------
>
> Key: FLUME-2128
> URL: https://issues.apache.org/jira/browse/FLUME-2128
> Project: Flume
> Issue Type: Bug
> Components: Sinks+Sources
> Affects Versions: v1.4.0, v1.3.1
> Reporter: Jeff Lord
> Assignee: Ted Malaska
> Labels: features
> Attachments: FLUME-2128-0.patch, FLUME-2128-1.patch
>
>
> The hdfs sink rollSize parameter is compared against uncompressed event sizes.
> The net effect is that if you are using compression and expect your files on
> HDFS to be rolled/sized based on the value set for rollSize, then your files
> will be much smaller than expected due to compression.
> We should take compression into account when it is enabled and roll based on
> the compressed size on HDFS.
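For reference, a configuration like the following (hedged example; the agent
and sink names are placeholders) is where the mismatch shows up: rollSize is
compared against uncompressed event bytes, so the gzip files that land on HDFS
come out well under the intended 128 MB.
{code}
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = CompressedStream
a1.sinks.k1.hdfs.codeC = gzip
# intended file size: 128 MB, but compared against uncompressed bytes
a1.sinks.k1.hdfs.rollSize = 134217728
# disable count- and time-based rolling so only rollSize applies
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.rollInterval = 0
{code}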
--
This message was sent by Atlassian JIRA
(v6.1#6144)