Inder,

>> so is it right to say that log retention property set to X days uses the
>> last activity on a segment file to determine when to delete a file and if
>> the file size is set to a large number and the same file keeps getting
>> appended on a daily basis then we won't achieve the 7 day cleanup till
>> either there isn't any activity done for 7 days or it has reached the
>> bigger size and rolled over and stays there for 7 days.
That is right.

>> on the other hand a smaller file size will ensure that it rolls over
>> multiple times in 7 days and the segments untouched in 7 days can be
>> knocked off thus optimizing space usage.

Optimizing space usage is not really the goal, right? Kafka performance
doesn't degrade with large accumulated data on the brokers, and disk space
is cheap. If you are concerned about limiting log retention to avoid
excessive disk usage, you can use the log.retention.size parameter to
control garbage collection based on the size of the log, instead of time.

>> are the default settings based on certain experimentation and recommended
>> for production use?

The default log file size is 500 MB and retention is 7 days. We've disabled
size-based garbage collection, since that really depends on specific
environments and applications. I guess using a combination of
log.retention.hours and log.retention.size is probably the best approach.
That way you roughly know how long the data will be available (for lazy
consumers) and you won't run out of disk space.
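For illustration, a broker config along these lines captures that
combination. The numbers below are only placeholders to show the shape of
the settings, not recommendations, so adjust them for your environment:

# roll a new segment roughly every 100 MB, so untouched segments become
# candidates for deletion sooner
log.file.size=104857600
# delete segments whose last modification is more than 7 days old
log.retention.hours=168
# also cap the log at roughly 50 GB, whichever limit is hit first
log.retention.size=53687091200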
Thanks,
Neha

On Tue, Oct 25, 2011 at 8:14 AM, Inder Pall <inder.p...@gmail.com> wrote:
> guys,
>
> so is it right to say that log retention property set to X days uses the
> last activity on a segment file
> to determine when to delete a file and if the file size is set to a large
> number and the same file keeps getting
> appended on a daily basis then we won't achieve the 7 day cleanup till
> either there isn't any activity done for 7 days or
> it has reached the bigger size and rolled over and stays there for 7 days.
>
> on the other hand a smaller file size will ensure that it rolls over
> multiple times in 7 days and the segments untouched in 7 days can be
> knocked off thus optimizing space usage.
>
> are the default settings based on certain experimentation and recommended
> for production use?
>
> - Inder
>
> On Tue, Oct 25, 2011 at 7:53 PM, Neha Narkhede <neha.narkh...@gmail.com> wrote:
>
>> Inder,
>>
>> >> 2. Why would you want to have multiple files within a partition? Broker has
>> >> to store more info to figure out the right file within a partition.
>>
>> There is not much advantage apart from better accuracy with the
>> getLatestOffset API.
>> Using that if you want to start consuming data close to a certain timestamp,
>> you get better accuracy if you have smaller log files.
>>
>> >> 3. Is it to achieve mmap kinda optimization and allowing the broker to do
>> >> less I/O in case a feed is really huge, or anything else?
>>
>> Not really. mmap is useful when you have random access on large files, or
>> have multiple processes trying to access the same file. It might actually
>> not work well with large files if your memory is fragmented. Since we have
>> sequential IO patterns, the filesystem caching itself works very well.
>>
>> Thanks,
>> Neha
>>
>> On Tuesday, October 25, 2011, Jay Kreps <jay.kr...@gmail.com> wrote:
>> > It is actually just to allow data deletion; we just delete whole segments
>> > in the cleanup. There is not much value to tuning the file size for most
>> > situations, but the tradeoff is that with smaller files you will have more
>> > open files but be closer to your desired retention.hours and
>> > retention.size settings.
>> >
>> > -Jay
>> >
>> > On Tue, Oct 25, 2011 at 1:59 AM, Inder Pall <inder.p...@gmail.com> wrote:
>> >
>> >> i am playing around with "log.file.size" (controls the size of a segment
>> >> file in a partition) and "log.retention.hours" with the following config.
>> >>
>> >> log.file.size=500
>> >> log.retention.hours=168
>> >>
>> >> Observation - i see multiple files getting generated within the same
>> >> partition.
>> >> Example: my topic name is revenuefeed and i see the following
>> >>
>> >> ls -lh /tmp/kafka-logs/revenuefeed-0/*
>> >> -rw-r--r-- 1 inder users 537 Oct 25 01:38
>> >> /tmp/kafka-logs/revenuefeed-0/00000000000000000000.kafka
>> >> -rw-r--r-- 1 inder users 512 Oct 25 01:39
>> >> /tmp/kafka-logs/revenuefeed-0/00000000000000000537.kafka
>> >>
>> >> Questions
>> >> --------------
>> >> 1. Shouldn't these two properties go hand in hand?
>> >> 2. Why would you want to have multiple files within a partition? Broker has
>> >> to store more info to figure out the right file within a partition.
>> >> 3. Is it to achieve mmap kinda optimization and allowing the broker to do
>> >> less I/O in case a feed is really huge, or anything else?
>> >>
>> >> -- Inder
>> >>
>> >
>>
>
> --
> -- Inder
>
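On the getLatestOffset point above: the lookup Neha mentions goes through
SimpleConsumer.getOffsetsBefore(). Below is a rough sketch, assuming the
0.7-era kafka.javaapi.consumer.SimpleConsumer; the host, port, and topic are
just examples, and the exact signature may differ in your Kafka version.

import kafka.javaapi.consumer.SimpleConsumer;

public class OffsetLookup {
    public static void main(String[] args) {
        // host/port are examples; 10 s socket timeout, 64 KB fetch buffer
        SimpleConsumer consumer =
                new SimpleConsumer("localhost", 9092, 10000, 64 * 1024);
        try {
            // -1L means "latest" and -2L means "earliest"; any other value is a
            // timestamp in ms matched against segment file boundaries, which is
            // why smaller segments give a more precise starting offset
            long time = System.currentTimeMillis();
            long[] offsets = consumer.getOffsetsBefore("revenuefeed", 0, time, 10);
            for (long offset : offsets) {
                System.out.println("segment boundary offset: " + offset);
            }
        } finally {
            consumer.close();
        }
    }
}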