Hey Dong,

Sorry for being slow to catch up to this.

I think the benefit of the sanity check seems a little dubious in the first
place. We detect garbage at the end of the index file, but that's about it.
Is there any reason to think that corruption is more likely to occur there
or any other reason to think this check is still beneficial for flushed
data? I assume we did the check because we presumed it was cheap, but
perhaps the cost is adding up as the number of partitions grows. How much
does startup time improve if we skip the sanity check for data earlier than
the recovery point? Does the lazy loading itself give some additional
benefit beyond skipping the sanity check? As Jay mentions above, the sanity
checks seem strictly speaking optional. We don't bother checking the
segments themselves for example.

Thanks,
Jason




It probably still makes sense for segments beyond the recovery point

On Wed, Mar 21, 2018 at 9:59 PM, Dong Lin <lindon...@gmail.com> wrote:

> Hey Jay,
>
> Yeah our existing sanity check only read the last entry in the index files.
> I must have miscommunicated if I previously said it was reading the full
> index. Broker appears to be spending a lot of time just to read the last
> entry of index files for every log segment. This is probably because OS
> will load a chunk of data that is much larger than the entry itself from
> disk to page cache. This KIP tries to make this part of operation lazy. I
> guess you are suggesting that we should just make the lazy loading the
> default behavior?
>
> Yes we currently require manual intervention if the log file is corrupted,
> i.e. if two messages with the same offset are appended to the disk
> (KAFKA-6488). The sanity check on broker startup is a bit different since
> it deals with the corruption of index files (e.g. offset index, time index
> and snapshot files) instead of the log data. In this case if index files
> are corrupted broker will automatically recover it by rebuilding the index
> files using data in the log files, without requiring manual intervention.
> Thus the design question is whether this should be done before broker can
> become leader for any partitions -- there is tradeoff between broker
> startup time and risk of delaying user requests if broker need to rebuild
> index files when it is already leader. I prefer lazy loading to reduce
> broker startup time. Not sure what are the feedback from the community on
> this issue.
>
> Thanks,
> Dong
>
>
> On Wed, Mar 21, 2018 at 7:36 AM, Jay Kreps <j...@confluent.io> wrote:
>
> > Hey Dong,
> >
> > Makes total sense. What I'm saying is I don't think that the sanity check
> > is part of any formal guarantee we provide. It is true that corruption of
> > data flushed to disk will be a potential problem, but I don't think the
> > sanity check solves that it just has a couple heuristics to help detect
> > certain possible instances of it, right? In general I think our
> assumption
> > has been that flushed data doesn't disappear or get corrupted and if it
> > does you need to manually intervene. I don't think people want to
> configure
> > things at this level so what I was suggesting was understanding why the
> > sanity check is slow and trying to avoid that rather than making it
> > configurable. I think you mentioned it was reading the full index into
> > memory. Based on the performance you describe this could be true, but it
> > definitely should not be reading anything but the last entry in the index
> > so that would be a bug. That read also happens in sanityCheck() only in
> the
> > time-based index right? In the offset index we do the same read but it
> > happens in initialization. If that read is the slow thing it might make
> > sense to try to remove it or make it lazy in both cases. If it is some
> > other part of the code then (e.g. the size check) then that may be able
> to
> > be avoided entirely (I think by the time we sanity check we already know
> > the file size from the mapping...). That was what I meant by doing some
> > data driven analysis. Maybe a quick run with hprof would help determine
> the
> > root cause of why sanityCheck is slow?
> >
> > -Jay
> >
> > On Tue, Mar 20, 2018 at 12:13 AM Dong Lin <lindon...@gmail.com> wrote:
> >
> > > Hey Jay,
> > >
> > > Thanks for your comments!
> > >
> > > Yeah recovery is different from the sanity check. They are correlated
> in
> > > the sense that there may still be corrupted index files even after
> clean
> > > broker shutdown. And in that case if we delay the sanity check then we
> > may
> > > delay the log recovery. The main goal of this KIP is to optimize the
> > sanity
> > > check related work so that it does not delay the broker startup much.
> > >
> > > The KIP mentioned that the sanity check is done using log recovery
> > > background thread. The name "recovery" is mentioned mainly because the
> > > background thread number is determined using the existing
> > > config num.recovery.threads.per.data.dir. I have updated the KIP to
> make
> > > this less confusing.
> > >
> > > It makes a ton of sense to optimize the broker startup time in a data
> > > driven fashion. The currently optimize is done kind of in this fashion.
> > The
> > > broker log shows that LogManager.loadLogs() takes a long time in large
> > > clusters. Then I started broker with cold cache and repeatedly get
> thread
> > > dump to see what are broker threads are doing during
> > LogManager.loadLogs().
> > > Most of the threads are working on sanityCheck() and this motivates the
> > > change in this KIP. Previously broker shutdown time was investigated
> in a
> > > similar data driven fashion and optimized with KAFKA-6172 and
> KAFKA-6175.
> > > It seems that the current KIP can reduces the rolling bounce time of a
> > > large cluster by 50% -- there may be room for further improvement but
> > maybe
> > > those do not require as big a change (with the caveat described in the
> > KIP)
> > > as suggested in this KIP.
> > >
> > > It is not clear whether it is safe to just read the latest segment
> > without
> > > sanity checking all previous inactive segment of a given partition if
> > > transaction is used. Otherwise we probably want to always skip the
> sanity
> > > check of inactive segments without introducing a new config. Maybe the
> > > developers familiar with the transaction can comment on that?
> > >
> > > Thanks,
> > > Dong
> > >
> > >
> > > On Mon, Mar 19, 2018 at 7:21 PM, Jay Kreps <j...@confluent.io> wrote:
> > >
> > > > Optimizing startup seems really valuable but I'm a little confused by
> > > this.
> > > >
> > > > There are two different things:
> > > > 1. Recovery
> > > > 2. Sanity check
> > > >
> > > > The terminology we're using is a bit mixed here.
> > > >
> > > > Recovery means checksumming the log segments and rebuilding the index
> > on
> > > a
> > > > hard crash. This only happens on unflushed segments, which is
> generally
> > > > just the last segment. Recovery is essential for the correctness
> > > guarantees
> > > > of the log and you shouldn't disable it. It only happens on hard
> crash
> > > and
> > > > is not a factor in graceful restart. We can likely optimize it but
> that
> > > > would make most sense to do in a data driven fashion off some
> > profiling.
> > > >
> > > > However there is also a ton of disk activity that happens during
> > > > initialization (lots of checks on the file size, absolute path,
> etc). I
> > > > think these have crept in over time with people not really realizing
> > this
> > > > code is perf sensitive and java hiding a lot of what is and isn't a
> > file
> > > > operation. One part of this is the sanityCheck() call for the two
> > > indexes.
> > > > I don't think this call reads the full index, just the last entry in
> > the
> > > > index, right?. There should be no need to read the full index except
> > > during
> > > > recovery (and then only for the segments being recovered). I think it
> > > would
> > > > make a ton of sense to optimize this but I don't think that
> > optimization
> > > > needs to be configurable as this is just a helpful sanity check to
> > detect
> > > > common non-sensical things in the index files, but it isn't part of
> the
> > > > core guarantees, in general you aren't supposed to lose committed
> data
> > > from
> > > > disk, and if you do we may be able to fail faster but we
> fundamentally
> > > > can't really help you. Again I think this would make the most sense
> to
> > do
> > > > in a data driven way, if you look at that code I think it is doing
> > crazy
> > > > amounts of file operations (e.g. getAbsolutePath, file sizes, etc). I
> > > think
> > > > it'd make most sense to profile startup with a cold cash on a large
> log
> > > > directory and do the same with an strace to see how many redundant
> > system
> > > > calls we do per segment and what is costing us and then cut some of
> > this
> > > > out. I suspect we could speed up our startup time quite a lot if we
> did
> > > > that.
> > > >
> > > > For example we have a bunch of calls like this:
> > > >
> > > >     require(len % entrySize == 0,
> > > >
> > > >             "Index file " + file.getAbsolutePath + " is corrupt,
> found
> > "
> > > +
> > > > len +
> > > >
> > > >             " bytes which is not positive or not a multiple of 8.")
> > > > I'm pretty such file.getAbsolutePath is a system call and I assume
> that
> > > > happens whether or not you fail the in-memory check?
> > > >
> > > > -Jay
> > > >
> > > >
> > > > On Sun, Feb 25, 2018 at 10:27 PM, Dong Lin <lindon...@gmail.com>
> > wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > I have created KIP-263: Allow broker to skip sanity check of
> inactive
> > > > > segments on broker startup. See
> > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > > > > 263%3A+Allow+broker+to+skip+sanity+check+of+inactive+
> > > > > segments+on+broker+startup
> > > > > .
> > > > >
> > > > > This KIP provides a way to significantly reduce time to rolling
> > bounce
> > > a
> > > > > Kafka cluster.
> > > > >
> > > > > Comments are welcome!
> > > > >
> > > > > Thanks,
> > > > > Dong
> > > > >
> > > >
> > >
> >
>

Reply via email to