Jay, great information, thank you.  I am in a testing phase, so I have been 
continually resetting the commit offsets of my consumers before re-running 
consumer performance tests.  I realize now my retention policy was set to 7 
days, and I had added 3 new brokers at day 5 and reassigned partitions to these 
new brokers.  So it seems the partitions owned by the original broker 0 have 
rolled, but the reassignment of partitions to brokers 1, 2, 3 has reset the 
retention clock for those partitions.  For the sake of consistency, maybe the 
current state of the retention policy could be sent to the new broker during 
the partition reassignment.  That way, partitions on brokers 1, 2, 3 would roll 
at roughly the same time as the partitions on broker 0.  Although, like you 
said, it's a lower bound and perhaps not that important (just slightly 
confusing when a noob is trying to spot-check the validity of a replica).  In 
the meantime I will disable the retention policy and start consuming at an 
offset that is in the range of all replicas.  Thank you again!
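Picking a starting offset that is valid on every replica can be sketched like this (the broker names and offset values below are illustrative, not taken from the cluster in this thread):

```python
# Hypothetical earliest/latest offsets reported by each replica of one
# partition (illustrative values, not real cluster data).
replica_earliest = {"broker0": 1676913, "broker1": 0, "broker2": 0}
replica_latest = {"broker0": 2000000, "broker1": 1990000, "broker2": 1995000}

# An offset survives any leader change only if every replica has it:
# it must be >= the largest "earliest" and <= the smallest "latest".
safe_start = max(replica_earliest.values())
safe_end = min(replica_latest.values())
assert safe_start <= safe_end, "no offset is present on all replicas"
print(safe_start)  # 1676913
```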

Luke Forehand | NetworkedInsights.com | Software Engineer

________________________________________
From: Jay Kreps <[email protected]>
Sent: Wednesday, August 28, 2013 5:29 PM
To: [email protected]
Subject: Re: replicas have different earliest offset

On a single server our retention window is always approximate and a lower
bound on what is retained, since we only discard full log segments at a time.
That is, if you say you want to retain 100GB and have a 1GB segment size, we
will discard the oldest segment only when doing so would not bring the
retained data below 100GB (and similarly with time-based retention).
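As a rough sketch of that lower-bound behavior (plain Python with made-up sizes; this is not Kafka's actual log-cleaner code): whole segments are dropped from the oldest end only while the data that would remain still meets the retention target.

```python
def prune(segment_sizes, retention_bytes):
    """Discard whole segments, oldest first, while the data that would
    remain still meets the retention target. Retention is therefore a
    lower bound: you always keep at least retention_bytes, rounded up
    to whole segments."""
    segs = list(segment_sizes)  # oldest-first segment sizes in bytes
    while len(segs) > 1 and sum(segs) - segs[0] >= retention_bytes:
        segs.pop(0)  # dropping the oldest still leaves >= the target
    return segs

# Five 30GB segments with a 100GB target: only one segment can go,
# leaving 120GB retained -- more than the target, never less.
print(prune([30, 30, 30, 30, 30], 100))  # [30, 30, 30, 30]
```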

Between servers no attempt is made to synchronize the discarding of data.
That is, all replicas will likely discard at roughly the same time, but this
is purely a local computation on each of them. Since the window is
approximate and a lower bound, it does not seem useful to try to synchronize
this further.

If your consumers are bumping up against the retention window so closely
that they may actually be falling off the end of it, that is a problem.
Indeed, even in the absence of leader change, if you are lagging this much
you will likely fall off the end of the retention window on the leader
eventually. So this is either a problem of retention being too small (double
it) or of the consumer being fundamentally unable to keep up (in which case
no amount of retention will help).
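The "fundamentally unable to keep up" case follows from simple arithmetic (the rates below are illustrative assumptions, not measurements from this cluster):

```python
# If consume rate < produce rate, lag grows linearly, so the consumer
# must eventually fall behind the retention window, whatever its size.
produce_rate = 10_000           # messages/sec written (assumed)
consume_rate = 9_000            # messages/sec read (assumed)
retention_secs = 7 * 24 * 3600  # 7-day retention window

lag_growth = produce_rate - consume_rate        # lag added per second
window_msgs = produce_rate * retention_secs     # messages the window holds
secs_until_fall_off = window_msgs / lag_growth
print(secs_until_fall_off / 86_400)             # days until data loss: 70.0
```

Doubling retention only doubles that countdown; it never stops it.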

-Jay


On Wed, Aug 28, 2013 at 2:51 PM, Luke Forehand <
[email protected]> wrote:

> I'm running into strange behavior when testing failure scenarios.  I have
> 4 brokers and 8 partitions for a topic called "feed".  I wrote a piece of
> code that prints out the partitionId, leaderId, and earliest offset for
> each partition.
>
> Here is the printed information about partition leader earliest offsets:
>
> partition:0 leader:0 offset: 1676913
> partition:1 leader:1 offset: 0
> partition:2 leader:2 offset: 0
> partition:3 leader:0 offset: 1676760
> partition:4 leader:0 offset: 1676635
> partition:5 leader:1 offset: 0
> partition:6 leader:2 offset: 0
> partition:7 leader:0 offset: 1676101
>
> I then kill broker 0 (using kill <pid>) and re-run my program
>
> partition:0 leader:1 offset: 0
> partition:1 leader:1 offset: 0
> partition:2 leader:2 offset: 0
> partition:3 leader:3 offset: 0
> partition:4 leader:1 offset: 0
> partition:5 leader:1 offset: 0
> partition:6 leader:2 offset: 0
> partition:7 leader:1 offset: 0
>
> As you can see, the leader has changed for every partition whose leader was
>  broker 0.  However, the earliest offset has also changed.  I was under the
> impression that a replica must have the same offset range, otherwise it
> would confuse the consumer of the partition.  For example, during a
> failover test my consumer tried to request an offset from a partition on
> the new leader, but the offset didn't exist (it was earlier than the
> earliest offset in that partition).  Can anybody explain what is happening?
>
> Here is my code that prints the leader partition offset information:
> https://gist.github.com/lukeforehand/c37e22aea7192e00fff5
>
> Thanks,
> Luke
>
>
>
