Try using the getOffsetsBefore API in SimpleConsumer. (There is also a
command-line tool - GetOffsetShell.)
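
For the command-line route, the invocation looks roughly like the following.
(The exact flags vary between Kafka versions, and the broker host/port and
topic below are placeholders.)

  bin/kafka-run-class.sh kafka.tools.GetOffsetShell \
    --server kafka://broker-host:9092 \
    --topic mytopic --partition 0 \
    --time -2 --offsets 1

The --time option takes a millisecond timestamp, or -1 for the latest
offset and -2 for the earliest offset still on disk.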

You can specify a topic, partition, and time, and it will return the valid
offsets prior to that time. It is approximate, though, since it goes by the
modification time (modtime) of the log segments in each partition. If you
are using SimpleConsumer directly you can just consume from those offsets.
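
If you want to do it programmatically, here is a rough sketch against the
0.7-era javaapi SimpleConsumer (broker host, port, timeouts and the topic
name are placeholders, and the exact signature may differ in your version):

  import kafka.javaapi.consumer.SimpleConsumer;

  // host, port, socket timeout (ms), fetch buffer size (bytes)
  SimpleConsumer consumer =
      new SimpleConsumer("broker-host", 9092, 10000, 64 * 1024);

  // Ask for the single highest offset whose segment is older than T-4 days.
  long fourDaysAgo = System.currentTimeMillis() - 4L * 24 * 60 * 60 * 1000;
  long[] offsets = consumer.getOffsetsBefore("mytopic", 0, fourDaysAgo, 1);

  // If nothing matched, fall back to the earliest offset still on disk
  // (-2L means "earliest", -1L means "latest").
  long startOffset = (offsets.length > 0)
      ? offsets[0]
      : consumer.getOffsetsBefore("mytopic", 0, -2L, 1)[0];

You would then issue fetch requests starting at startOffset for that
partition, and repeat per partition on each broker.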

Joel

On Thu, Sep 20, 2012 at 9:20 AM, Matthew Rathbone <matt...@foursquare.com> wrote:

> Hey guys,
>
> I've come across this behavior with the hadoop-consumer, but it certainly
> applies to any consumer.
>
> We've had our brokers up and running for about 9 days, with a 7-day
> retention policy. (3 brokers with 3 partitions each)
> I've just deployed a new hadoop consumer and wanted to read from the
> beginning of time (7-days ago).
>
> Here's the behavior I'm seeing:
> - I tell the consumer to start from 0
> - It queries the partition and finds the minimum available offset is
> 2000000, so it starts there
> - It starts consuming from 2000000+
> - It throws an exception ("kafka.common.OffsetOutOfRangeException") because
> that message has already been deleted
>
> Through sheer luck, after a few task failures the job managed to beat this
> race condition, but it raises the question:
>
> - How would I tell a consumer to start querying from T-4 days? That would
> totally solve the issue. I don't really need a full 7 days, but I have no
> way to resolve a time to an offset.
> (This is also useful if people are tailing the events, so they can tail
> events from 3 days ago while grepping for something.)
>
> Any ideas? Anyone else experienced this?
> --
> Matthew Rathbone
> Foursquare | Software Engineer | Server Engineering Team
> matt...@foursquare.com | @rathboma <http://twitter.com/rathboma> |
> 4sq <http://foursquare.com/rathboma>
>
