Hi Will,

Is the topic in question your change-log topic or the checkpoint-topic or
one of your inputs? (My understanding from reading this is it's your

Can you please attach some more surrounding logs?
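
For context on why the topic matters: a consumer gets an
OffsetOutOfRangeException when the offset it requests falls outside the
range the broker currently retains for that partition. Here is a minimal
Python sketch of that check. It is not Samza or Kafka code;
PartitionRange and classify_offset are hypothetical names made up for
illustration, and the offsets are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionRange:
    """Offsets a broker currently retains for one partition (hypothetical type)."""
    earliest: int  # first offset still available on the broker
    latest: int    # log end offset, i.e. the offset the next message would get


def classify_offset(requested: int, retained: PartitionRange) -> str:
    """Say where a requested offset sits relative to the retained range."""
    if requested < retained.earliest:
        # Typical when data aged out under the retention policy.
        return "below-range"
    if requested > retained.latest:
        # Typical when the topic was rebuilt or truncated after the
        # offset was checkpointed.
        return "above-range"
    # Requesting exactly the log end offset is valid: the consumer
    # simply waits for new data.
    return "in-range"


if __name__ == "__main__":
    retained = PartitionRange(earliest=1200, latest=5000)
    for offset in (800, 1200, 5000, 5001):
        print(offset, classify_offset(offset, retained))
```

Comparing the checkpointed offset for an affected partition against the
earliest and latest offsets the broker reports would tell you which side
of the range the request falls on.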


On Mon, Aug 20, 2018 at 6:16 AM, Will Schneider <wschnei...@tripadvisor.com>

> Hello all,
> We've recently been experiencing some Kafka/Samza issues we're not quite
> sure how to tackle. We've exhausted all our internal expertise and were
> hoping that someone on the mailing lists might have seen this before and
> knows what might cause it:
> KafkaSystemConsumer [WARN] While refreshing brokers for
> [Store_LogParser_RedactedMetadata_RedactedEnvironment,35]:
> org.apache.kafka.common.errors.OffsetOutOfRangeException:
> The requested offset is not within the range of offsets maintained by the
> server.. Retrying.
> ^ (Above repeats indefinitely until we intervene)
> A bit about our use case:
>    - Versions:
>       - Kafka 1.0.1 (CDH Distribution 3.1.0-
>       - Samza 0.14.1
>       - Hadoop: 2.6.0-cdh5.12.1
>    - We've seen some manifestation of this error in 4 different
>    environments with minor differences in configuration, but all running the
>    same versions of the software
>       - Distributed Samza on Yarn (~10 node yarn environment, 3-7 node
>       kafka environment)
>       - Non-distributed virtual test environment (Samza on yarn, but with
>       no network in between)
>    - We have not found a reliable way to reproduce this error
>    - Issue typically presents on process startup. It usually doesn't make
>    a difference if the application was down for 5 minutes or 5 days before
>    that startup
>    - The LogParser application experiencing this issue is reading and
>    parsing a set of log files, and supplementing them with metadata stored in
>    the Store topic in question, and cached locally in RocksDB
>    - The LogParser application has 40-60 running tasks and partitions
>    depending on configuration
>    - There is no discernable pattern for where the error presents itself:
>       - It is not consistent WRT which yarn node hosts tasks with the
>       issue
>       - It is not consistent WRT which kafka node hosts the partitions
>       relevant to the issue
>       - The pattern does not persist with issue nodes upon consecutive
>       appearances of the error
>       - This leads us to believe the bug is probably endemic to the whole
>       cluster and not the result of a random hardware issue
>    - Offsets for the LogParser application are maintained in a samza
>    topic called something like:
>       - __samza_checkpoint_ver_1_for_LogParser-RedactedEnvironment_1
>    - Upon startup, checkpoints are refreshed from that topic, and
>    we'll see something in the log similar to:
>       - kafka.KafkaCheckpointManager [INFO] Read 6000 from topic:
>       __samza_checkpoint_ver_1_for_LogParser-RedactedEnvironment_1.
>       Current offset: 5999
>    - On more than one occasion, we have attempted to repair the job by
>    killing individual yarn containers and letting samza retry them.
>       - This will occasionally work. More frequently, it will get the
>       partition stuck in a loop trying to read from the
>       __samza_checkpoint topic forever; we're suspicious that the retry
>       loop above is storing offsets one or many times, causing the topic
>       to fill up considerably.
>    - We are aware of only two workarounds:
>       - 1- Fully clearing out the data disks on the Kafka servers and
>       rebuilding the topics always seems to work, at least for a time.
>       - 2- We can use a setting like:
>       streams.Store_LogParser_RedactedMetadata_RedactedEnvironment.samza.reset.offset=true,
>       which will necessarily ignore the checkpoint topic, and not bother
>       to validate any offset on the Store.
>          - This works, but requires us to do a lengthy metadata refresh
>          immediately after startup, which is less than ideal.
>    - We have also seen this on rare occasion on other, smaller Samza
>    tiers
>       - In those cases, the common thread appears to be that the tier was
>       left down for a period of time longer than the Kafka retention
>       timeout, and got stuck in the loop upon restart. Attempts at
>       reproducing it this way have been unsuccessful
>       - Worth adding that in this case, adding the samza.reset.offset
>       parameter to the configuration did not seem to have the intended
>       effect
> On another possibly-related note, one of our clusters periodically throws
> an error like this, but usually recovers without intervention:
> KafkaSystemAdmin [WARN] Exception while trying to get offset for
> SystemStreamPartition [kafka,
> Store_LogParser_RedactedMetadata_RedactedEnvironment, 32]:
> org.apache.kafka.common.errors.NotLeaderForPartitionException: This
> server is not the leader for that topic-partition.. Retrying.
>    - We've seen this error message crop up when we've had issues with the
>    network in our datacenter, but we're not aware of any such issue at the
>    times when we're experiencing the bigger issue. We're not sure if that
>    might be related or not.
> Has anyone seen these errors before? Is there a known workaround or fix
> for it?
> Thanks for your help!
> Attached is a copy of the Samza configuration for the job in question, in
> case it contains more valuable information I may have missed.
> -Will Schneider

Jagadish V,
Graduate Student,
Department of Computer Science,
Stanford University
