Hi,

With the information provided, these are the steps I can think of (based on
my experience with Kafka):

1. Do a describe on the topic and check whether the partitions and replicas
are evenly distributed across the brokers (see the commands sketched after
this list). If not, you might want to try the 'Reassign Partitions Tool' -
https://cwiki.apache.org/confluence/display/KAFKA/Replication+tools#Replicationtools-6.ReassignPartitionsTool
2. Are some partitions receiving more data than others, leading to a
disk-space imbalance across the nodes in the cluster, to the point that the
Kafka server process goes down on one or more machines?
3. From what I understand, your Kafka and Spark machines are the same? If
so, how much memory is the broker hosting replica 0 using when your Spark
cluster is running at full throttle?
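
For steps 1 and 2, this is roughly what I have in mind (a sketch only; the
zookeeper address and log directory are placeholders, and "userlogs" is
taken from your state-change log, so adjust them for your setup):

  # check how partitions, leaders and in-sync replicas are spread across brokers
  bin/kafka-topics.sh --describe --zookeeper zk1:2181 --topic userlogs

  # rough per-partition disk usage in a broker's log directory
  du -sh /path/to/kafka-logs/userlogs-* | sort -h

If the describe output shows the load skewed towards a few brokers, the
Reassign Partitions Tool from the link above can generate a move plan with
--generate and then apply it with --execute.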

Workaround -
Try running the Preferred Replica Leader Election Tool -
https://cwiki.apache.org/confluence/display/KAFKA/Replication+tools#Replicationtools-1.PreferredReplicaLeaderElectionTool
to make the replica you noticed earlier (the one that was in sync when the
cluster was healthy) the leader for this partition. A sketch of the command
is below.
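
Roughly (again just a sketch; the zookeeper address is a placeholder, and
the JSON file limits the election to the affected partition from your log):

  # list only the partition that currently has leader -1
  echo '{"partitions": [{"topic": "userlogs", "partition": 84}]}' > election.json

  # trigger preferred replica election for that partition
  bin/kafka-preferred-replica-election.sh --zookeeper zk1:2181 \
      --path-to-json-file election.json

Note that this only helps if the preferred replica (replica 0 in your log)
is alive and back in the ISR; otherwise the election will fail with the
same error you posted.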

Regards,
Prabhjot

On Tue, Nov 24, 2015 at 7:11 AM, Gwen Shapira <g...@confluent.io> wrote:

> We fixed many many bugs since August. Since we are about to release 0.9.0
> (with SSL!), maybe wait a day and go with a released and tested version.
>
> On Mon, Nov 23, 2015 at 3:01 PM, Qi Xu <shkir...@gmail.com> wrote:
>
> > Forgot to mention that the Kafka version we're using is from August's
> > trunk branch---which has the SSL support.
> >
> > Thanks again,
> > Qi
> >
> >
> > On Mon, Nov 23, 2015 at 2:29 PM, Qi Xu <shkir...@gmail.com> wrote:
> >
> >> Looping in another person from our team.
> >>
> >> On Mon, Nov 23, 2015 at 2:26 PM, Qi Xu <shkir...@gmail.com> wrote:
> >>
> >>> Hi folks,
> >>> We have a 10-node cluster with several topics. Each topic has about
> >>> 256 partitions with a replication factor of 3. Now we have run into an
> >>> issue where, in some topics, a few partitions (< 10) have a leader of -1
> >>> and all of them have only one in-sync replica.
> >>>
> >>> From the Kafka manager, here's the snapshot:
> >>> [image: Inline image 2]
> >>>
> >>> [image: Inline image 1]
> >>>
> >>> here's the state log:
> >>> [2015-11-23 21:57:58,598] ERROR Controller 1 epoch 435499 initiated
> >>> state change for partition [userlogs,84] from OnlinePartition to
> >>> OnlinePartition failed (state.change.logger)
> >>> kafka.common.StateChangeFailedException: encountered error while
> >>> electing leader for partition [userlogs,84] due to: Preferred replica 0
> >>> for partition [userlogs,84] is either not alive or not in the isr.
> >>> Current leader and ISR: [{"leader":-1,"leader_epoch":203,"isr":[1]}].
> >>> Caused by: kafka.common.StateChangeFailedException: Preferred replica 0
> >>> for partition [userlogs,84] is either not alive or not in the isr.
> >>> Current leader and ISR: [{"leader":-1,"leader_epoch":203,"isr":[1]}]
> >>>
> >>> My questions are:
> >>> 1) How could this happen, and how can I fix it or work around it?
> >>> 2) Are 256 partitions too many? We have about 200+ cores for the Spark
> >>> streaming job.
> >>>
> >>> Thanks,
> >>> Qi
> >>>
> >>>
> >>
> >
>



-- 
---------------------------------------------------------
"There are only 10 types of people in the world: Those who understand
binary, and those who don't"
