Bharad Tirumala created KAFKA-5038:
--------------------------------------

             Summary: running multiple kafka streams instances causes one or 
more instances to get into file contention
                 Key: KAFKA-5038
                 URL: https://issues.apache.org/jira/browse/KAFKA-5038
             Project: Kafka
          Issue Type: Bug
          Components: streams
    Affects Versions: 0.10.2.0
         Environment: 3 Kafka broker machines and 3 kafka streams machines.
Each machine is Linux 64 bit, CentOS 6.5 with 64GB memory, 8 vCPUs running in 
AWS
31GB java heap space allocated to each KafkaStreams instance and 4GB allocated 
to each Kafka broker.

            Reporter: Bharad Tirumala
             Fix For: 0.10.2.0


Having multiple kafka streams application instances causes one or more 
instances to get into file lock contention, and the affected instance(s) 
become unresponsive with an uncaught exception.
The exception is below:
22:14:37.621 [StreamThread-7] WARN  o.a.k.s.p.internals.StreamThread - Unexpected state transition from RUNNING to NOT_RUNNING
22:14:37.621 [StreamThread-13] WARN  o.a.k.s.p.internals.StreamThread - Unexpected state transition from RUNNING to NOT_RUNNING
22:14:37.623 [StreamThread-18] WARN  o.a.k.s.p.internals.StreamThread - Unexpected state transition from RUNNING to NOT_RUNNING
22:14:37.625 [StreamThread-7] ERROR n.a.a.k.t.KStreamTopologyBase - Uncaught Exception:org.apache.kafka.streams.errors.ProcessorStateException: task directory [/data/kafka-streams/rtp-kstreams-metrics/0_119] doesn't exist and couldn't be created
        at org.apache.kafka.streams.processor.internals.StateDirectory.directoryForTask(StateDirectory.java:75)
        at org.apache.kafka.streams.processor.internals.StateDirectory.lock(StateDirectory.java:102)
        at org.apache.kafka.streams.processor.internals.StateDirectory.cleanRemovedTasks(StateDirectory.java:205)
        at org.apache.kafka.streams.processor.internals.StreamThread.maybeClean(StreamThread.java:753)
        at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:664)
        at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:368)
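
One way a "doesn't exist and couldn't be created" failure can occur without any permission problem is a check-then-create race between threads. The sketch below only illustrates that general pattern with java.io.File; it is an assumption, not the actual StateDirectory code, and the class and method names (MkdirRaceSketch, checkThenCreate, createOrAccept) are hypothetical. The "interleaved" callback stands in for a second thread that creates the directory between the exists() check and the mkdir() call.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

public class MkdirRaceSketch {

    // Fragile pattern (hypothetical): reports failure if the directory
    // appears between the exists() check and the mkdir() call.
    static boolean checkThenCreate(File dir, Runnable interleaved) {
        if (!dir.exists()) {
            interleaved.run();   // stands in for another thread's work
            return dir.mkdir();  // false if the directory already exists
        }
        return true;
    }

    // Tolerant pattern: a failed mkdir() is acceptable as long as the
    // directory is there afterwards.
    static boolean createOrAccept(File dir) {
        return dir.mkdir() || dir.isDirectory();
    }

    public static void main(String[] args) throws IOException {
        File base = Files.createTempDirectory("state-dir-sketch").toFile();
        File taskDir = new File(base, "0_119");

        // Simulate a second thread creating the same task directory first.
        boolean fragile = checkThenCreate(taskDir, () -> new File(base, "0_119").mkdir());
        boolean tolerant = createOrAccept(taskDir);

        System.out.println("fragile=" + fragile + " tolerant=" + tolerant);
        // prints "fragile=false tolerant=true"
    }
}
```

Whether this is what happens inside directoryForTask() here is not confirmed by the log above; it only shows that the directory can exist while the creating call still reports failure.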

This happens within a couple of minutes after the instances are up, while 
there is NO data being sent to the broker yet. The streams app is started with 
auto.offset.reset set to "latest".

Please note that there are no permission or capacity issues. This may have 
nothing to do with the number of instances, but I could easily reproduce it 
with 3 streams instances running. This is similar to (and may be the same bug 
as) [KAFKA-3758].

Here is some relevant configuration info:
3 kafka brokers have one topic with 128 partitions and a replication factor of 1
3 kafka streams applications (running on 3 machines) have a single processor 
topology and this processor is not doing anything (the process() method just 
returns and the punctuate() method just commits)
There is no data flowing yet, so the process() and punctuate() methods are not 
even called yet.
The 3 kafka streams instances have 43, 43 and 42 threads respectively 
(128 threads in total, so one task per thread, distributed across the 
three streams instances on 3 machines).
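
The thread and task layout described above can be checked with a small sketch. It assumes a single sub-topology with topic group id 0, which is consistent with the 0_119 directory in the error, and it assumes task directories are named <topicGroupId>_<partition>. The class and method names (TaskLayoutSketch, totalThreads, taskDirName) are hypothetical.

```java
import java.util.stream.IntStream;

public class TaskLayoutSketch {

    // Sum of stream threads across all instances.
    static int totalThreads(int... threadsPerInstance) {
        return IntStream.of(threadsPerInstance).sum();
    }

    // Task directory name, assuming the "<topicGroupId>_<partition>" layout.
    static String taskDirName(int topicGroupId, int partition) {
        return topicGroupId + "_" + partition;
    }

    public static void main(String[] args) {
        int partitions = 128;                      // from the topic described above
        int threads = totalThreads(43, 43, 42);    // threads per streams instance

        System.out.println("threads=" + threads + " partitions=" + partitions);
        // prints "threads=128 partitions=128"

        // Partition 119 of the single sub-topology maps to the failing directory.
        System.out.println(taskDirName(0, 119));
        // prints "0_119"
    }
}
```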

Here are the configurations that I played around with:
session.timeout.ms=300000
heartbeat.interval.ms=60000
max.poll.records=100
num.standby.replicas=1
commit.interval.ms=10000
poll.ms=100
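
For reference, a minimal sketch of how the settings above could be collected into a java.util.Properties object before being handed to the streams application. Plain string keys are used so the snippet runs without the Kafka jars; the class and method names (StreamsSettingsSketch, settings, heartbeatWithinThird) are hypothetical, and required settings such as application.id and bootstrap.servers are left out because they are not given in this report. The helper checks the usual guidance that heartbeat.interval.ms should be no more than a third of session.timeout.ms.

```java
import java.util.Properties;

public class StreamsSettingsSketch {

    // The values listed in this report, plus auto.offset.reset from above.
    static Properties settings() {
        Properties p = new Properties();
        p.setProperty("auto.offset.reset", "latest");
        p.setProperty("session.timeout.ms", "300000");
        p.setProperty("heartbeat.interval.ms", "60000");
        p.setProperty("max.poll.records", "100");
        p.setProperty("num.standby.replicas", "1");
        p.setProperty("commit.interval.ms", "10000");
        p.setProperty("poll.ms", "100");
        return p;
    }

    // True when the heartbeat interval is at most one third of the session timeout.
    static boolean heartbeatWithinThird(Properties p) {
        long session = Long.parseLong(p.getProperty("session.timeout.ms"));
        long heartbeat = Long.parseLong(p.getProperty("heartbeat.interval.ms"));
        return heartbeat * 3 <= session;
    }

    public static void main(String[] args) {
        Properties p = settings();
        System.out.println("settings=" + p.size()
                + " heartbeatOk=" + heartbeatWithinThird(p));
        // prints "settings=7 heartbeatOk=true"
    }
}
```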

When punctuate is scheduled to be called every 1000ms or 3000ms, the problem 
happens every time. With punctuate scheduled for every 5000ms, I didn't see 
the problem in my test scenario (described above), but it still happened in my 
real application. This may have nothing to do with the issue, though, since 
punctuate is not even called while no messages are streaming through.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
