[jira] [Updated] (KAFKA-581) provides windows batch script for starting Kafka/Zookeeper

2012-12-07 Thread antoine vianey (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

antoine vianey updated KAFKA-581:
-

Attachment: zookeeper-server-stop.bat
kafka-server-stop.bat

Can you add these scripts to the 0.8 branch as well?
This is useful when starting ZooKeeper and Kafka from the maven-antrun plugin or 
another plugin that creates a separate process... ZooKeeper and Kafka can then be 
started and stopped in the pre- and post-integration-test phases.
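For illustration, a minimal sketch of how such scripts might be bound to the integration-test phases with maven-antrun; the plugin wiring, `${kafka.home}` property, and script arguments are assumptions for the sketch, not part of the attached patch:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-antrun-plugin</artifactId>
  <executions>
    <execution>
      <id>start-kafka</id>
      <phase>pre-integration-test</phase>
      <goals><goal>run</goal></goals>
      <configuration>
        <target>
          <!-- spawn="true" detaches the broker into its own process -->
          <exec executable="cmd" spawn="true" dir="${kafka.home}\bin\windows">
            <arg line="/c kafka-server-start.bat ..\..\config\server.properties"/>
          </exec>
        </target>
      </configuration>
    </execution>
    <execution>
      <id>stop-kafka</id>
      <phase>post-integration-test</phase>
      <goals><goal>run</goal></goals>
      <configuration>
        <target>
          <exec executable="cmd" dir="${kafka.home}\bin\windows">
            <arg line="/c kafka-server-stop.bat"/>
          </exec>
        </target>
      </configuration>
    </execution>
  </executions>
</plugin>
```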

Regards

 provides windows batch script for starting Kafka/Zookeeper
 --

 Key: KAFKA-581
 URL: https://issues.apache.org/jira/browse/KAFKA-581
 Project: Kafka
  Issue Type: Improvement
  Components: config
Affects Versions: 0.8
 Environment: Windows
Reporter: antoine vianey
Priority: Trivial
  Labels: features, run, windows
 Fix For: 0.8

 Attachments: kafka-console-consumer.bat, kafka-console-producer.bat, 
 kafka-run-class.bat, kafka-server-start.bat, kafka-server-stop.bat, sbt.bat, 
 zookeeper-server-start.bat, zookeeper-server-stop.bat

   Original Estimate: 24h
  Remaining Estimate: 24h

 Provide a port for quickstarting Kafka dev on Windows:
 - kafka-run-class.bat
 - kafka-server-start.bat
 - zookeeper-server-start.bat
 This will help the Kafka community grow.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Comment Edited] (KAFKA-581) provides windows batch script for starting Kafka/Zookeeper

2012-12-07 Thread antoine vianey (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13526440#comment-13526440
 ] 

antoine vianey edited comment on KAFKA-581 at 12/7/12 3:11 PM:
---

*Added the server-stop scripts*

Can you add these scripts to the 0.8 branch as well?
This is useful when starting ZooKeeper and Kafka from the maven-antrun plugin or 
another plugin that creates a separate process... ZooKeeper and Kafka can then be 
started and stopped in the pre- and post-integration-test phases.

Regards

  was (Author: avianey):
*Added the stop scripts*

Can you add these scripts to the 0.8 branch as well?
This is useful when starting ZooKeeper and Kafka from the maven-antrun plugin or 
another plugin that creates a separate process... ZooKeeper and Kafka can then be 
started and stopped in the pre- and post-integration-test phases.

Regards
  
 provides windows batch script for starting Kafka/Zookeeper
 --

 Key: KAFKA-581
 URL: https://issues.apache.org/jira/browse/KAFKA-581
 Project: Kafka
  Issue Type: Improvement
  Components: config
Affects Versions: 0.8
 Environment: Windows
Reporter: antoine vianey
Priority: Trivial
  Labels: features, run, windows
 Fix For: 0.8

 Attachments: kafka-console-consumer.bat, kafka-console-producer.bat, 
 kafka-run-class.bat, kafka-server-start.bat, kafka-server-stop.bat, sbt.bat, 
 zookeeper-server-start.bat, zookeeper-server-stop.bat

   Original Estimate: 24h
  Remaining Estimate: 24h




0.8/HEAD Console consumer breakage?

2012-12-07 Thread ben fleis
So I was testing my own code, using the console consumer against my
seemingly working producer code.  Since the last update, the console
consumer crashes.  I am going to try to track it down in the debugger and
will come back with a patch if I find one.


[jira] [Created] (KAFKA-663) Add deploy feature to System Test

2012-12-07 Thread John Fung (JIRA)
John Fung created KAFKA-663:
---

 Summary: Add deploy feature to System Test
 Key: KAFKA-663
 URL: https://issues.apache.org/jira/browse/KAFKA-663
 Project: Kafka
  Issue Type: Task
Reporter: John Fung
Assignee: John Fung






[jira] [Created] (KAFKA-664) Kafka server threads die due to OOME during long running test

2012-12-07 Thread Neha Narkhede (JIRA)
Neha Narkhede created KAFKA-664:
---

 Summary: Kafka server threads die due to OOME during long running 
test
 Key: KAFKA-664
 URL: https://issues.apache.org/jira/browse/KAFKA-664
 Project: Kafka
  Issue Type: Bug
Affects Versions: 0.8
Reporter: Neha Narkhede
Priority: Blocker
 Fix For: 0.8


I set up a Kafka cluster with 5 brokers (JVM memory 512M) and a long-running 
producer process that sends data to hundreds of partitions continuously for 
~15 hours. After ~4 hours of operation, a few server threads (acceptor and 
processor) exited due to OOMEs -

[2012-12-07 08:24:44,355] ERROR OOME with size 1700161893 
(kafka.network.BoundedByteBufferReceive)
java.lang.OutOfMemoryError: Java heap space
[2012-12-07 08:24:44,356] ERROR Uncaught exception in thread 'kafka-acceptor': 
(kafka.utils.Utils$)
java.lang.OutOfMemoryError: Java heap space
[2012-12-07 08:24:44,356] ERROR Uncaught exception in thread 
'kafka-processor-9092-1': (kafka.utils.Utils$)
java.lang.OutOfMemoryError: Java heap space
[2012-12-07 08:24:46,344] INFO Unable to reconnect to ZooKeeper service, 
session 0x13afd0753870103 has expired, closing socket connection 
(org.apache.zookeeper.ClientCnxn)
[2012-12-07 08:24:46,344] INFO zookeeper state changed (Expired) 
(org.I0Itec.zkclient.ZkClient)
[2012-12-07 08:24:46,344] INFO Initiating client connection, 
connectString=eat1-app309.corp:12913,eat1-app310.corp:12913,eat1-app311.corp:12913,eat1-app312.corp:12913,eat1-app313.corp:12913
 sessionTimeout=15000 watcher=org.I0Itec.zkclient.ZkClient@19202d69 
(org.apache.zookeeper.ZooKeeper)
[2012-12-07 08:24:55,702] ERROR OOME with size 2001040997 
(kafka.network.BoundedByteBufferReceive)
java.lang.OutOfMemoryError: Java heap space
[2012-12-07 08:25:01,192] ERROR Uncaught exception in thread 
'kafka-request-handler-0': (kafka.utils.Utils$)
java.lang.OutOfMemoryError: Java heap space
[2012-12-07 08:25:08,739] INFO Opening socket connection to server 
eat1-app311.corp/172.20.72.75:12913 (org.apache.zookeeper.ClientCnxn)
[2012-12-07 08:25:14,221] INFO Socket connection established to 
eat1-app311.corp/172.20.72.75:12913, initiating session 
(org.apache.zookeeper.ClientCnxn)
[2012-12-07 08:25:17,943] INFO Client session timed out, have not heard from 
server in 3722ms for sessionid 0x0, closing socket connection and attempting 
reconnect (org.apache.zookeeper.ClientCnxn)
[2012-12-07 08:25:19,805] ERROR error in loggedRunnable (kafka.utils.Utils$)
java.lang.OutOfMemoryError: Java heap space
[2012-12-07 08:25:23,528] ERROR OOME with size 1853095936 
(kafka.network.BoundedByteBufferReceive)
java.lang.OutOfMemoryError: Java heap space


It seems like the server runs out of memory while trying to read the producer 
request, but it's unclear so far. 
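For context, the "OOME with size 1700161893" lines suggest the broker sizes a buffer from a length field read off the wire. A hypothetical guard is sketched below; the name MaxRequestSize and the 100 MB limit are illustrative assumptions, not an actual Kafka setting:

```scala
import java.nio.ByteBuffer

// Hypothetical bound on a single request; not an actual Kafka config value.
val MaxRequestSize: Int = 100 * 1024 * 1024

// Validate the size field before allocating, so a corrupt or absurd length
// prefix raises a clean error instead of triggering an OutOfMemoryError.
def allocateRequestBuffer(size: Int): ByteBuffer = {
  require(size >= 0 && size <= MaxRequestSize, s"Invalid request size: $size")
  ByteBuffer.allocate(size)
}
```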



[jira] [Commented] (KAFKA-664) Kafka server threads die due to OOME during long running test

2012-12-07 Thread Neha Narkhede (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13526590#comment-13526590
 ] 

Neha Narkhede commented on KAFKA-664:
-

Another observation: the server is probably GCing quite a lot, since I see the 
following in the server logs -

[2012-12-07 09:32:14,742] INFO Client session timed out, have not heard from 
server in 1204905ms for sessionid 0x23afd074d6600ea, closing socket connection 
and attempting reconnect (org.apache.zookeeper.ClientCnxn)

The ZooKeeper session timeout is pretty high (15 seconds), and ZooKeeper is in the 
same DC as the Kafka cluster and the producer.

 Kafka server threads die due to OOME during long running test
 -

 Key: KAFKA-664
 URL: https://issues.apache.org/jira/browse/KAFKA-664
 Project: Kafka
  Issue Type: Bug
Affects Versions: 0.8
Reporter: Neha Narkhede
Priority: Blocker
  Labels: bugs
 Fix For: 0.8

 Attachments: thread-dump.log





[jira] [Updated] (KAFKA-651) Create testcases on auto create topics

2012-12-07 Thread John Fung (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Fung updated KAFKA-651:


Status: Patch Available  (was: Open)

 Create testcases on auto create topics
 --

 Key: KAFKA-651
 URL: https://issues.apache.org/jira/browse/KAFKA-651
 Project: Kafka
  Issue Type: Task
Reporter: John Fung
  Labels: replication-testing
 Attachments: kafka-651-v1.patch






[jira] [Updated] (KAFKA-651) Create testcases on auto create topics

2012-12-07 Thread John Fung (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Fung updated KAFKA-651:


Attachment: kafka-651-v1.patch

Uploaded kafka-651-v1.patch with one testcase to cover each functional group:
testcase_0011
testcase_0024
testcase_0119
testcase_0128
testcase_0134
testcase_0159
testcase_0209
testcase_0259
testcase_0309

 Create testcases on auto create topics
 --

 Key: KAFKA-651
 URL: https://issues.apache.org/jira/browse/KAFKA-651
 Project: Kafka
  Issue Type: Task
Reporter: John Fung
  Labels: replication-testing
 Attachments: kafka-651-v1.patch






[jira] [Commented] (KAFKA-664) Kafka server threads die due to OOME during long running test

2012-12-07 Thread Neha Narkhede (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13526719#comment-13526719
 ] 

Neha Narkhede commented on KAFKA-664:
-

Heap dump is here - 
http://people.apache.org/~nehanarkhede/kafka-misc/kafka-0.8/heap-dump.tar.gz
Almost all the largest objects trace back to 
RequestPurgatory$ExpiredRequestReaper as the GC root.

 Kafka server threads die due to OOME during long running test
 -

 Key: KAFKA-664
 URL: https://issues.apache.org/jira/browse/KAFKA-664
 Project: Kafka
  Issue Type: Bug
Affects Versions: 0.8
Reporter: Neha Narkhede
Priority: Blocker
  Labels: bugs
 Fix For: 0.8

 Attachments: thread-dump.log





[jira] [Commented] (KAFKA-664) Kafka server threads die due to OOME during long running test

2012-12-07 Thread Neha Narkhede (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13526720#comment-13526720
 ] 

Neha Narkhede commented on KAFKA-664:
-

I'm re-running the tests with that option now.

 Kafka server threads die due to OOME during long running test
 -

 Key: KAFKA-664
 URL: https://issues.apache.org/jira/browse/KAFKA-664
 Project: Kafka
  Issue Type: Bug
Affects Versions: 0.8
Reporter: Neha Narkhede
Priority: Blocker
  Labels: bugs
 Fix For: 0.8

 Attachments: thread-dump.log





[jira] [Commented] (KAFKA-664) Kafka server threads die due to OOME during long running test

2012-12-07 Thread Neha Narkhede (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13526808#comment-13526808
 ] 

Neha Narkhede commented on KAFKA-664:
-

The root cause seems to be that the watchersForKey map keeps growing. I see that we 
add keys to the map but never actually delete them.

 Kafka server threads die due to OOME during long running test
 -

 Key: KAFKA-664
 URL: https://issues.apache.org/jira/browse/KAFKA-664
 Project: Kafka
  Issue Type: Bug
Affects Versions: 0.8
Reporter: Neha Narkhede
Assignee: Jay Kreps
Priority: Blocker
  Labels: bugs
 Fix For: 0.8

 Attachments: thread-dump.log





[jira] [Commented] (KAFKA-644) System Test should run properly with mixed File System Pathname

2012-12-07 Thread Neha Narkhede (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13526851#comment-13526851
 ] 

Neha Narkhede commented on KAFKA-644:
-

+1. Thanks for the patch!

 System Test should run properly with mixed File System Pathname
 ---

 Key: KAFKA-644
 URL: https://issues.apache.org/jira/browse/KAFKA-644
 Project: Kafka
  Issue Type: Task
Reporter: John Fung
Assignee: John Fung
  Labels: replication-testing
 Attachments: kafka-644-v1.patch


 Currently, System Test assumes that all the entities (ZK, Broker, Producer, 
 Consumer) are running on machines which have the same file system pathname as 
 the machine on which the System Test scripts are running.
 Usually, our own local boxes would be like /home/kafka/. . .
 and remote boxes may look like /mnt/. . .
 In this case, System Test won't work properly.



[jira] [Closed] (KAFKA-644) System Test should run properly with mixed File System Pathname

2012-12-07 Thread Neha Narkhede (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neha Narkhede closed KAFKA-644.
---


 System Test should run properly with mixed File System Pathname
 ---

 Key: KAFKA-644
 URL: https://issues.apache.org/jira/browse/KAFKA-644
 Project: Kafka
  Issue Type: Task
Reporter: John Fung
Assignee: John Fung
  Labels: replication-testing
 Attachments: kafka-644-v1.patch





[jira] [Updated] (KAFKA-597) Refactor KafkaScheduler

2012-12-07 Thread Jay Kreps (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jay Kreps updated KAFKA-597:


Attachment: KAFKA-597-v4.patch

Patch v4. 
- Rebased
- Makes use of thread factory
- Fixed broken scaladoc

 Refactor KafkaScheduler
 ---

 Key: KAFKA-597
 URL: https://issues.apache.org/jira/browse/KAFKA-597
 Project: Kafka
  Issue Type: Bug
Affects Versions: 0.8.1
Reporter: Jay Kreps
Priority: Minor
 Attachments: KAFKA-597-v1.patch, KAFKA-597-v2.patch, 
 KAFKA-597-v3.patch, KAFKA-597-v4.patch


 It would be nice to clean up KafkaScheduler. Here is what I am thinking.
 Extract the following interface:
 trait Scheduler {
   def startup()
   def schedule(fun: () => Unit, name: String, delayMs: Long = 0, periodMs: Long): Scheduled
   def shutdown(interrupt: Boolean = false)
 }
 class Scheduled {
   def lastExecution: Long
   def cancel()
 }
 We would have two implementations, KafkaScheduler and MockScheduler. 
 KafkaScheduler would be a wrapper for ScheduledThreadPoolExecutor. 
 MockScheduler would only allow manual time advancement rather than using the 
 system clock; we would switch unit tests over to this.
 This change would differ from the existing scheduler in the following 
 ways:
 1. It would not return a ScheduledFuture (since this is useless).
 2. shutdown() would be a blocking call. The current shutdown calls don't 
 really do what people want.
 3. We would remove the daemon thread flag, as I don't think it works.
 4. It returns an object which lets you cancel the job or get the last 
 execution time.
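A minimal sketch of what the ScheduledThreadPoolExecutor-backed implementation could look like; the names mirror the proposal above, but the details are illustrative assumptions, not the eventual patch:

```scala
import java.util.concurrent.{ScheduledFuture, ScheduledThreadPoolExecutor, TimeUnit}
import java.util.concurrent.atomic.AtomicLong

// Handle returned to callers: cancel the job or read the last run time.
class Scheduled(future: ScheduledFuture[_], lastMs: AtomicLong) {
  def lastExecution: Long = lastMs.get // -1 until the first execution
  def cancel(): Unit = future.cancel(false)
}

class KafkaScheduler(threads: Int) {
  private val executor = new ScheduledThreadPoolExecutor(threads)

  def schedule(fun: () => Unit, name: String, delayMs: Long = 0, periodMs: Long = -1): Scheduled = {
    val lastMs = new AtomicLong(-1L)
    val runnable = new Runnable {
      def run(): Unit = { lastMs.set(System.currentTimeMillis); fun() }
    }
    val future =
      if (periodMs >= 0) executor.scheduleAtFixedRate(runnable, delayMs, periodMs, TimeUnit.MILLISECONDS)
      else executor.schedule(runnable, delayMs, TimeUnit.MILLISECONDS)
    new Scheduled(future, lastMs)
  }

  // Blocking shutdown, per point 2 of the proposal.
  def shutdown(): Unit = {
    executor.shutdown()
    executor.awaitTermination(1, TimeUnit.DAYS)
  }
}
```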



[jira] [Commented] (KAFKA-664) Kafka server threads die due to OOME during long running test

2012-12-07 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13526876#comment-13526876
 ] 

Joel Koshy commented on KAFKA-664:
--

To clarify, the map itself shouldn't grow indefinitely, right? That is, if there 
are no new partitions, the number of keys should stay the same. I think the issue 
is that expired requests (for a key) are not removed from the list of 
outstanding requests for that key.
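A toy model of the suspected leak; this is illustrative, not the actual kafka.server.RequestPurgatory code. Each key holds a list of delayed requests, and without an explicit purge, expired entries stay referenced forever:

```scala
import scala.collection.mutable

// Illustrative stand-in for a delayed produce/fetch request.
case class DelayedRequest(createdMs: Long, timeoutMs: Long) {
  def expired(nowMs: Long): Boolean = nowMs >= createdMs + timeoutMs
}

// Per-key watcher list, as in watchersForKey. Without purgeExpired,
// the list only ever grows under steady traffic to the same partitions.
class Watchers {
  private val requests = mutable.ListBuffer.empty[DelayedRequest]
  def add(r: DelayedRequest): Unit = requests += r
  def size: Int = requests.size
  // The kind of fix implied above: drop expired requests from the list
  // and report how many were reclaimed.
  def purgeExpired(nowMs: Long): Int = {
    val before = requests.size
    requests.filterInPlace(r => !r.expired(nowMs))
    before - requests.size
  }
}
```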

 Kafka server threads die due to OOME during long running test
 -

 Key: KAFKA-664
 URL: https://issues.apache.org/jira/browse/KAFKA-664
 Project: Kafka
  Issue Type: Bug
Affects Versions: 0.8
Reporter: Neha Narkhede
Assignee: Jay Kreps
Priority: Blocker
  Labels: bugs
 Fix For: 0.8

 Attachments: thread-dump.log





[jira] [Updated] (KAFKA-636) Make log segment delete asynchronous

2012-12-07 Thread Jay Kreps (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jay Kreps updated KAFKA-636:


Attachment: KAFKA-636-v1.patch

This patch implements asynchronous delete in the log.

To do this Log.scala now requires a scheduler to be used for scheduling the 
deletions.

The deletion works as described above.

The locking for segment deletion can now be more aggressive: since the file 
renames are assumed to be fast, they can be done inside the lock.

As part of testing this I also found a problem with MockScheduler, namely that 
it is not reentrant. That is, if scheduled tasks themselves create scheduled 
tasks, it misbehaves. To fix this I rewrote MockScheduler to use a priority 
queue. The code is simpler and more correct since it now performs all 
executions in the correct order too.

 Make log segment delete asynchronous
 

 Key: KAFKA-636
 URL: https://issues.apache.org/jira/browse/KAFKA-636
 Project: Kafka
  Issue Type: Bug
Reporter: Jay Kreps
Assignee: Jay Kreps
 Attachments: KAFKA-636-v1.patch


 We have a few corner-case bugs around delete of segment files:
 1. It is possible for delete and truncate to kind of cross streams and end up 
 with a case where you have no segments.
 2. Reads on the log have no locking (which is good) but as a result deleting 
 a segment that is being read will result in some kind of I/O exception.
 3. We can't easily fix the synchronization problems without deleting files 
 inside the log's write lock. This can be a problem as deleting a 2GB segment 
 can take a couple of seconds even on an unloaded system.
 The proposed fix for these problems is to make file removal asynchronous 
 using the following scheme as the new delete scheme:
 1. Immediately remove the file from the segment map and rename the file from X to 
 X.deleted (e.g. 000.log to 00.log.deleted). We think renaming a file 
 will not impact reads since the file is already open and hence the name is 
 irrelevant. This will always be O(1) and can be done inside the write lock.
 2. Schedule a future operation to delete the file. The time to wait would be 
 configurable but we would just default it to 60 seconds and probably no one 
 would ever change it.
 3. On startup we would delete any files with the .deleted suffix as they 
 would have been pending deletes that didn't take place.
 I plan to do this soon working against the refactored log (KAFKA-521). We can 
 opt to back port the patch for 0.8 if we are feeling daring.
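The three-step scheme above can be sketched roughly as follows (hypothetical names, not the actual Kafka code): rename inside the lock, delete later from a scheduler, and sweep leftovers on startup.

```java
import java.io.File;

// Illustrative sketch of the proposed asynchronous delete scheme.
public class AsyncDelete {
    public static final String DELETED_SUFFIX = ".deleted";

    // Step 1: O(1) rename, done while holding the log's write lock.
    public static File markDeleted(File segment) {
        File renamed = new File(segment.getPath() + DELETED_SUFFIX);
        if (!segment.renameTo(renamed))
            throw new RuntimeException("Failed to rename " + segment);
        return renamed;
    }

    // Step 2: the physical delete, run later by a scheduler
    // (the proposal defaults the delay to 60 seconds).
    public static void physicallyDelete(File renamed) {
        renamed.delete();
    }

    // Step 3: on startup, remove files left over from pending deletes.
    public static void cleanupOnStartup(File logDir) {
        File[] leftovers = logDir.listFiles(
            (dir, name) -> name.endsWith(DELETED_SUFFIX));
        if (leftovers != null)
            for (File f : leftovers)
                f.delete();
    }
}
```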



[jira] [Commented] (KAFKA-664) Kafka server threads die due to OOME during long running test

2012-12-07 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13526883#comment-13526883
 ] 

Joel Koshy commented on KAFKA-664:
--

Okay I'm slightly confused. Even on expiration the request is marked as 
satisfied. So even if it is not removed from the watcher's list during 
expiration it will be removed on the next call to collectSatisfiedRequests - 
which in this case will be when the next produce request arrives to that 
partition. Which means this should only be due to low-volume partitions that 
are no longer growing. i.e., the replica fetcher would keep issuing fetch 
requests that keep expiring but never get removed from the list of pending 
requests in watchersFor(the-low-volume-partition).

 Kafka server threads die due to OOME during long running test
 -

 Key: KAFKA-664
 URL: https://issues.apache.org/jira/browse/KAFKA-664
 Project: Kafka
  Issue Type: Bug
Affects Versions: 0.8
Reporter: Neha Narkhede
Assignee: Jay Kreps
Priority: Blocker
  Labels: bugs
 Fix For: 0.8

 Attachments: thread-dump.log


 I set up a Kafka cluster with 5 brokers (JVM memory 512M) and set up a long 
 running producer process that sends data to 100s of partitions continuously 
 for ~15 hours. After ~4 hours of operation, few server threads (acceptor and 
 processor) exited due to OOME -
 [2012-12-07 08:24:44,355] ERROR OOME with size 1700161893 
 (kafka.network.BoundedByteBufferReceive)
 java.lang.OutOfMemoryError: Java heap space
 [2012-12-07 08:24:44,356] ERROR Uncaught exception in thread 
 'kafka-acceptor': (kafka.utils.Utils$)
 java.lang.OutOfMemoryError: Java heap space
 [2012-12-07 08:24:44,356] ERROR Uncaught exception in thread 
 'kafka-processor-9092-1': (kafka.utils.Utils$)
 java.lang.OutOfMemoryError: Java heap space
 [2012-12-07 08:24:46,344] INFO Unable to reconnect to ZooKeeper service, 
 session 0x13afd0753870103 has expired, closing socket connection 
 (org.apache.zookeeper.ClientCnxn)
 [2012-12-07 08:24:46,344] INFO zookeeper state changed (Expired) 
 (org.I0Itec.zkclient.ZkClient)
 [2012-12-07 08:24:46,344] INFO Initiating client connection, 
 connectString=eat1-app309.corp:12913,eat1-app310.corp:12913,eat1-app311.corp:12913,eat1-app312.corp:12913,eat1-app313.corp:12913
  sessionTimeout=15000 watcher=org.I0Itec.zkclient.ZkClient@19202d69 
 (org.apache.zookeeper.ZooKeeper)
 [2012-12-07 08:24:55,702] ERROR OOME with size 2001040997 
 (kafka.network.BoundedByteBufferReceive)
 java.lang.OutOfMemoryError: Java heap space
 [2012-12-07 08:25:01,192] ERROR Uncaught exception in thread 
 'kafka-request-handler-0': (kafka.utils.Utils$)
 java.lang.OutOfMemoryError: Java heap space
 [2012-12-07 08:25:08,739] INFO Opening socket connection to server 
 eat1-app311.corp/172.20.72.75:12913 (org.apache.zookeeper.ClientCnxn)
 [2012-12-07 08:25:14,221] INFO Socket connection established to 
 eat1-app311.corp/172.20.72.75:12913, initiating session 
 (org.apache.zookeeper.ClientCnxn)
 [2012-12-07 08:25:17,943] INFO Client session timed out, have not heard from 
 server in 3722ms for sessionid 0x0, closing socket connection and attempting 
 reconnect (org.apache.zookeeper.ClientCnxn)
 [2012-12-07 08:25:19,805] ERROR error in loggedRunnable (kafka.utils.Utils$)
 java.lang.OutOfMemoryError: Java heap space
 [2012-12-07 08:25:23,528] ERROR OOME with size 1853095936 
 (kafka.network.BoundedByteBufferReceive)
 java.lang.OutOfMemoryError: Java heap space
 It seems like it runs out of memory while trying to read the producer 
 request, but it's unclear so far. 



[jira] [Commented] (KAFKA-664) Kafka server threads die due to OOME during long running test

2012-12-07 Thread Jay Kreps (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13526887#comment-13526887
 ] 

Jay Kreps commented on KAFKA-664:
-

Another issue is that we are saving the full producer request in memory for as 
long as it is in purgatory. Not sure that is causing this, but that is pretty 
bad.




[jira] [Updated] (KAFKA-644) System Test should run properly with mixed File System Pathname

2012-12-07 Thread John Fung (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Fung updated KAFKA-644:


Attachment: kafka-644-v2.patch

Uploaded kafka-644-v2.patch which supports the property auto_create_topic

 System Test should run properly with mixed File System Pathname
 ---

 Key: KAFKA-644
 URL: https://issues.apache.org/jira/browse/KAFKA-644
 Project: Kafka
  Issue Type: Task
Reporter: John Fung
Assignee: John Fung
  Labels: replication-testing
 Attachments: kafka-644-v1.patch, kafka-644-v2.patch


 Currently, System Test assumes that all the entities (ZK, Broker, Producer, 
 Consumer) are running in machines which have the same File System Pathname as 
 the machine in which the System Test scripts are running.
 Usually, our own local boxes would be like /home/kafka/. . .
 and remote boxes may look like /mnt/. . .
 In this case, System Test won't work properly.



[jira] [Updated] (KAFKA-646) Provide aggregate stats at the high level Producer and ZookeeperConsumerConnector level

2012-12-07 Thread Swapnil Ghike (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Swapnil Ghike updated KAFKA-646:


Attachment: kafka-646-patch-num1-v1.patch

This patch has a bunch of refactoring changes and a couple of new additions. 

Addressing Jun's comments: 
These are all great catches! Thanks for being so thorough.

60. By default, metrics-core will return an existing metric object of the same 
name using a getOrCreate() like functionality. As discussed offline, we should 
fail the clients that use an already registered clientId name. We will need to 
create two objects that contain hashmaps to record the existing producer and 
consumer clientIds and methods to throw an exception if a client attempts to 
use an existing clientId. I worked on this change a bit, but it breaks a lot of 
our unit tests (about half) and the refactoring will take some time. Hence, I 
think it will be better if I submit a patch for all other changes and create 
another patch for this issue under this jira. Until then we can keep this jira 
open.

61. For recording stats about all topics, I am now using a string All.Topics. 
Since '.' is not allowed in the legal character set for topic names, this will 
differentiate from a topic named AllTopics.

62. Yes, we should validate groupId. Added the functionality and a unit test. 
It has the same validation rules as ClientId.
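A shared validation of the kind described for clientId and groupId could be sketched as below. Note the exact legal character set and length limit here (alphanumerics, '_' and '-', max 255) are assumptions for illustration, not copied from the patch.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: clientId and groupId share the same validation rules.
public class IdValidator {
    private static final int MAX_LENGTH = 255;
    private static final Pattern LEGAL = Pattern.compile("[a-zA-Z0-9_-]*");

    public static void validate(String id) {
        if (id.length() > MAX_LENGTH)
            throw new IllegalArgumentException("id longer than " + MAX_LENGTH);
        if (!LEGAL.matcher(id).matches())
            throw new IllegalArgumentException("illegal characters in id: " + id);
    }
}
```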

63. A metric name is something like (clientId + topic + some string) and this 
entire string is limited by the filename size. We already allow topic names to 
be at most 255 bytes long. We could fix max lengths for each of clientId, 
groupId, and topic name so that the metric name never exceeds the filename 
size. But those lengths would be quite arbitrary; perhaps we should skip the 
check on the length of clientId and groupId. 

64. Removed brokerInfo from the clientId used to instantiate 
FetchRequestBuilder.


Refactoring: 
1. Moved validation of clientId at the end of instantiation of ProducerConfig 
and ConsumerConfig. 
- Created static objects ProducerConfig and ConsumerConfig which contain a 
validate() method.

2. Created global *Registry objects in which each high level Producer and 
Consumer can register their *stats objects.
- These objects are registered in the static object only once using 
utils.Pool.getAndMaybePut functionality. 
- This will remove the need to pass *stats objects around the code in 
constructors (I thought having the metrics objects right up in the constructors 
was a bit intrusive, since one doesn't quite always think about the monitoring 
mechanism while instantiating various modules of the program, for example while 
unit testing.)
- Instead of the constructor, each concerned class obtains the *Stats objects 
from the global registry object.
- This cleans up any metrics objects created in the unit tests.
- Special mention: The producer constructors are back to their old selves. 
With clientId validation moved to the *Config objects, the intermediate Producer 
constructor that merely separated the parameters of a quadruplet is gone.

3. Created separate files
-  for ProducerStats, ProducerTopicStats, ProducerRequestStats in 
kafka.producer package and for FetchRequestAndResponseStats in kafka.consumer 
package. Thought it was appropriate given that we already had 
ConsumerTopicStats in a separate file, and since the code for metrics had 
increased in size due to addition of *Registry and Aggregated* objects. Added 
comments.
- for objects Topic, ClientId and GroupId in kafka.utils package.
- to move the helper case classes ClientIdAndTopic, ClientIdAndBroker to 
kafka.common package. 

4. Renamed a few variables to easier names (anyOldName to metricId change).


New additions: 
1. Added two objects to aggregate metrics recorded by SyncProducers and 
SimpleConsumers at the high level Producer and Consumer. 
- For this, changed KafkaTimer to accept a list of Timers. Typically we will 
pass a specificTimer and a globalTimer to this KafkaTimer class. Created a new 
KafkaHistogram in a similar way.

2. Validation of groupId.

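The "KafkaTimer over a list of timers" idea in new addition 1 can be sketched like this (hypothetical names; the real code builds on metrics-core timers): one wrapper records each measurement into several underlying timers at once, typically a specific timer plus a global aggregate timer.

```java
import java.util.Arrays;
import java.util.List;

// Sketch of the aggregation idea: time a block of work once and record the
// elapsed time in every underlying timer.
public class MultiTimer {
    // Minimal stand-in for a metrics timer; the real code uses metrics-core.
    public static class SimpleTimer {
        long count = 0;
        long totalNanos = 0;
        void update(long nanos) { count++; totalNanos += nanos; }
    }

    private final List<SimpleTimer> timers;

    public MultiTimer(SimpleTimer... timers) {
        this.timers = Arrays.asList(timers);
    }

    // Run the work, then record the elapsed time in all timers.
    public <T> T time(java.util.function.Supplier<T> work) {
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            long elapsed = System.nanoTime() - start;
            for (SimpleTimer t : timers) t.update(elapsed);
        }
    }
}
```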

Issues:
1. Initializing the aggregator metrics with default values: For example, let's 
say that a syncProducer is created (which will register a 
ProducerRequestStats mbean for this syncProducer). However, if no request is 
sent by this syncProducer then the absence of its data is not reflected in the 
aggregator histogram. For instance, the min requestSize for a syncProducer 
that never sent a request will be 0, but this won't be accurately represented 
in the aggregator histogram. Thus, we need to understand that if the request 
count of a syncProducer is 0, then its data will not be accurately reflected in 
the aggregator histogram.

The question is whether it is possible to inform the aggregator histogram of 
some default values without increasing the request count of any syncProducer or 
the aggregated stats.


Further 

[jira] [Updated] (KAFKA-664) Kafka server threads die due to OOME during long running test

2012-12-07 Thread Neha Narkhede (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neha Narkhede updated KAFKA-664:


Attachment: watchersForKey.png
kafka-664-draft.patch

The problem was ever-increasing requests in the watchersForKey map. Please look 
at the graph attached.
This can happen for very low volume topics since the replica fetcher requests 
keep entering this map, and since there are no more produce requests coming for 
those topics/partitions, no one ever removes those requests from the map. 

With Joel's help, hacked RequestPurgatory to force the cleanup of 
expired/satisfied requests by the expiry thread inside purgeSatisfied. Of 
course, a better solution is re-designing the purgatory data structure to point 
from the queue to the map, but that is a bigger change. I just want to get 
around this issue and continue performance testing.
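The workaround described can be sketched roughly as follows (hypothetical names, not the actual RequestPurgatory code): the expiry thread walks the per-key watcher lists and drops every request that is already satisfied or expired, so low-volume partitions cannot accumulate an unbounded list of dead fetch requests.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Illustrative sketch of forcing cleanup of expired/satisfied requests
// from the watchersForKey map.
public class Purgatory {
    public static class DelayedRequest {
        volatile boolean satisfied = false;
        final long expiresAtMs;
        DelayedRequest(long expiresAtMs) { this.expiresAtMs = expiresAtMs; }
    }

    final Map<String, List<DelayedRequest>> watchersForKey = new ConcurrentHashMap<>();

    public void watch(String key, DelayedRequest r) {
        watchersForKey.computeIfAbsent(key, k -> new CopyOnWriteArrayList<>()).add(r);
    }

    // Called periodically by the expiry thread: purge satisfied or expired
    // requests from every watcher list, returning how many were removed.
    public int purgeSatisfied(long nowMs) {
        int purged = 0;
        for (List<DelayedRequest> watchers : watchersForKey.values()) {
            // CopyOnWriteArrayList iterators don't support remove(); collect then removeAll.
            List<DelayedRequest> dead = new ArrayList<>();
            for (DelayedRequest r : watchers)
                if (r.satisfied || r.expiresAtMs <= nowMs)
                    dead.add(r);
            watchers.removeAll(dead);
            purged += dead.size();
        }
        return purged;
    }
}
```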





[jira] [Comment Edited] (KAFKA-664) Kafka server threads die due to OOME during long running test

2012-12-07 Thread Neha Narkhede (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13526973#comment-13526973
 ] 

Neha Narkhede edited comment on KAFKA-664 at 12/8/12 1:56 AM:
--

The problem was ever-increasing requests in the watchersForKey map. Please look 
at the graph attached. In merely 40 minutes of running the broker, the number 
of requests in the purgatory map shot up to 4 million.
This can happen for very low volume topics since the replica fetcher requests 
keep entering this map, and since there are no more produce requests coming for 
those topics/partitions, no one ever removes those requests from the map. 

With Joel's help, hacked RequestPurgatory to force the cleanup of 
expired/satisfied requests by the expiry thread inside purgeSatisfied. Of 
course, a better solution is re-designing the purgatory data structure to point 
from the queue to the map, but that is a bigger change. I just want to get 
around this issue and continue performance testing.


  was (Author: nehanarkhede):
The problem was ever increasing requests in the watchersForKey map. Please 
look at the graph attached.
This can happen for very low volume topics since the replica fetcher requests 
keep entering this map, and since there are no more produce requests coming for 
those topics/partitions, no one ever removes those requests from the map. 

With Joel's help, hacked RequestPurgatory to force the cleanup of 
expired/satisfied requests by the expiry thread inside purgeSatisfied. Of 
course, a better solution is re-designing the purgatory data structure to point 
from the queue to the map, but that is a bigger change. I just want to get 
around this issue and continue performance testing.

  


[jira] [Updated] (KAFKA-597) Refactor KafkaScheduler

2012-12-07 Thread Jay Kreps (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jay Kreps updated KAFKA-597:


Attachment: KAFKA-597-v5.patch

Thanks, new patch v5 addresses your comments:
- Improved javadoc
- This is actually good. I thought about it a bit, and since I am making 
shutdown block, the only time daemon vs. non-daemon comes into play is if you 
don't call shutdown. If that is the case, non-daemon threads will prevent 
garbage collection of the scheduler tasks and eventually block shutdown of the 
JVM, which seems unnecessary.
- The change to shutdownNow is not good. This will invoke interrupt on all 
threads, which is too aggressive. Better to let them finish. If we end up 
needing to schedule long-running tasks we can invent a new notification 
mechanism. I changed this so that we use normal shutdown instead.

 Refactor KafkaScheduler
 ---

 Key: KAFKA-597
 URL: https://issues.apache.org/jira/browse/KAFKA-597
 Project: Kafka
  Issue Type: Bug
Affects Versions: 0.8.1
Reporter: Jay Kreps
Priority: Minor
 Attachments: KAFKA-597-v1.patch, KAFKA-597-v2.patch, 
 KAFKA-597-v3.patch, KAFKA-597-v4.patch, KAFKA-597-v5.patch


 It would be nice to cleanup KafkaScheduler. Here is what I am thinking
 Extract the following interface:
 trait Scheduler {
   def startup()
   def schedule(fun: () => Unit, name: String, delayMs: Long = 0, periodMs: Long): Scheduled
   def shutdown(interrupt: Boolean = false)
 }
 class Scheduled {
   def lastExecution: Long
   def cancel()
 }
 We would have two implementations, KafkaScheduler and  MockScheduler. 
 KafkaScheduler would be a wrapper for ScheduledThreadPoolExecutor. 
 MockScheduler would only allow manual time advancement rather than using the 
 system clock, we would switch unit tests over to this.
 This change would be different from the existing scheduler in the following 
 ways:
 1. Would not return a ScheduledFuture (since this is useless)
 2. shutdown() would be a blocking call. The current shutdown calls don't 
 really do what people want.
 3. We would remove the daemon thread flag, as I don't think it works.
 4. It returns an object which lets you cancel the job or get the last 
 execution time.
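The KafkaScheduler side of this proposal could be sketched roughly as below (a hypothetical Java rendering for illustration; the real implementation is Scala): a thin wrapper over ScheduledThreadPoolExecutor whose shutdown() blocks until queued tasks finish rather than interrupting them.

```java
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Sketch of the proposed KafkaScheduler shape: wraps a
// ScheduledThreadPoolExecutor and offers a blocking shutdown.
public class KafkaScheduler {
    private ScheduledThreadPoolExecutor executor;
    private final int threads;

    public KafkaScheduler(int threads) { this.threads = threads; }

    public void startup() {
        executor = new ScheduledThreadPoolExecutor(threads);
    }

    // periodMs <= 0 means a one-shot task.
    public void schedule(Runnable fun, long delayMs, long periodMs) {
        if (periodMs > 0)
            executor.scheduleAtFixedRate(fun, delayMs, periodMs, TimeUnit.MILLISECONDS);
        else
            executor.schedule(fun, delayMs, TimeUnit.MILLISECONDS);
    }

    // Blocking shutdown: let in-flight tasks complete instead of interrupting.
    public void shutdown() throws InterruptedException {
        executor.shutdown();
        executor.awaitTermination(1, TimeUnit.DAYS);
    }
}
```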
