[
https://issues.apache.org/jira/browse/FLUME-1037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13235369#comment-13235369
]
Will McQueen edited comment on FLUME-1037 at 3/22/12 5:57 AM:
--------------------------------------------------------------
I now understand what's going on. NetcatSource is not creating one event per
line of newline-separated text. Instead, it's creating one event per connection
(the event body ends when the connection closes).
The NetcatSource javadoc says:
"A netcat-like source that listens on a given port and turns each line of text
into an event....The expectation is that the supplied data is newline separated
text. Each line of text is turned into a Flume event and sent via the connected
channel."
The handler code (in NetcatSource.NetcatSocketHandler.run) that reads from the
socket is here:
while (reader.read(buffer) != -1) {
  buffer.flip();
  logger.debug("read {} characters", buffer.remaining());
  counterGroup.addAndGet("characters.received",
      Long.valueOf(buffer.limit()));
  builder.append(buffer.array(), buffer.position(), buffer.length());
}
As you can see, the buffer (which is 512 chars in size) is reused to read from
the server port until the connection is closed (EOF) by the client. That means
that the StringBuilder continues to grow without bound (assuming that you have
a continuous input source such as with "yes | nc localhost 1473"). So no matter
what you specify for -Xmx in flume-env.sh, you're guaranteed to eventually run
out of memory (which will sever the connection, and cause the source to restart
due to the LifecycleSupervisor's AlwaysRestartPolicy, allowing a new connection
over the same port).
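To illustrate the mechanism, here is a standalone sketch of my own (not Flume code): it mimics the handler loop by appending a 512-char buffer to a single StringBuilder on every iteration, showing that memory use grows with the number of reads, not the line length.

```java
import java.util.Arrays;

// Standalone sketch (not Flume code): mimics the handler loop, which appends
// the whole read buffer into one StringBuilder on every iteration.
public class UnboundedGrowthDemo {
    public static void main(String[] args) {
        StringBuilder builder = new StringBuilder();
        char[] buffer = new char[512];     // same size as the NetcatSource buffer
        Arrays.fill(buffer, 'y');          // stand-in for "yes | nc" input
        for (int i = 0; i < 1_000; i++) {  // stand-in for an endless socket read loop
            builder.append(buffer, 0, buffer.length);
        }
        // After 1,000 simulated reads the builder already holds 512,000 chars;
        // with a real continuous client it grows until the heap is exhausted.
        System.out.println(builder.length()); // prints 512000
    }
}
```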
In case you're interested in my approach, I started with a very large heap size
and used YourKit Java Profiler (there are other profilers out there too, such
as JProfiler) to take a snapshot of memory and identify the objects with the
largest retained sizes (one of which turned out to be a single instance of
class StringBuilder). I then filtered out "unreachable but not yet collected"
objects, sorted the list by "Retained Size", and looked at the top 3 hotspots:
char[], Thread, and StringBuilder. There was only 1 object of class
StringBuilder, and it had a retained size of over 500 MB! So I drilled down
into that object (highlighting it and choosing "Selected Objects"), which
showed the value of the StringBuilder object to be a single char[] of over 250
million 'y' characters. Going to the <Paths from GC Roots> tab and clicking
"start calculation" showed that the huge char[] was held by a local variable
called "builder" in a thread named "netcat-handler-0"... which brought me to
the while loop I pasted earlier.
I initially used NULL for the sink, but if you replace that with LOGGER you'll
see that only a single event is output... and it's output only after the client
closes the connection to the netcat source. I tried the LOGGER sink after Hari
showed me that, when he used the netcat source on his Mac, only a single event
was output to the logger after closing the netcat connection.
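For what it's worth, one way to get the documented one-event-per-line behavior would be to split on newlines as data arrives instead of buffering the whole connection. The sketch below is my own illustration, not a proposed patch: `readEvents` and the List of Strings are stand-ins for building a Flume event per line and putting it on the channel.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: emit one "event" per newline-terminated line so
// that nothing accumulates across the lifetime of the connection.
public class LinePerEventSketch {
    // Stand-in for creating a Flume event per line and putting it on the channel.
    static List<String> readEvents(BufferedReader reader) throws IOException {
        List<String> events = new ArrayList<>();
        String line;
        while ((line = reader.readLine()) != null) {
            events.add(line); // per line, so memory is bounded by the longest line
        }
        return events;
    }

    public static void main(String[] args) throws IOException {
        // Simulate a short "yes | nc" session: three 'y' lines, then EOF.
        List<String> events =
            readEvents(new BufferedReader(new StringReader("y\ny\ny\n")));
        System.out.println(events.size() + " events"); // prints "3 events"
    }
}
```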
> NETCAT handler threads terminate under stress test
> --------------------------------------------------
>
> Key: FLUME-1037
> URL: https://issues.apache.org/jira/browse/FLUME-1037
> Project: Flume
> Issue Type: Bug
> Components: Sinks+Sources
> Affects Versions: v1.2.0
> Environment: [CentOS 6.2 64-bit]
> Reporter: Will McQueen
>
> Steps:
> 1. Use a props file such as the following:
> # a = agent
> # r = source
> # c = channel
> # k = sink
> a1.sources = r1
> a1.channels = c1
> a1.sinks = k1
> # ===SOURCES===
> a1.sources.r1.type = NETCAT
> a1.sources.r1.channels = c1
> a1.sources.r1.bind = localhost
> a1.sources.r1.port = 1473
> # ===CHANNELS===
> a1.channels.c1.type = MEMORY
> # ===SINKS===
> a1.sinks.k1.type = NULL
> a1.sinks.k1.channel = c1
> 2. Set the FLUME_CONF_DIR to point to your conf dir
> [will@localhost flume-1.2.0-incubating-SNAPSHOT]$ export
> FLUME_CONF_DIR=/home/will/git/apache/flume/flume-1.2.0-incubating-SNAPSHOT/conf
> 3. Create a flume-env.sh file
> [will@localhost flume-1.2.0-incubating-SNAPSHOT]$ cp
> conf/flume-env.sh.template conf/flume-env.sh
> 4. Adjust the memory size within flume-env.sh (this file will be
> automatically sourced when calling bin/flume-ng, but only if you've specified
> the FLUME_CONF_DIR env var)
> (here, I went to the extreme and I set the min and max heap to 1GB. I also
> specified a YourKit profiler agent)
> Sample contents of flume-env.sh:
> export JAVA_OPTS="-Xms1024m -Xmx1024m
> -agentpath:/home/will/tools/yjp-10.0.6/bin/linux-x86-64/libyjpagent.so=tracing,noj2ee"
> 5. Run the flume NG agent:
> bin/flume-ng node --conf conf --conf-file conf/a1.properties --name a1
> 6. Open up 10 terminal windows (on the same host) to connect to the netcat
> server port. Send continuous output in each terminal. I chose to use the
> command:
> yes | nc localhost 1473
> The "yes" unix cmd will continuously output 'y' char, followed by newline
> char. If you use YourKit and go into the Threads view, you'll see that after
> a while (possibly need to wait up to 10 mins) after a netcat handler thread
> has continuously been alternating between Runnable and Blocked states
> (blocking due to org.apache.klog4j.Category.log(..), but that's beside the
> point), the netcat handler thread enters a continuous wait state for exactly
> 1 minute, and then terminates (while its associated 'yes | nc localhost 1473'
> command is still running).
> I haven't done further analysis. My first thought was a thread safety issue.
> Note that there are no property file reconfigurations done during this test
> -- I leave the props file alone.
> I welcome your ideas/comments. I initially ran this test with the default
> -Xmx20m but it ran out of memory. For a future test I might lower the Xmx/Xms
> from 1GB to maybe 128MB.