[ https://issues.apache.org/jira/browse/FLUME-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pal Konyves updated FLUME-2517:
-------------------------------
    Description: 
I started investigating why the HDFS sink has such poor throughput in 
v1.5.0.0. It seems to be better in 1.6.0.0 (current trunk).

The PseudoTx channel was filling up because the HDFS sink could not write as 
fast as data was arriving from the source.

Profiling with jconsole revealed that 30% of the time spent in the 
HDFSEventSink.process() method is taken by constructing SimpleDateFormat 
objects. SimpleDateFormat is notoriously heavy and time-consuming to 
create. It is also not thread-safe.

It is used in the HDFS sink to compute paths that contain date-time 
wildcards. I will provide a patch to cache SimpleDateFormat objects per 
thread. With this patch, the PseudoTx channel I used for testing was no longer 
constantly filling up, and throughput was much better.
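
To illustrate the idea, here is a minimal sketch of per-thread caching. It is 
not the attached patch: the class name DateFormatCache and its methods are 
hypothetical, and it assumes formatters are looked up by pattern string. 
Because SimpleDateFormat is not thread-safe, each thread keeps its own map, so 
an instance is constructed once per pattern per thread and never shared.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper for illustration only; not part of Flume. */
final class DateFormatCache {

    // One map per thread: SimpleDateFormat instances are never shared
    // between threads, so no synchronization is needed.
    private static final ThreadLocal<Map<String, SimpleDateFormat>> CACHE =
        new ThreadLocal<Map<String, SimpleDateFormat>>() {
            @Override
            protected Map<String, SimpleDateFormat> initialValue() {
                return new HashMap<String, SimpleDateFormat>();
            }
        };

    private DateFormatCache() {
    }

    /** Returns the calling thread's formatter for the pattern, creating it on first use. */
    static SimpleDateFormat get(String pattern) {
        Map<String, SimpleDateFormat> formats = CACHE.get();
        SimpleDateFormat format = formats.get(pattern);
        if (format == null) {
            // The expensive constructor runs only once per pattern per thread.
            format = new SimpleDateFormat(pattern);
            formats.put(pattern, format);
        }
        return format;
    }

    /** Formats a timestamp (milliseconds since the epoch) with the cached formatter. */
    static String format(String pattern, long timestampMillis) {
        return get(pattern).format(new Date(timestampMillis));
    }
}
```

A caller resolving a date-time wildcard could then use something like 
DateFormatCache.format("yyyy-MM-dd", timestamp) instead of constructing a new 
SimpleDateFormat for every event. The trade-off is a small amount of memory 
held per thread for each distinct pattern.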

  was:
I started investigating why HDFS sink has so bad throughput in v 1.5.0.0. It 
seems to be better in 1.6.0.0 (current trunk).

PseudoTx channel was filling up, because HDFS Sink could not write as fast as 
data coming from sink.

Profiling from jconsole revealed that 30% of the time spent in 
HDFSEventSink.process() method is taken by constructing SimpleDateFormat 
objects. SimpleDateFormat object is notoriously a heavy and time consuming 
object to create. It is also not thread-safe.

It is used in HDFS Sink to calculate the path that contains date-time 
wildcards. I will provide a patch to cache SimpleDateFormat objects for thread. 
With this patch, the PseudoTx channel I used for testing was not constantly 
filling up, and throughput was much better.


> Performance issue: SimpleDateFormat constructor takes 30% of 
> HDFSEventSink.process()
> ------------------------------------------------------------------------------------
>
>                 Key: FLUME-2517
>                 URL: https://issues.apache.org/jira/browse/FLUME-2517
>             Project: Flume
>          Issue Type: Bug
>          Components: Sinks+Sources
>    Affects Versions: v1.5.0.1
>         Environment: linux i686
> java version "1.7.0_45"
>            Reporter: Pal Konyves
>              Labels: performance
>
> I started investigating why the HDFS sink has such poor throughput in 
> v1.5.0.0. It seems to be better in 1.6.0.0 (current trunk).
> The PseudoTx channel was filling up because the HDFS sink could not write as 
> fast as data was arriving from the source.
> Profiling with jconsole revealed that 30% of the time spent in the 
> HDFSEventSink.process() method is taken by constructing SimpleDateFormat 
> objects. SimpleDateFormat is notoriously heavy and time-consuming to 
> create. It is also not thread-safe.
> It is used in the HDFS sink to compute paths that contain date-time 
> wildcards. I will provide a patch to cache SimpleDateFormat objects per 
> thread. With this patch, the PseudoTx channel I used for testing was no 
> longer constantly filling up, and throughput was much better.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
