[
https://issues.apache.org/jira/browse/FLUME-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180674#comment-14180674
]
Pal Konyves edited comment on FLUME-2517 at 10/22/14 10:30 PM:
---------------------------------------------------------------
I think the other option is the one below, but then we also need to synchronize
access to the Map for thread-safety and visibility, or maybe use a ConcurrentMap.
The problem is that if many threads use the same SimpleDateFormat object,
contention on the mutex can be slow, although I have not measured how heavy the
concurrent access is.
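{code}
// Shared across threads; access to the map itself must also be thread-safe.
Map<String, SimpleDateFormat> cache;
SimpleDateFormat sdf = cache.get(formatString);
// SimpleDateFormat is not thread-safe, so lock on the instance while formatting.
synchronized (sdf) {
  // format() takes the date, not the pattern; "timestamp" (epoch millis) is an assumed name
  sdf.format(new Date(timestamp));
}
{code}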
... I don't insist on the thread-local solution; I would just like to
avoid creating a SimpleDateFormat object for every event.
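
For comparison, here is a minimal sketch of the thread-local variant, in which each thread keeps its own map of formatters so no locking is needed. The class name DateFormatCache, the method format and its parameters are hypothetical and chosen for illustration; this is not the code from flume_2517.patch, only one way the idea could look on Java 7.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.HashMap;
import java.util.Map;
import java.util.TimeZone;

// Hypothetical helper for illustration; not taken from the Flume code base.
final class DateFormatCache {
  // One private map per thread, so neither the map nor the formatters are shared.
  private static final ThreadLocal<Map<String, SimpleDateFormat>> CACHE =
      new ThreadLocal<Map<String, SimpleDateFormat>>() {
        @Override
        protected Map<String, SimpleDateFormat> initialValue() {
          return new HashMap<String, SimpleDateFormat>();
        }
      };

  private DateFormatCache() {}

  // Formats epochMillis with the given pattern; a formatter is constructed
  // at most once per (thread, pattern) pair instead of once per event.
  static String format(String pattern, long epochMillis, TimeZone tz) {
    Map<String, SimpleDateFormat> perThread = CACHE.get();
    SimpleDateFormat sdf = perThread.get(pattern);
    if (sdf == null) {
      sdf = new SimpleDateFormat(pattern);
      perThread.put(pattern, sdf);
    }
    sdf.setTimeZone(tz); // set on every call because the cached instance is reused
    return sdf.format(new Date(epochMillis));
  }
}
```

The trade-off against the synchronized shared cache is memory for contention: every sink thread holds one formatter per distinct pattern, but no thread ever waits on another.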
> Performance issue: SimpleDateFormat constructor takes 30% of
> HDFSEventSink.process()
> ------------------------------------------------------------------------------------
>
> Key: FLUME-2517
> URL: https://issues.apache.org/jira/browse/FLUME-2517
> Project: Flume
> Issue Type: Bug
> Components: Sinks+Sources
> Affects Versions: v1.5.0.1
> Environment: linux i686
> java version "1.7.0_45"
> Reporter: Pal Konyves
> Labels: performance
> Attachments: flume_2517.patch, flume_2517.png
>
>
> I started investigating why the HDFS sink has such poor throughput in v 1.5.0.0. It
> seems to be better in 1.6.0.0 (current trunk).
> The PseudoTx channel was filling up because the HDFS sink could not write as fast as
> data was coming from the source.
> Profiling with jconsole revealed that 30% of the time spent in the
> HDFSEventSink.process() method is taken by constructing SimpleDateFormat
> objects. SimpleDateFormat is notoriously heavy and time-consuming
> to create. It is also not thread-safe.
> It is used in the HDFS sink to calculate the path that contains date-time
> wildcards. I will provide a patch to cache SimpleDateFormat objects per
> thread. With this patch, the PseudoTx channel I used for testing was no longer
> constantly filling up, and throughput was much better.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)