Liang Zhou created FLUME-3221:
---------------------------------
Summary: using spooling dir and hdfs sink (hdfs.codeC = lzop),
found data loss when setting hdfs.filePrefix = %{basename}
Key: FLUME-3221
URL: https://issues.apache.org/jira/browse/FLUME-3221
Project: Flume
Issue Type: Bug
Components: Sinks+Sources
Affects Versions: 1.8.0
Environment: Java version "1.8.0_151"
Java(TM) SE Runtime Environment (build 1.8.0_151-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.151-b12, mixed mode)
Hadoop 2.6.3
lzop native lib: hadoop-lzo-0.4.20-SNAPSHOT.jar
Reporter: Liang Zhou
Fix For: notrack
Attachments: flume_shutdown.log
A Flume agent configured with the following parameters causes this problem:
*Configuration*
{code:java}
spool_flume1.sources = spool-source-spool
spool_flume1.channels = hdfs-channel-spool
spool_flume1.sinks = hdfs-sink-spool
spool_flume1.sources.spool-source-spool.type = spooldir
spool_flume1.sources.spool-source-spool.channels = hdfs-channel-spool
spool_flume1.sources.spool-source-spool.spoolDir = /home/test/flume_log
spool_flume1.sources.spool-source-spool.recursiveDirectorySearch = true
spool_flume1.sources.spool-source-spool.fileHeader = true
spool_flume1.sources.spool-source-spool.deserializer = LINE
spool_flume1.sources.spool-source-spool.deserializer.maxLineLength = 100000000
spool_flume1.sources.spool-source-spool.inputCharset = UTF-8
spool_flume1.sources.spool-source-spool.basenameHeader = true
spool_flume1.channels.hdfs-channel-spool.type = memory
spool_flume1.channels.hdfs-channel-spool.keep-alive = 60
spool_flume1.sinks.hdfs-sink-spool.channel = hdfs-channel-spool
spool_flume1.sinks.hdfs-sink-spool.type = hdfs
spool_flume1.sinks.hdfs-sink-spool.hdfs.writeFormat = Text
spool_flume1.sinks.hdfs-sink-spool.hdfs.fileType = CompressedStream
spool_flume1.sinks.hdfs-sink-spool.hdfs.codeC = lzop
spool_flume1.sinks.hdfs-sink-spool.hdfs.threadsPoolSize = 1
spool_flume1.sinks.hdfs-sink-spool.hdfs.callTimeout = 100000
spool_flume1.sinks.hdfs-sink-spool.hdfs.idleTimeout = 36
spool_flume1.sinks.hdfs-sink-spool.hdfs.useLocalTimeStamp = true
spool_flume1.sinks.hdfs-sink-spool.hdfs.filePrefix = %{basename}
spool_flume1.sinks.hdfs-sink-spool.hdfs.path = /user/test/flume_test
spool_flume1.sinks.hdfs-sink-spool.hdfs.rollCount = 0
spool_flume1.sinks.hdfs-sink-spool.hdfs.rollSize = 134217728
spool_flume1.sinks.hdfs-sink-spool.hdfs.rollInterval = 0
spool_flume1.sources.spool-source-spool.includePattern = log.*-1_2018.*$
spool_flume1.sources.spool-source-spool.batchSize = 100
spool_flume1.channels.hdfs-channel-spool.capacity = 1000
spool_flume1.channels.hdfs-channel-spool.transactionCapacity = 100
{code}
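For reference, a minimal sketch of how the %\{basename} placeholder is resolved: with basenameHeader = true, the spooling directory source puts each source file's name into the "basename" header, and the HDFS sink expands the prefix from event headers. This sketch assumes Flume's org.apache.flume.formatter.output.BucketPath helper (the class the sink uses for placeholder resolution); the file name in it is hypothetical:
{code:java}
import java.util.HashMap;
import java.util.Map;

import org.apache.flume.formatter.output.BucketPath;

// Sketch only: shows how a %{basename} placeholder is expanded from
// event headers. The header value is a hypothetical source file name.
public class BasenameEscapeDemo {
  public static void main(String[] args) {
    Map<String, String> headers = new HashMap<>();
    headers.put("basename", "log.server-1_20180101");

    // BucketPath.escapeString resolves %{header} placeholders against
    // the supplied headers map.
    String prefix = BucketPath.escapeString("%{basename}", headers);
    System.out.println(prefix); // log.server-1_20180101
  }
}
{code}
Since every source file yields a distinct prefix, the sink presumably keeps a separate open bucket writer per source file, so many files read in parallel mean many concurrently open compressed streams.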
The test data adds up to 4.2 GB, amounting to 5,271,962 lines.
We expected the data to be stored in lzop format on HDFS, with files named
%\{basename}_%\{LocalTimeStamp}.
However, in our tests the sink {color:#FF0000}mixed data from different source
files into the same output files{color}, and the {color:#FF0000}total amount of
uploaded data was less than the local data{color}.
Our test cases are listed below:
* using DataStream, uploading works normally whether filePrefix = %\{basename}
is set or not
* using CompressedStream with hdfs.codeC = lzop
** with the default filePrefix, uploading works normally
** with filePrefix = %\{basename}, data is mixed across files and lost (a
verification sketch follows this list)
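To quantify the loss, we count lines on HDFS and compare against the 5,271,962 local lines. The sketch below is our own hedged helper (HdfsLineCount is not a Flume tool); it assumes the hadoop-lzo codec (com.hadoop.compression.lzo.LzopCodec) is on the classpath and registered via io.compression.codecs so CompressionCodecFactory can resolve the .lzo files under the hdfs.path from the configuration above:
{code:java}
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

// Hedged verification helper (our own sketch, not part of Flume):
// counts lines across the sink's output files so the HDFS total can
// be compared with the local line count.
public class HdfsLineCount {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Resolves the codec from the file extension (e.g. .lzo), assuming
    // the lzop codec is listed in io.compression.codecs.
    CompressionCodecFactory codecs = new CompressionCodecFactory(conf);

    long total = 0;
    for (FileStatus status : fs.listStatus(new Path("/user/test/flume_test"))) {
      CompressionCodec codec = codecs.getCodec(status.getPath());
      try (BufferedReader reader = new BufferedReader(new InputStreamReader(
          codec != null ? codec.createInputStream(fs.open(status.getPath()))
                        : fs.open(status.getPath()),
          StandardCharsets.UTF_8))) {
        while (reader.readLine() != null) {
          total++;
        }
      }
    }
    System.out.println("total lines on HDFS: " + total);
  }
}
{code}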
When I shut down the Flume agent process, it is strange that flume.log prints
the correct event counts, but the amount of data actually uploaded is smaller.
The log file (flume_shutdown.log) is attached.
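A hedged guess at the mechanism behind this mismatch: with CompressedStream, event bytes sit in the codec's internal buffer until finish() is called on the compression stream, so the sink can count an event as written before its bytes are durable on HDFS. The snippet below only illustrates this codec buffering, with GzipCodec standing in for lzop so it runs without the native hadoop-lzo library; it is not the sink's actual code path:
{code:java}
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.GzipCodec;

// Illustration only: a CompressionOutputStream holds written data in
// the codec's buffer until finish() finalizes the compressed block.
// If a writer is torn down before finish()/close(), the buffered tail
// is lost even though the events were already counted as written.
public class CompressedBufferDemo {
  public static void main(String[] args) throws Exception {
    GzipCodec codec = new GzipCodec();       // stand-in for LzopCodec
    codec.setConf(new Configuration());

    ByteArrayOutputStream sink = new ByteArrayOutputStream();
    CompressionOutputStream out = codec.createOutputStream(sink);

    out.write("event payload\n".getBytes(StandardCharsets.UTF_8));
    System.out.println("bytes visible before finish(): " + sink.size());

    out.finish();                            // finalize compressed block
    out.close();
    System.out.println("bytes visible after finish():  " + sink.size());
  }
}
{code}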