Liang Zhou created FLUME-3221:
---------------------------------

             Summary: using spooling dir source and hdfs sink (hdfs.codeC = lzop), data loss found when hdfs.filePrefix = %{basename} is set
                 Key: FLUME-3221
                 URL: https://issues.apache.org/jira/browse/FLUME-3221
             Project: Flume
          Issue Type: Bug
          Components: Sinks+Sources
    Affects Versions: 1.8.0
         Environment: Java version "1.8.0_151"

Java(TM) SE Runtime Environment (build 1.8.0_151-b12)

Java HotSpot(TM) 64-Bit Server VM (build 25.151-b12, mixed mode)

Hadoop 2.6.3

lzop native lib: hadoop-lzo-0.4.20-SNAPSHOT.jar
            Reporter: Liang Zhou
             Fix For: notrack
         Attachments: flume_shutdown.log

A Flume agent configured with the following parameters reproduces the problem:

*Configuration*
{code:java}
spool_flume1.sources = spool-source-spool
spool_flume1.channels = hdfs-channel-spool
spool_flume1.sinks = hdfs-sink-spool
spool_flume1.sources.spool-source-spool.type = spooldir
spool_flume1.sources.spool-source-spool.channels = hdfs-channel-spool
spool_flume1.sources.spool-source-spool.spoolDir = /home/test/flume_log
spool_flume1.sources.spool-source-spool.recursiveDirectorySearch = true
spool_flume1.sources.spool-source-spool.fileHeader = true
spool_flume1.sources.spool-source-spool.deserializer = LINE
spool_flume1.sources.spool-source-spool.deserializer.maxLineLength = 100000000
spool_flume1.sources.spool-source-spool.inputCharset = UTF-8
spool_flume1.sources.spool-source-spool.basenameHeader = true

spool_flume1.channels.hdfs-channel-spool.type = memory
spool_flume1.channels.hdfs-channel-spool.keep-alive = 60
spool_flume1.sinks.hdfs-sink-spool.channel = hdfs-channel-spool
spool_flume1.sinks.hdfs-sink-spool.type = hdfs
spool_flume1.sinks.hdfs-sink-spool.hdfs.writeFormat = Text
spool_flume1.sinks.hdfs-sink-spool.hdfs.fileType = CompressedStream
spool_flume1.sinks.hdfs-sink-spool.hdfs.codeC = lzop
spool_flume1.sinks.hdfs-sink-spool.hdfs.threadsPoolSize = 1
spool_flume1.sinks.hdfs-sink-spool.hdfs.callTimeout = 100000
spool_flume1.sinks.hdfs-sink-spool.hdfs.idleTimeout = 36
spool_flume1.sinks.hdfs-sink-spool.hdfs.useLocalTimeStamp = true
spool_flume1.sinks.hdfs-sink-spool.hdfs.filePrefix = %{basename}
spool_flume1.sinks.hdfs-sink-spool.hdfs.path = /user/test/flume_test
spool_flume1.sinks.hdfs-sink-spool.hdfs.rollCount = 0
spool_flume1.sinks.hdfs-sink-spool.hdfs.rollSize = 134217728
spool_flume1.sinks.hdfs-sink-spool.hdfs.rollInterval = 0
spool_flume1.sources.spool-source-spool.includePattern = log.*-1_2018.*$
spool_flume1.sources.spool-source-spool.batchSize = 100
spool_flume1.channels.hdfs-channel-spool.capacity = 1000
spool_flume1.channels.hdfs-channel-spool.transactionCapacity = 100
{code}
The test data adds up to 4.2 GB, amounting to 5271962 lines.

 

The expected behavior is that data is stored on HDFS in lzop format, in files named 
%\{basename}_%\{LocalTimeStamp}.

However, in my tests the sink {color:#FF0000}mixed data from different source files into the same output files{color}, and the {color:#FF0000}total amount of data uploaded to HDFS was less than the local data{color}.

Our test cases are listed below:
 * using DataStream: uploading works normally, whether or not filePrefix = %\{basename} is set
 * using CompressedStream with hdfs.codeC = lzop:
 ** with the default filePrefix, uploading works normally
 ** with filePrefix = %\{basename}, data is mixed and lost

Oddly, when the Flume agent process is shut down, flume.log reports the correct event 
counts, yet the amount of data actually uploaded to HDFS is smaller. The log file is 
attached.
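For reference, the local and HDFS line counts were compared roughly as follows (a sketch using the paths and includePattern from the configuration above; it assumes the hadoop-lzo native libraries are on the classpath so that {{hadoop fs -text}} can decode the lzop files):

{code:java}
# Count lines in the local spool data (files matched by the includePattern)
find /home/test/flume_log -name 'log*-1_2018*' -print0 | xargs -0 cat | wc -l

# Count lines that actually landed on HDFS; `hadoop fs -text`
# decompresses the lzop output via the configured codec before printing
hadoop fs -text /user/test/flume_test/* | wc -l
{code}

In the passing cases both counts match (5271962 here); in the failing case the HDFS count is lower.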



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
