[ 
https://issues.apache.org/jira/browse/FLUME-3221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicho Zhou updated FLUME-3221:
------------------------------
    Description: 
A Flume agent configured with the following parameters causes this problem:

*Configuration*
{code:java}
spool_flume1.sources = spool-source-spool
spool_flume1.channels = hdfs-channel-spool
spool_flume1.sinks = hdfs-sink-spool
spool_flume1.sources.spool-source-spool.type = spooldir
spool_flume1.sources.spool-source-spool.channels = hdfs-channel-spool
spool_flume1.sources.spool-source-spool.spoolDir = /home/test/flume_log
spool_flume1.sources.spool-source-spool.recursiveDirectorySearch = true
spool_flume1.sources.spool-source-spool.fileHeader = true
spool_flume1.sources.spool-source-spool.deserializer = LINE
spool_flume1.sources.spool-source-spool.deserializer.maxLineLength = 100000000
spool_flume1.sources.spool-source-spool.inputCharset = UTF-8
spool_flume1.sources.spool-source-spool.basenameHeader = true

spool_flume1.channels.hdfs-channel-spool.type = memory
spool_flume1.channels.hdfs-channel-spool.keep-alive = 60
spool_flume1.sinks.hdfs-sink-spool.channel = hdfs-channel-spool
spool_flume1.sinks.hdfs-sink-spool.type = hdfs
spool_flume1.sinks.hdfs-sink-spool.hdfs.writeFormat = Text
spool_flume1.sinks.hdfs-sink-spool.hdfs.fileType = CompressedStream
spool_flume1.sinks.hdfs-sink-spool.hdfs.codeC = lzop
spool_flume1.sinks.hdfs-sink-spool.hdfs.threadsPoolSize = 1
spool_flume1.sinks.hdfs-sink-spool.hdfs.callTimeout = 100000
spool_flume1.sinks.hdfs-sink-spool.hdfs.idleTimeout = 36
spool_flume1.sinks.hdfs-sink-spool.hdfs.useLocalTimeStamp = true
spool_flume1.sinks.hdfs-sink-spool.hdfs.filePrefix = %{basename}
spool_flume1.sinks.hdfs-sink-spool.hdfs.path = /user/test/flume_test
spool_flume1.sinks.hdfs-sink-spool.hdfs.rollCount = 0
spool_flume1.sinks.hdfs-sink-spool.hdfs.rollSize = 134217728
spool_flume1.sinks.hdfs-sink-spool.hdfs.rollInterval = 0
spool_flume1.sources.spool-source-spool.includePattern = log.*-1_2018.*$
spool_flume1.sources.spool-source-spool.batchSize = 100
spool_flume1.channels.hdfs-channel-spool.capacity = 1000
spool_flume1.channels.hdfs-channel-spool.transactionCapacity = 100
{code}
The test data adds up to 4.2 GB, 5271962 lines in total.

The expected result is data stored in lzop format on HDFS, with files named 
%\{basename}_%\{LocalTimeStamp}.

However, in my tests I found that the sink {color:#ff0000}mixed data from different 
source files into the same output files{color}, and that the {color:#ff0000}total 
amount of uploaded data was less than the local data{color}.
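
For reference, the HDFS side can be recounted after decompression to confirm the loss. The helper below is hypothetical and not part of the agent; the path and the codec lookup are assumptions based on the configuration above:
{code:java}
// Hypothetical verification helper (assumes the HDFS path from the config
// above and that the lzop codec is registered via io.compression.codecs).
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CountHdfsLines {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    CompressionCodecFactory codecs = new CompressionCodecFactory(conf);
    long lines = 0;
    for (FileStatus st : fs.listStatus(new Path("/user/test/flume_test"))) {
      // Codec is resolved from the file extension (e.g. .lzo).
      CompressionCodec codec = codecs.getCodec(st.getPath());
      InputStream in = fs.open(st.getPath());
      if (codec != null) {
        in = codec.createInputStream(in); // decompress before counting
      }
      try (BufferedReader r = new BufferedReader(new InputStreamReader(in, "UTF-8"))) {
        while (r.readLine() != null) {
          lines++;
        }
      }
    }
    // Local side: counting lines over the spool dir gives 5271962;
    // in the failing case this prints less.
    System.out.println("HDFS line count: " + lines);
  }
}
{code}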

The test cases are listed below (see the sketch after the list):
 * Using DataStream: uploads normally, whether filePrefix = %\{basename} is set or not.
 * Using CompressedStream with hdfs.codeC = lzop:
 ** filePrefix left at its default: uploads normally.
 ** filePrefix = %\{basename}: data is mixed across files and lost.
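
A minimal sketch of the suspected interaction follows. This is an assumption on my part, not the actual Flume source: the HDFS sink resolves %\{basename} per event, so every distinct source file maps to its own target path and its own concurrently open writer, and the mixing only appears when those writers are compressed streams:
{code:java}
// Minimal sketch (assumption, not Flume source code): with
// filePrefix = %{basename}, every distinct basename header resolves to a
// distinct target path, so the sink keeps one open writer per source file.
// With DataStream those writers are independent; with CompressedStream we
// suspect shared compressor state lets events leak between the open files.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BucketPathSketch {
  public static void main(String[] args) {
    String template = "/user/test/flume_test/%{basename}";
    Map<String, List<String>> openWriters = new HashMap<>();
    String[][] events = { // {basename header, event body}
      {"log_a-1_20180101", "event 1"},
      {"log_b-1_20180102", "event 2"},
      {"log_a-1_20180101", "event 3"},
    };
    for (String[] e : events) {
      String path = template.replace("%{basename}", e[0]);
      openWriters.computeIfAbsent(path, k -> new ArrayList<>()).add(e[1]);
    }
    // Two writers stay open at once; with the default filePrefix there is
    // only ever one, which matches the "default uploads normally" case.
    openWriters.forEach((p, evts) -> System.out.println(p + " -> " + evts));
  }
}
{code}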

When I shut down the Flume agent process, oddly, flume.log prints the correct 
event counts, but the amount of data actually uploaded is smaller. The log file 
(flume_shutdown.log) is attached.
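
A quick way to quantify the gap is to compare the shutdown counters against an HDFS content summary. This is a hypothetical cross-check using the path from the configuration above:
{code:java}
// Hypothetical cross-check (path from the config above): compare what the
// shutdown counters in flume.log claim with what actually landed on HDFS.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ContentSummary;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSizeCheck {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    ContentSummary cs = fs.getContentSummary(new Path("/user/test/flume_test"));
    // The lzop files will be smaller than the 4.2 GB raw input, but a line
    // count far below 5271962 (see the counting helper above) means loss,
    // not just compression.
    System.out.println("files=" + cs.getFileCount() + " bytes=" + cs.getLength());
  }
}
{code}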



> Data loss when using spooling dir source and HDFS sink (hdfs.codeC = lzop) with 
> hdfs.filePrefix = %{basename}
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: FLUME-3221
>                 URL: https://issues.apache.org/jira/browse/FLUME-3221
>             Project: Flume
>          Issue Type: Bug
>          Components: Sinks+Sources
>    Affects Versions: 1.8.0
>         Environment: Java version "1.8.0_151"
> Java(TM) SE Runtime Environment (build 1.8.0_151-b12)
> Java HotSpot(TM) 64-Bit Server VM (build 25.151-b12, mixed mode)
> Hadoop 2.6.3
> lzop native lib: hadoop-lzo-0.4.20-SNAPSHOT.jar
>            Reporter: Nicho Zhou
>            Priority: Major
>             Fix For: notrack
>
>         Attachments: flume_shutdown.log


