You may want to compare the input data with what is delivered, to see whether it is a case of missing lines or truncated lines. If it is a case of truncated lines, then set deserializer.maxLineLength.
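
As a rough sketch of the maxLineLength suggestion, the fragment below reuses the agent and source names from the quoted config (agent1, source1). It assumes the stock LINE deserializer, whose documented default for maxLineLength is 2048 characters. The value 65536 is only an illustrative guess, not a recommendation, and a custom deserializer such as the modified FILE one may not read this property at all.

```properties
# Assumption: stock LINE deserializer on the spooldir source.
agent1.sources.source1.deserializer = LINE
# Hypothetical limit; choose something larger than the longest line in the input.
agent1.sources.source1.deserializer.maxLineLength = 65536
```
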
For spool dir on Windows, you may need this patch: FLUME-2508

-roshan

On Wed, Dec 3, 2014 at 6:47 PM, XuGary wrote:
> Hi Flume developers
>
> We are trying to use flume in one project. Our current testing scenario
> is: on the windows side, there are 3000 files (each about 4-5MB), and we
> use flume to send them to a remote Hadoop env. Totally the source files
> are about 14GB, and we get 80 files in hadoop with bz2 format. After
> unzip we find that on the hadoop side we get about 13GB in total (so 1GB
> missed). No error in flume log.
>
> We change the code a bit and use FileDeserializer.java; it is very
> similar to LineDeserializer.java, but we comment out
> /*if (c == '\n') break;*/ and erase the max limit in this code. We think
> one file will be one event.
>
> Do you have any ideas about it? Thanks in advance.
>
> ENV: flume in window64, 1.5.0, remote apache hadoop 2.2 in linux
>
> CONF:
> #agent1
> agent1.sources=source1
> agent1.sinks=sink1
> agent1.channels=channel1
>
> #source1
> agent1.sources.source1.type=spooldir
> agent1.sources.source1.spoolDir=data
> agent1.sources.source1.channels=channel1
> agent1.sources.source1.fileHeader=false
> agent1.sources.source1.batchSize=3000
> agent1.sources.source1.deserializer=FILE
>
> #sink1
> agent1.sinks.sink1.type=hdfs
> agent1.sinks.sink1.hdfs.path=hdfs://c0045305.itcs.hp.com:8120/user/QA/%y-%m-%d/%H%M
> agent1.sinks.sink1.hdfs.fileType=CompressedStream
> agent1.sinks.sink1.hdfs.codeC=bzip2
> agent1.sinks.sink1.hdfs.writeFormat=TEXT
> agent1.sinks.sink1.hdfs.rollInterval=0
> agent1.sinks.sink1.hdfs.idleTimeout=120
> agent1.sinks.sink1.hdfs.rollSize=0
> agent1.sinks.sink1.hdfs.maxOpenFiles=10000
> agent1.sinks.sink1.hdfs.rollCount=0
> agent1.sinks.sink1.hdfs.batchSize=10000
> agent1.sinks.sink1.hdfs.callTimeout=60000
> agent1.sinks.sink1.hdfs.useLocalTimeStamp=true
> agent1.sinks.sink1.hdfs.minBlockReplicas=1
> agent1.sinks.sink1.channel=channel1
>
> #channel1
> agent1.channels.channel1.type=file
> agent1.channels.maxFileSize=3146435071
> agent1.channels.channel1.checkpointDir=data_tmp123
> agent1.channels.channel1.dataDirs=dataChannels
> agent1.channels.channel1.transactionCapacity=20000
>
> Regards,
> Gary Xu
