Thanks, Roshan. We found one large file of 1.2 GB, while the other files are about 3-10 MB each; after removing it, the data sizes match.

Regards,
Gary Xu
> Date: Fri, 5 Dec 2014 12:59:42 -0800
> Subject: Re: Flume Help
> From: [email protected]
> To: [email protected]
>
> You may want to compare the input data with what is delivered. See if it
> is a case of missing lines or truncated lines.
> If it is a case of truncated lines, then set deserializer.maxLineLength.
>
> For spool dir on Windows, you may need the patch from FLUME-2508.
>
> -roshan
>
> On Wed, Dec 3, 2014 at 6:47 PM, XuGary <[email protected]> wrote:
>
> > Hi Flume developers,
> >
> > We are trying to use Flume in one project. Our current testing scenario
> > is: on the Windows side there are 3000 files (each about 4-5 MB), and we
> > use Flume to send them to a remote Hadoop environment. The source files
> > total about 14 GB, and we get 80 files in Hadoop in bz2 format. After
> > unzipping, we find only about 13 GB on the Hadoop side (so 1 GB is
> > missing), with no errors in the Flume log.
> >
> > We changed the code a bit and use FileDeserializer.java, which is very
> > similar to LineDeserializer.java, but comments out
> > /* if (c == '\n') break; */ and removes the max-length limit in this
> > code, so that one file becomes one event.
> >
> > Do you have any ideas about it? Thanks in advance.
> > ENV: Flume 1.5.0 on Windows x64; remote Apache Hadoop 2.2 on Linux
> >
> > CONF:
> > # agent1
> > agent1.sources=source1
> > agent1.sinks=sink1
> > agent1.channels=channel1
> >
> > # source1
> > agent1.sources.source1.type=spooldir
> > agent1.sources.source1.spoolDir=data
> > agent1.sources.source1.channels=channel1
> > agent1.sources.source1.fileHeader=false
> > agent1.sources.source1.batchSize=3000
> > agent1.sources.source1.deserializer=FILE
> >
> > # sink1
> > agent1.sinks.sink1.type=hdfs
> > agent1.sinks.sink1.hdfs.path=hdfs://c0045305.itcs.hp.com:8120/user/QA/%y-%m-%d/%H%M
> > agent1.sinks.sink1.hdfs.fileType=CompressedStream
> > agent1.sinks.sink1.hdfs.codeC=bzip2
> > agent1.sinks.sink1.hdfs.writeFormat=TEXT
> > agent1.sinks.sink1.hdfs.rollInterval=0
> > agent1.sinks.sink1.hdfs.idleTimeout=120
> > agent1.sinks.sink1.hdfs.rollSize=0
> > agent1.sinks.sink1.hdfs.maxOpenFiles=10000
> > agent1.sinks.sink1.hdfs.rollCount=0
> > agent1.sinks.sink1.hdfs.batchSize=10000
> > agent1.sinks.sink1.hdfs.callTimeout=60000
> > agent1.sinks.sink1.hdfs.useLocalTimeStamp=true
> > agent1.sinks.sink1.hdfs.minBlockReplicas=1
> > agent1.sinks.sink1.channel=channel1
> >
> > # channel1
> > agent1.channels.channel1.type=file
> > agent1.channels.maxFileSize=3146435071
> > agent1.channels.channel1.checkpointDir=data_tmp123
> > agent1.channels.channel1.dataDirs=dataChannels
> > agent1.channels.channel1.transactionCapacity=20000
> >
> > Regards,
> > Gary Xu
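For reference, Roshan's deserializer.maxLineLength suggestion applies to the spooling-directory source's stock LINE deserializer (whose per-line limit defaults to 2048 characters). A minimal sketch, reusing the agent and source names from Gary's config; the 65536 value is illustrative, not from the thread:

```properties
# Use the stock LINE deserializer but raise its per-line limit
# so long lines are not silently truncated at the default 2048.
agent1.sources.source1.deserializer=LINE
agent1.sources.source1.deserializer.maxLineLength=65536
```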
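Roshan's first suggestion, comparing the input with what was delivered, is how Gary found the oversized file. A minimal sketch of that check in plain Java, summing file sizes under a directory so the spool-dir input total can be compared against the (decompressed, locally copied) HDFS output; the class name and paths are illustrative, not from the thread:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class ByteCount {
    // Sum the sizes of all regular files under a directory (recursively).
    static long totalBytes(Path dir) throws IOException {
        try (Stream<Path> files = Files.walk(dir)) {
            return files.filter(Files::isRegularFile)
                        .mapToLong(p -> p.toFile().length())
                        .sum();
        }
    }

    public static void main(String[] args) throws IOException {
        Path source = Paths.get(args[0]);    // local spool directory
        Path delivered = Paths.get(args[1]); // decompressed output, copied locally
        long in = totalBytes(source);
        long out = totalBytes(delivered);
        System.out.printf("input=%d delivered=%d diff=%d%n", in, out, in - out);
    }
}
```

A nonzero diff localizes the loss to specific bytes; running the same sum per-file (rather than per-directory) would have surfaced the single 1.2 GB outlier directly.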
