Thanks Roshan. We found there was one large file of 1.2 GB, while the other files are about 3-10 MB each; after removing it, the data sizes match.

Regards,
Gary Xu

> Date: Fri, 5 Dec 2014 12:59:42 -0800
> Subject: Re: Flume Help
> From: [email protected]
> To: [email protected]
> 
> You may want to compare the input data with what is delivered. See if it is
> a case of missing lines or truncated lines.
> If it is a case of truncated lines, then set deserializer.maxLineLength
> to a larger value.
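As a sketch of Roshan's suggestion: the limit is a property of the LINE deserializer on the spooling-directory source (the agent/source names below are taken from Gary's configuration; the value shown is only illustrative, not a recommendation):

```properties
# Spooling Directory Source: raise the per-line limit so long lines
# are not truncated (the LINE deserializer's default maxLineLength is 2048).
agent1.sources.source1.deserializer = LINE
agent1.sources.source1.deserializer.maxLineLength = 1048576
```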
> 
> For spool dir on windows... you may need this patch FLUME-2508
> 
> -roshan
> 
> On Wed, Dec 3, 2014 at 6:47 PM, XuGary <[email protected]> wrote:
> 
> > Hi Flume developers,
> >
> > We are trying to use Flume in one project. Our current testing scenario
> > is: on the Windows side there are 3000 files (each about 4-5 MB), and we
> > use Flume to send them to a remote Hadoop environment. The source files
> > total about 14 GB, and we get 80 files in Hadoop in bz2 format. After
> > unzipping, we find only about 13 GB on the Hadoop side (so 1 GB is
> > missing). There are no errors in the Flume log.
> >
> > We changed the code a bit and use FileDeserializer.java, which is very
> > similar to LineDeserializer.java, but with the line
> > /* if (c == '\n') break; */ commented out and the max length limit
> > removed. We expect one file to become one event.
> >
> > Do you have any ideas about it? Thanks in advance.
> >
> > ENV: Flume 1.5.0 on Windows x64; remote Apache Hadoop 2.2 on Linux
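The "one file = one event" change described above can be sketched in plain Java. This is an illustrative stand-in, not the actual FileDeserializer.java (which would wrap Flume's ResettableInputStream and return a Flume Event); it only shows the read loop with the newline break and length cap removed:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class FileDeserializerSketch {

    // Reads the entire stream as a single payload. LineDeserializer, by
    // contrast, stops at '\n' and enforces maxLineLength; both of those
    // checks are intentionally absent here.
    static byte[] readWholeFile(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.write(chunk, 0, n);   // no '\n' break, no length cap
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "line1\nline2\n".getBytes("UTF-8");
        byte[] event = readWholeFile(new ByteArrayInputStream(data));
        System.out.println(event.length); // 12: the whole file is one event
    }
}
```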
> >
> > CONF:
> >
> > #agent1
> > agent1.sources=source1
> > agent1.sinks=sink1
> > agent1.channels=channel1
> >
> > #source1
> > agent1.sources.source1.type=spooldir
> > agent1.sources.source1.spoolDir=data
> > agent1.sources.source1.channels=channel1
> > agent1.sources.source1.fileHeader=false
> > agent1.sources.source1.batchSize=3000
> > agent1.sources.source1.deserializer=FILE
> >
> > #sink1
> > agent1.sinks.sink1.type=hdfs
> > agent1.sinks.sink1.hdfs.path=hdfs://c0045305.itcs.hp.com:8120/user/QA/%y-%m-%d/%H%M
> > agent1.sinks.sink1.hdfs.fileType=CompressedStream
> > agent1.sinks.sink1.hdfs.codeC=bzip2
> > agent1.sinks.sink1.hdfs.writeFormat=TEXT
> > agent1.sinks.sink1.hdfs.rollInterval=0
> > agent1.sinks.sink1.hdfs.idleTimeout=120
> > agent1.sinks.sink1.hdfs.rollSize=0
> > agent1.sinks.sink1.hdfs.maxOpenFiles=10000
> > agent1.sinks.sink1.hdfs.rollCount=0
> > agent1.sinks.sink1.hdfs.batchSize=10000
> > agent1.sinks.sink1.hdfs.callTimeout=60000
> > agent1.sinks.sink1.hdfs.useLocalTimeStamp=true
> > agent1.sinks.sink1.hdfs.minBlockReplicas=1
> > agent1.sinks.sink1.channel=channel1
> >
> > #channel1
> > agent1.channels.channel1.type=file
> > agent1.channels.maxFileSize=3146435071
> > agent1.channels.channel1.checkpointDir=data_tmp123
> > agent1.channels.channel1.dataDirs=dataChannels
> > agent1.channels.channel1.transactionCapacity=20000
> > Regards,
> > Gary Xu
> 