Hi Flume developers,

We are trying to use Flume in one project. In our current testing scenario, on the Windows side there are 3000 files (each about 4-5 MB), and we use Flume to send them to a remote Hadoop environment. The source files total about 14 GB, and we get 80 files in Hadoop in bz2 format. After decompressing them, we find only about 13 GB on the Hadoop side (so about 1 GB is missing). There is no error in the Flume log.
log.     We change the code a bit, use FileDeserializer.java , it is very 
similar with LineDeserializer.java;but  comment/*if (c == '\n')  break;*/and 
erase the max limit in this code. We think one file will be one event
Do you have any ideas about it? Thanks in advance.
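For reference, the whole-file read we describe can be sketched roughly like this. This is a standalone sketch of the core loop only, without the Flume EventDeserializer interface or our actual class; the class and method names here are illustrative:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class WholeFileReader {
    /**
     * Read the entire stream into one byte array, so that a whole file
     * can map to a single event body: no per-line splitting on '\n'
     * and no maximum-length cap.
     */
    public static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }
}
```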
ENV: Flume 1.5.0 on Windows 64-bit; remote Apache Hadoop 2.2 on Linux
CONF:
#agent1
agent1.sources=source1
agent1.sinks=sink1
agent1.channels=channel1

#source1
agent1.sources.source1.type=spooldir
agent1.sources.source1.spoolDir=data
agent1.sources.source1.channels=channel1
agent1.sources.source1.fileHeader=false
agent1.sources.source1.batchSize=3000
agent1.sources.source1.deserializer=FILE

#sink1
agent1.sinks.sink1.type=hdfs
agent1.sinks.sink1.hdfs.path=hdfs://c0045305.itcs.hp.com:8120/user/QA/%y-%m-%d/%H%M
agent1.sinks.sink1.hdfs.fileType=CompressedStream
agent1.sinks.sink1.hdfs.codeC=bzip2
agent1.sinks.sink1.hdfs.writeFormat=TEXT
agent1.sinks.sink1.hdfs.rollInterval=0
agent1.sinks.sink1.hdfs.idleTimeout=120
agent1.sinks.sink1.hdfs.rollSize=0
agent1.sinks.sink1.hdfs.maxOpenFiles=10000
agent1.sinks.sink1.hdfs.rollCount=0
agent1.sinks.sink1.hdfs.batchSize=10000
agent1.sinks.sink1.hdfs.callTimeout=60000
agent1.sinks.sink1.hdfs.useLocalTimeStamp=true
agent1.sinks.sink1.hdfs.minBlockReplicas=1
agent1.sinks.sink1.channel=channel1

#channel1
agent1.channels.channel1.type=file
agent1.channels.maxFileSize=3146435071
agent1.channels.channel1.checkpointDir=data_tmp123
agent1.channels.channel1.dataDirs=dataChannels
agent1.channels.channel1.transactionCapacity=20000
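To measure exactly how much is missing, we compare the total bytes in the Windows spool directory against the total of the decompressed output pulled back from HDFS. A minimal helper we use for the local side looks like the following (a hypothetical utility, not part of Flume):

```java
import java.io.File;

public class ByteCounter {
    /**
     * Recursively sum the sizes of all regular files under dir.
     * Running this on the spool directory gives the source-side total
     * to compare against the decompressed HDFS output.
     */
    public static long totalBytes(File dir) {
        long total = 0;
        File[] entries = dir.listFiles();
        if (entries == null) {
            return 0; // not a directory, or not readable
        }
        for (File f : entries) {
            total += f.isDirectory() ? totalBytes(f) : f.length();
        }
        return total;
    }
}
```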
Regards,
Gary Xu
