Hi Flume developers,

We are trying to use Flume in one project. Our current test scenario: on the Windows side there are 3000 files (each about 4-5 MB), and we use Flume to send them to a remote Hadoop environment. The source files total about 14 GB, and on the Hadoop side we get 80 files in bz2 format. After decompressing, the Hadoop side holds only about 13 GB in total, so roughly 1 GB is missing. There are no errors in the Flume log.
We changed the code a bit and use FileDeserializer.java, which is very similar to LineDeserializer.java, except that we commented out /* if (c == '\n') break; */ and removed the max-length limit in that code, so we expect each file to become a single event.
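For reference, the core of the change looks roughly like this (a simplified sketch, not the exact code; readFile is an illustrative rename of readLine, and everything else is assumed unchanged from LineDeserializer in Flume 1.5.0, including the 'in' ResettableInputStream and 'outputCharset' fields and mark()/reset()/close()):

// Sketch of the modified read loop inside FileDeserializer.
private String readFile() throws IOException {
  StringBuilder sb = new StringBuilder();
  int c;
  while ((c = in.readChar()) != -1) {
    /* if (c == '\n') { break; } */  // newline check commented out
    sb.append((char) c);
    // the maxLineLength cap LineDeserializer enforces here is removed,
    // so the whole file is buffered into one string
  }
  return (sb.length() > 0) ? sb.toString() : null;
}

@Override
public Event readEvent() throws IOException {
  ensureOpen();
  String body = readFile();
  return (body == null) ? null : EventBuilder.withBody(body, outputCharset);
}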
Do you have any ideas about it? Thanks in advance.
ENV: Flume 1.5.0 on Windows x64; remote Apache Hadoop 2.2 on Linux
CONF:

#agent1
agent1.sources = source1
agent1.sinks = sink1
agent1.channels = channel1

#source1
agent1.sources.source1.type = spooldir
agent1.sources.source1.spoolDir = data
agent1.sources.source1.channels = channel1
agent1.sources.source1.fileHeader = false
agent1.sources.source1.batchSize = 3000
agent1.sources.source1.deserializer = FILE

#sink1
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://c0045305.itcs.hp.com:8120/user/QA/%y-%m-%d/%H%M
agent1.sinks.sink1.hdfs.fileType = CompressedStream
agent1.sinks.sink1.hdfs.codeC = bzip2
agent1.sinks.sink1.hdfs.writeFormat = TEXT
agent1.sinks.sink1.hdfs.rollInterval = 0
agent1.sinks.sink1.hdfs.idleTimeout = 120
agent1.sinks.sink1.hdfs.rollSize = 0
agent1.sinks.sink1.hdfs.maxOpenFiles = 10000
agent1.sinks.sink1.hdfs.rollCount = 0
agent1.sinks.sink1.hdfs.batchSize = 10000
agent1.sinks.sink1.hdfs.callTimeout = 60000
agent1.sinks.sink1.hdfs.useLocalTimeStamp = true
agent1.sinks.sink1.hdfs.minBlockReplicas = 1
agent1.sinks.sink1.channel = channel1

#channel1
agent1.channels.channel1.type = file
agent1.channels.maxFileSize = 3146435071
agent1.channels.channel1.checkpointDir = data_tmp123
agent1.channels.channel1.dataDirs = dataChannels
agent1.channels.channel1.transactionCapacity = 20000
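Note on deserializer = FILE: stock Flume 1.5.0 only resolves LINE and AVRO as shorthands here and otherwise treats the value as a fully qualified EventDeserializer.Builder class name, so FILE presumably resolves through our modified build. On an unmodified Flume the equivalent setting would look something like this (package name is only illustrative):

agent1.sources.source1.deserializer = com.example.flume.FileDeserializer$Builder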
Regards,
Gary Xu