Praveen created FLUME-2364:
------------------------------

             Summary: netcat source and HDFS sink. Performance problem
                 Key: FLUME-2364
                 URL: https://issues.apache.org/jira/browse/FLUME-2364
             Project: Flume
          Issue Type: Test
          Components: Configuration
            Reporter: Praveen


1. We have a CSV file of about 1 GB.
2. We stored it in HDFS using hadoop fs -put. It took about 10 seconds.
3. We then tried Flume 1.2 with a netcat source and an HDFS sink and hit a serious 
performance problem: it takes about 20 minutes to store the file. Also, the HDFS sink 
does not store it as a single file. It creates many files of about 2 MB each.
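
For scale, here is a rough calculation from the numbers above. It is only a sketch: every figure is the approximate value quoted in this report, and the variable names are ours.

```python
# Rough throughput comparison from the approximate figures in this report.
FILE_SIZE_MB = 1024        # ~1 GB CSV file
PUT_SECONDS = 10           # time taken by hadoop fs -put
FLUME_SECONDS = 20 * 60    # time taken by Flume (netcat source -> HDFS sink)
OUTPUT_FILE_MB = 2         # approximate size of each file the sink produced

put_rate = FILE_SIZE_MB / PUT_SECONDS        # MB/s via hadoop fs -put
flume_rate = FILE_SIZE_MB / FLUME_SECONDS    # MB/s via Flume
slowdown = FLUME_SECONDS / PUT_SECONDS       # how many times slower Flume was
file_count = FILE_SIZE_MB // OUTPUT_FILE_MB  # approximate number of output files

print(f"hadoop fs -put: {put_rate:.1f} MB/s")
print(f"Flume:          {flume_rate:.2f} MB/s")
print(f"slowdown:       {slowdown:.0f}x")
print(f"output files:   ~{file_count}")
```

So Flume is roughly two orders of magnitude slower here and leaves several hundred small files instead of one.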

Our goal is: 
1. Send CSV files to HDFS: we send the file a1.csv to Flume and get a1.csv in HDFS.
2. We send these files one by one.
3. We want the HDFS sink to close each file after it has been received.
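
To make the "send one file" step concrete, here is a minimal sketch of a client that streams a file over TCP to a listener such as the netcat source. It is an illustration only and may not match the actual client: send_file and chunk_size are hypothetical names, and the host and port in the usage comment are simply the values from the configuration in this report.

```python
import socket


def send_file(path, host, port, chunk_size=64 * 1024):
    """Stream a file to a TCP listener and return the number of bytes sent.

    The bytes are sent unchanged, so the receiver sees the file's own
    newline-terminated lines.
    """
    sent = 0
    with socket.create_connection((host, port)) as conn, open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            conn.sendall(chunk)  # blocks until the whole chunk is handed to the OS
            sent += len(chunk)
    return sent


# Example call (values taken from the configuration in this report):
# send_file("a1.csv", "10.66.48.23", 6969)
```

Closing the connection is the only end-of-file signal such a client gives, which is why we want the sink to close the HDFS file once the source goes quiet.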

Here is our configuration:

httpptpt.sources = httpptpt_src
httpptpt.channels = httpptpt_channel
httpptpt.sinks = httpptpt_sink

# sources
httpptpt.sources.httpptpt_src.type = netcat
httpptpt.sources.httpptpt_src.bind = 10.66.48.23
httpptpt.sources.httpptpt_src.port = 6969
httpptpt.sources.httpptpt_src.ack-every-event = false
#default size is 512B
#httpptpt.sources.httpptpt_src.max-line-length = 4096 
httpptpt.sources.httpptpt_src.channels = httpptpt_channel

# channel
httpptpt.channels.httpptpt_channel.type = memory
#It seems we don't understand how this works :( With the default values
#(capacity = 100, transactionCapacity = 100) it doesn't work: the memory
#channel has no room to store the incoming lines.
#httpptpt.channels.httpptpt_channel.capacity = 100000
#httpptpt.channels.httpptpt_channel.transactionCapacity = 1000
#Default is 3 sec
#httpptpt.channels.httpptpt_channel.keep-alive = 1 

# sink
httpptpt.sinks.httpptpt_sink.channel = httpptpt_channel
httpptpt.sinks.httpptpt_sink.type = hdfs
httpptpt.sinks.httpptpt_sink.hdfs.path = hdfs://10.66.48.23/user/httpptpt/
httpptpt.sinks.httpptpt_sink.hdfs.fileType = DataStream
httpptpt.sinks.httpptpt_sink.hdfs.writeFormat = Writable
httpptpt.sinks.httpptpt_sink.hdfs.filePrefix = httpptpt
httpptpt.sinks.httpptpt_sink.hdfs.threadsPoolSize = 10
#We want the HDFS sink to roll the temp file after the source stops emitting lines
#httpptpt.sinks.httpptpt_sink.hdfs.rollSize = 10485760000 
httpptpt.sinks.httpptpt_sink.hdfs.rollSize = 0
#httpptpt.sinks.httpptpt_sink.hdfs.rollCount = 6000000
httpptpt.sinks.httpptpt_sink.hdfs.rollCount = 0
httpptpt.sinks.httpptpt_sink.hdfs.rollInterval = 0
#??? If the source emits no messages for 10 seconds, roll the file
httpptpt.sinks.httpptpt_sink.hdfs.idleTimeout = 10

What are we doing wrong?



--
This message was sent by Atlassian JIRA
(v6.2#6252)
