Viktor Trako created FLUME-2241:
-----------------------------------

             Summary: Spooling Directory Source doesn't handle files with large-ish event data
                 Key: FLUME-2241
                 URL: https://issues.apache.org/jira/browse/FLUME-2241
             Project: Flume
          Issue Type: Bug
    Affects Versions: v1.4.0
         Environment: Debian 6.0.5
            Reporter: Viktor Trako


I have a Flume agent set up with a spooling directory source sinking data to Cassandra.

I'm collecting web data, writing a line to the log file for each request; once the log file has been rotated, it is dropped into the spooling directory, ready for Flume to start processing it. All the data is valid JSON, as it is validated before being written to the log file.
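
Each line looks roughly like the following (field values are illustrative; the service, uuid and timestamp fields are the ones the interceptors below extract, and the real payload carries the 9-19 KB of request data):

{code:title=sample_event|borderStyle=solid}
{"service":"orion","uuid":"1b4e28ba-2fa1-41d2-883f-0016d3cca427","timestamp":"1385078400000","payload":"...bulk of the request data..."}
{code}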

Sending a mixture of requests of different sizes, from 9 to 15 KB, seems fine: I generated a log file of over 400 MB and it all reached the sink correctly.

I'm currently logging a 19 KB request, and this is when things start to break. It only gets as far as the 1800th request in the file, and the next one is truncated.

I changed the sink to a file-roll sink and it only gets as far as 29 MB.
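
For that test I swapped the Cassandra sink for something like the following (the sink name and output directory are illustrative):

{code:title=file_roll_test|borderStyle=solid}
orion.sinks = fileRoll
orion.sinks.fileRoll.type = file_roll
orion.sinks.fileRoll.sink.directory = /var/log/orion/flumeRollout
orion.sinks.fileRoll.channel = fileChannel
{code}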

I have profiled it and it's not running out of memory. I'd like to know whether there are any limitations on the spooling directory source.

Has anyone tried dropping in a file with similarly large requests and experienced a similar issue?
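
If it helps with reproducing, here is a minimal sketch of a generator for a comparable test file (the file name, payload field, and sizes are illustrative; the service, uuid and timestamp fields match the interceptor regexes in the config below):

{code:title=gen_large_events.py|borderStyle=solid}
#!/usr/bin/env python
# Generate a spool-style log file of large JSON events, one per line.
import json
import time
import uuid

TARGET_LINE_BYTES = 19 * 1024  # roughly the size at which truncation appears
NUM_EVENTS = 5000              # well past the ~1800 mark where it breaks
COMPACT = (",", ":")           # no spaces, matching the regexes below

with open("large_events.log", "w") as out:
    for _ in range(NUM_EVENTS):
        event = {
            "service": "orion",
            "uuid": str(uuid.uuid4()),
            "timestamp": str(int(time.time() * 1000)),
            "payload": "",
        }
        # Pad the payload so each serialized line plus newline hits the target size.
        padding = TARGET_LINE_BYTES - len(json.dumps(event, separators=COMPACT)) - 1
        event["payload"] = "x" * padding
        out.write(json.dumps(event, separators=COMPACT) + "\n")
{code}

The finished file can then be moved into /var/log/orion/flumeSpooling for the source to pick up.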

Any pointers would be greatly appreciated. My Flume config is as follows:

{code:title=flume_conf|borderStyle=solid}
orion.sources = spoolDir
orion.channels = fileChannel
orion.sinks = cassandra

# File channel buffering events between the spooling source and the sink
orion.channels.fileChannel.type = file
orion.channels.fileChannel.capacity = 1000000
orion.channels.fileChannel.transactionCapacity = 100
orion.channels.fileChannel.keep-alive = 60
orion.channels.fileChannel.write-timeout = 60

# Third-party Cassandra sink
orion.sinks.cassandra.type = com.btoddb.flume.sinks.cassandra.CassandraSink
orion.sinks.cassandra.hosts = <cluster node ip>
orion.sinks.cassandra.cluster_name = fake_cluster
orion.sinks.cassandra.port = 9160
orion.sinks.cassandra.keyspace-name = Keysp
orion.sinks.cassandra.records-colfam = <table>

# Spooling directory source watching the rotated log files
orion.sources.spoolDir.type = spooldir
orion.sources.spoolDir.spoolDir = /var/log/orion/flumeSpooling
orion.sources.spoolDir.deserializer = LINE
orion.sources.spoolDir.inputCharset = UTF-8
# Raised well above the 19 KB events (the LINE deserializer default is 2048)
orion.sources.spoolDir.deserializer.maxLineLength = 20000000
orion.sources.spoolDir.deletePolicy = never
orion.sources.spoolDir.batchSize = 100
orion.sources.spoolDir.interceptors = addSrc addHost addTimestamp addUUID

# Regex interceptors copy fields from the JSON line into event headers
orion.sources.spoolDir.interceptors.addSrc.type = regex_extractor
orion.sources.spoolDir.interceptors.addSrc.regex = \"service\"\:\"([^"]*)
orion.sources.spoolDir.interceptors.addSrc.serializers = s1
orion.sources.spoolDir.interceptors.addSrc.serializers.s1.name = src

orion.sources.spoolDir.interceptors.addUUID.type = regex_extractor
orion.sources.spoolDir.interceptors.addUUID.regex = \"uuid\"\:\"([^"]*)
orion.sources.spoolDir.interceptors.addUUID.serializers = s1
orion.sources.spoolDir.interceptors.addUUID.serializers.s1.name = key

orion.sources.spoolDir.interceptors.addHost.type = org.apache.flume.interceptor.HostInterceptor$Builder
orion.sources.spoolDir.interceptors.addHost.preserveExisting = false
orion.sources.spoolDir.interceptors.addHost.useIP = true
orion.sources.spoolDir.interceptors.addHost.hostHeader = host

orion.sources.spoolDir.interceptors.addTimestamp.type = regex_extractor
orion.sources.spoolDir.interceptors.addTimestamp.regex = \"timestamp\"\:\"([^"]*)
orion.sources.spoolDir.interceptors.addTimestamp.serializers = s1
orion.sources.spoolDir.interceptors.addTimestamp.serializers.s1.name = timestamp

orion.sources.spoolDir.channels = fileChannel
orion.sinks.cassandra.channel = fileChannel
{code}
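
The agent is launched with the standard flume-ng script (the config file path is illustrative):

{code:title=launch|borderStyle=solid}
bin/flume-ng agent --conf conf --conf-file conf/flume_conf.properties --name orion
{code}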

Is this potentially a bug? If it hasn't been tried, can someone attempt to recreate it? I expect the same error would occur.

Don't hesitate to contact me for further info.

Viktor


