[
https://issues.apache.org/jira/browse/FLUME-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13829109#comment-13829109
]
Viktor Trako edited comment on FLUME-2241 at 11/21/13 5:22 PM:
---------------------------------------------------------------
I can see that the same problem will occur with every character that UTF-8 encodes as 2 bytes:
||Unicode||Character||UTF-8 (hex)||
|U+0080| | c2 80|
|U+0081| | c2 81|
|U+0082| | c2 82|
|U+0083| | c2 83|
|U+0084| | c2 84|
|U+0085| | c2 85|
|U+0086| | c2 86|
|U+0087| | c2 87|
|U+0088| | c2 88|
|U+0089| | c2 89|
|U+008A| | c2 8a|
|U+008B| | c2 8b|
|U+008C| | c2 8c|
|U+008D| | c2 8d|
|U+008E| | c2 8e|
|U+008F| | c2 8f|
|U+0090| | c2 90|
|U+0091| | c2 91|
|U+0092| | c2 92|
|U+0093| | c2 93|
|U+0094| | c2 94|
|U+0095| | c2 95|
|U+0096| | c2 96|
|U+0097| | c2 97|
|U+0098| | c2 98|
|U+0099| | c2 99|
|U+009A| | c2 9a|
|U+009B| | c2 9b|
|U+009C| | c2 9c|
|U+009D| | c2 9d|
|U+009E| | c2 9e|
|U+009F| | c2 9f|
|U+00A0| | c2 a0|
|U+00A1| ¡| c2 a1|
|U+00A2| ¢| c2 a2|
|U+00A3| £| c2 a3|
|U+00A4| ¤| c2 a4|
|U+00A5| ¥| c2 a5|
|U+00A6| ¦| c2 a6|
|U+00A7| §| c2 a7|
|U+00A8| ¨| c2 a8|
|U+00A9| ©| c2 a9|
|U+00AA| ª| c2 aa|
|U+00AB| «| c2 ab|
|U+00AC| ¬| c2 ac|
|U+00AD| | c2 ad|
|U+00AE| ®| c2 ae|
|U+00AF| ¯| c2 af|
|U+00B0| °| c2 b0|
|U+00B1| ±| c2 b1|
|U+00B2| ²| c2 b2|
|U+00B3| ³| c2 b3|
|U+00B4| ´| c2 b4|
|U+00B5| µ| c2 b5|
|U+00B6| ¶| c2 b6|
|U+00B7| ·| c2 b7|
|U+00B8| ¸| c2 b8|
|U+00B9| ¹| c2 b9|
|U+00BA| º| c2 ba|
|U+00BB| »| c2 bb|
|U+00BC| ¼| c2 bc|
|U+00BD| ½| c2 bd|
|U+00BE| ¾| c2 be|
|U+00BF| ¿| c2 bf|
|U+00C0| À| c3 80|
|U+00C1| Á| c3 81|
|U+00C2| Â| c3 82|
|U+00C3| Ã| c3 83|
|U+00C4| Ä| c3 84|
|U+00C5| Å| c3 85|
|U+00C6| Æ| c3 86|
|U+00C7| Ç| c3 87|
|U+00C8| È| c3 88|
|U+00C9| É| c3 89|
|U+00CA| Ê| c3 8a|
|U+00CB| Ë| c3 8b|
|U+00CC| Ì| c3 8c|
|U+00CD| Í| c3 8d|
|U+00CE| Î| c3 8e|
|U+00CF| Ï| c3 8f|
|U+00D0| Ð| c3 90|
|U+00D1| Ñ| c3 91|
|U+00D2| Ò| c3 92|
|U+00D3| Ó| c3 93|
|U+00D4| Ô| c3 94|
|U+00D5| Õ| c3 95|
|U+00D6| Ö| c3 96|
|U+00D7| ×| c3 97|
|U+00D8| Ø| c3 98|
|U+00D9| Ù| c3 99|
|U+00DA| Ú| c3 9a|
|U+00DB| Û| c3 9b|
|U+00DC| Ü| c3 9c|
|U+00DD| Ý| c3 9d|
|U+00DE| Þ| c3 9e|
|U+00DF| ß| c3 9f|
|U+00E0| à| c3 a0|
|U+00E1| á| c3 a1|
|U+00E2| â| c3 a2|
|U+00E3| ã| c3 a3|
|U+00E4| ä| c3 a4|
|U+00E5| å| c3 a5|
|U+00E6| æ| c3 a6|
|U+00E7| ç| c3 a7|
|U+00E8| è| c3 a8|
|U+00E9| é| c3 a9|
|U+00EA| ê| c3 aa|
|U+00EB| ë| c3 ab|
|U+00EC| ì| c3 ac|
|U+00ED| í| c3 ad|
|U+00EE| î| c3 ae|
|U+00EF| ï| c3 af|
|U+00F0| ð| c3 b0|
|U+00F1| ñ| c3 b1|
|U+00F2| ò| c3 b2|
|U+00F3| ó| c3 b3|
|U+00F4| ô| c3 b4|
|U+00F5| õ| c3 b5|
|U+00F6| ö| c3 b6|
|U+00F7| ÷| c3 b7|
|U+00F8| ø| c3 b8|
|U+00F9| ù| c3 b9|
|U+00FA| ú| c3 ba|
|U+00FB| û| c3 bb|
|U+00FC| ü| c3 bc|
|U+00FD| ý| c3 bd|
|U+00FE| þ| c3 be|
|U+00FF| ÿ| c3 bf|
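For anyone who wants to double-check the table, here is a minimal sketch that regenerates the hex column. The helper names `utf8_hex` and `two_byte_rows` are made up for illustration and are not part of Flume. Note that the table covers only U+0080 to U+00FF; UTF-8 uses 2 bytes for every code point up to U+07FF.

```python
def utf8_hex(code_point):
    """Return the UTF-8 encoding of a code point as space-separated hex bytes."""
    return " ".join(f"{b:02x}" for b in chr(code_point).encode("utf-8"))


def two_byte_rows(start=0x80, end=0xFF):
    """Build (code point label, UTF-8 hex) rows for the inclusive range."""
    return [(f"U+{cp:04X}", utf8_hex(cp)) for cp in range(start, end + 1)]


if __name__ == "__main__":
    # Print the same two columns as the table above (without the glyph column).
    for label, hex_bytes in two_byte_rows():
        print(label, hex_bytes)
```

Every row in that range comes out as exactly two bytes, with a lead byte of c2 or c3.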
> Spooling Directory Source doesn't handle files with large-ish event data
> ------------------------------------------------------------------------
>
> Key: FLUME-2241
> URL: https://issues.apache.org/jira/browse/FLUME-2241
> Project: Flume
> Issue Type: Bug
> Affects Versions: v1.4.0
> Environment: Debian 6.0.5
> Reporter: Viktor Trako
>
> I have a Flume agent set up with a spooling directory source that sinks data
> to Cassandra.
> I'm collecting web data by writing one line to the log file for each request.
> Once the log file has been rotated, it is dropped into the spooling directory,
> ready for Flume to process. All data is valid JSON, as it is validated before
> being written to the log file.
> Sending a mixture of different-sized requests, from 9k to 15k, seems fine:
> I generated a log file of over 400 MB and all of it reached the sink correctly.
> I'm currently logging a 19k request, and this is when things start to break.
> Processing only gets as far as the 1800th request in the file, and the next
> one is truncated.
> I changed the sink to a file-roll sink, and it only gets as far as 29 MB.
> I have profiled the agent and it is not running out of memory. I want to know
> whether there are any limitations on the spooling directory source.
> Has anyone tried dropping in a file with similarly large requests and
> experienced a similar issue?
> Any pointers would be greatly appreciated. My Flume config is as follows:
> {code:title=flume_conf|borderStyle=solid}
> orion.sources = spoolDir
> orion.channels = fileChannel
> orion.sinks = cassandra
> orion.channels.fileChannel.type = file
> orion.channels.fileChannel.capacity = 1000000
> orion.channels.fileChannel.transactionCapacity = 100
> orion.channels.fileChannel.keep-alive = 60
> orion.channels.fileChannel.write-timeout = 60
> orion.sinks.cassandra.type = com.btoddb.flume.sinks.cassandra.CassandraSink
> orion.sinks.cassandra.hosts = <cluster node ip>
> orion.sinks.cassandra.cluster_name = fake_cluster
> orion.sinks.cassandra.port = 9160
> orion.sinks.cassandra.keyspace-name = Keysp
> orion.sinks.cassandra.records-colfam = <table>
> orion.sources.spoolDir.type = spooldir
> orion.sources.spoolDir.spoolDir = /var/log/orion/flumeSpooling
> orion.sources.spoolDir.deserializer = LINE
> orion.sources.spoolDir.inputCharset = UTF-8
> orion.sources.spoolDir.deserializer.maxLineLength = 20000000
> orion.sources.spoolDir.deletePolicy = never
> orion.sources.spoolDir.batchSize = 100
> orion.sources.spoolDir.interceptors = addSrc addHost addTimestamp addUUID
> orion.sources.spoolDir.interceptors.addSrc.type = regex_extractor
> orion.sources.spoolDir.interceptors.addSrc.regex = \"service\"\:\"([^"]*)
> orion.sources.spoolDir.interceptors.addSrc.serializers = s1
> orion.sources.spoolDir.interceptors.addSrc.serializers.s1.name = src
> orion.sources.spoolDir.interceptors.addUUID.type = regex_extractor
> orion.sources.spoolDir.interceptors.addUUID.regex = \"uuid\"\:\"([^"]*)
> orion.sources.spoolDir.interceptors.addUUID.serializers = s1
> orion.sources.spoolDir.interceptors.addUUID.serializers.s1.name = key
> orion.sources.spoolDir.interceptors.addHost.type = org.apache.flume.interceptor.HostInterceptor$Builder
> orion.sources.spoolDir.interceptors.addHost.preserveExisting = false
> orion.sources.spoolDir.interceptors.addHost.useIP = true
> orion.sources.spoolDir.interceptors.addHost.hostHeader = host
> orion.sources.spoolDir.interceptors.addTimestamp.type = regex_extractor
> orion.sources.spoolDir.interceptors.addTimestamp.regex = \"timestamp\"\:\"([^"]*)
> orion.sources.spoolDir.interceptors.addTimestamp.serializers = s1
> orion.sources.spoolDir.interceptors.addTimestamp.serializers.s1.name = timestamp
> orion.sources.spoolDir.channels = fileChannel
> orion.sinks.cassandra.channel = fileChannel
> {code}
> Is this potentially a bug? If nobody has tried this yet, could someone try to
> recreate it? I would expect the same error to occur.
> Don't hesitate to contact me for further info.
> Viktor
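To help with recreating this, below is a rough sketch of a generator for a test file to drop into the spooling directory. It rests on several assumptions: the field names `service`, `uuid` and `timestamp` are taken from the interceptor regexes in the config above, while the `payload` field, the sample values, the function names and the output path are all hypothetical. Each line is padded with a 2-byte character so that the file exercises the multi-byte case from the comment.

```python
import json
import uuid


def make_request_line(payload_chars, filler="é"):
    """Build one JSON log line padded with a 2-byte UTF-8 character.

    The "payload" field and all values here are illustrative assumptions.
    """
    record = {
        "service": "example",
        "uuid": str(uuid.uuid4()),
        "timestamp": "2013-11-21T17:22:00Z",
        "payload": filler * payload_chars,
    }
    # Compact separators keep '"service":"' free of spaces, so the
    # regex_extractor patterns in the config above can still match.
    # ensure_ascii=False keeps the filler as raw UTF-8 instead of \uXXXX.
    return json.dumps(record, ensure_ascii=False, separators=(",", ":"))


def write_test_file(path, lines, payload_chars):
    """Write `lines` JSON records, one per line, encoded as UTF-8."""
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for _ in range(lines):
            handle.write(make_request_line(payload_chars) + "\n")


if __name__ == "__main__":
    # Hypothetical path; 9500 two-byte characters give roughly 19k bytes per line.
    write_test_file("/tmp/large_requests.log", 2000, 9500)
```

Generate the file outside the spooling directory and move it in once it is complete, so the source does not pick up a file that is still being written.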
--
This message was sent by Atlassian JIRA
(v6.1#6144)