[
https://issues.apache.org/jira/browse/FLUME-2760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14952319#comment-14952319
]
Jeff Holoman commented on FLUME-2760:
-------------------------------------
[~mqchrisw] Did you manage to get this sorted out?
> Flafka configuration seems unclear and I get either serialized data or a
> serialized schema in the HDFS file, but never both.
> -------------------------------------------------------------------------------------------------------
>
> Key: FLUME-2760
> URL: https://issues.apache.org/jira/browse/FLUME-2760
> Project: Flume
> Issue Type: Question
> Components: Sinks+Sources
> Affects Versions: v1.6.0
> Environment: Redhat
> Reporter: Chris Weaver
>
> I am attempting to pull data from a Confluent (Kafka) stream, where the
> messages are Snappy-compressed Avro byte[], and push it to an HDFS file
> containing the Avro data together with its schema.
>
> My first attempt pulled the data and created files with the serialized
> Avro data, but with no schema at the top of the file:
> tier1.sources = source1
> tier1.channels = channel1
> tier1.sinks = sink1
> tier1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
> tier1.sources.source1.zookeeperConnect = zk:2181
> tier1.sources.source1.topic = mytopicX
> tier1.sources.source1.groupId = mygroupid
> tier1.sources.source1.batchSize = 100
> tier1.sources.source1.batchDurationMillis = 1000
> tier1.sources.source1.kafka.consumer.timeout.ms = 100
> tier1.sources.source1.auto.commit.enable = false
> tier1.sources.source1.channels = channel1
>
> tier1.channels.channel1.type = memory
> tier1.channels.channel1.capacity = 10000
> tier1.channels.channel1.transactionCapacity = 10000
>
> tier1.sinks.sink1.type = hdfs
> tier1.sinks.sink1.hdfs.path = /some/where/useful
> tier1.sinks.sink1.hdfs.rollInterval = 300
> tier1.sinks.sink1.hdfs.rollSize = 134217728
> tier1.sinks.sink1.hdfs.rollCount = 0
> tier1.sinks.sink1.hdfs.fileType = DataStream
> tier1.sinks.sink1.hdfs.writeFormat = Text
> tier1.sinks.sink1.channel = channel1
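>
> For reference, a sink variant along the following lines is what I would
> expect to produce an Avro container file with the schema embedded (a sketch
> only, untested; it assumes Flume's bundled
> org.apache.flume.sink.hdfs.AvroEventSerializer, which as I understand it
> reads the writer schema from a flume.avro.schema.literal or
> flume.avro.schema.url event header):
> tier1.sinks.sink1.type = hdfs
> tier1.sinks.sink1.hdfs.path = /some/where/useful
> tier1.sinks.sink1.hdfs.fileType = DataStream
> # Note: no hdfs.writeFormat here; the serializer controls the on-disk format
> tier1.sinks.sink1.serializer = org.apache.flume.sink.hdfs.AvroEventSerializer$Builder
> tier1.sinks.sink1.channel = channel1
>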
> <tried many different configs and finally created a custom serializer>
> On the Nth attempt I was able to create the file with the serialized schema
> at the top, but with no data in the file. The logs contained errors
> indicating that the schema I provided could not be unioned with the data
> messages. The only difference was in the sink config:
> tier1.sinks.sink1.type = hdfs
> tier1.sinks.sink1.hdfs.path = /some/where/useful
> tier1.sinks.sink1.hdfs.rollInterval = 300
> tier1.sinks.sink1.hdfs.rollSize = 134217728
> tier1.sinks.sink1.hdfs.rollCount = 0
> tier1.sinks.sink1.hdfs.fileType = DataStream
> tier1.sinks.sink1.hdfs.writeFormat = Text
> tier1.sinks.sink1.serializer = MySerializer$Builder  # loaded onto the Flume machine
> tier1.sinks.sink1.channel = channel1
> The error given in the logs with this configuration was:
> Caused by: org.apache.avro.UnresolvedUnionException: Not in union [ my very
> large schema goes here ]: [Event headers = {topic=mytopicX,
> key=df84dcd6-a801-477c-a0d8-aa7e526b672d, timestamp=1438883746194},
> body.length = 383 ]
> Is there a config or method I am missing for getting the schema to be part
> of the output HDFS file? With respect to the custom serializer, I am simply
> extending AbstractAvroEventSerializer<Event> and really just supplying the
> schema for my data.
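> I suspect the <Event> type parameter is part of the problem: convert() ends
> up handing the Flume Event itself to Avro, which would explain the
> UnresolvedUnionException above. Below is a minimal sketch of the kind of
> serializer I mean, except that it decodes the Avro-encoded body back into a
> GenericRecord instead (the class name and schema are placeholders, and the
> overridden methods are the ones AbstractAvroEventSerializer declares):
>
> import java.io.IOException;
> import java.io.OutputStream;
> import org.apache.avro.Schema;
> import org.apache.avro.generic.GenericDatumReader;
> import org.apache.avro.generic.GenericRecord;
> import org.apache.avro.io.BinaryDecoder;
> import org.apache.avro.io.DecoderFactory;
> import org.apache.flume.Context;
> import org.apache.flume.Event;
> import org.apache.flume.serialization.AbstractAvroEventSerializer;
> import org.apache.flume.serialization.EventSerializer;
>
> public class MySerializer extends AbstractAvroEventSerializer<GenericRecord> {
>
>   // Placeholder -- my very large schema goes here
>   private static final Schema SCHEMA = new Schema.Parser().parse("{ ... }");
>
>   private final OutputStream out;
>
>   private MySerializer(OutputStream out) {
>     this.out = out;
>   }
>
>   @Override
>   protected Schema getSchema() {
>     return SCHEMA;
>   }
>
>   @Override
>   protected OutputStream getOutputStream() {
>     return out;
>   }
>
>   @Override
>   protected GenericRecord convert(Event event) {
>     // The Kafka message body is already Avro-encoded, so decode it back
>     // into a datum matching the schema before it is appended; handing Avro
>     // the raw Event (or its bytes) is the kind of mismatch that produces
>     // the UnresolvedUnionException quoted above.
>     // (If the producer used Confluent's wire format, the body also carries
>     // a leading magic byte plus a 4-byte schema id that must be skipped.)
>     try {
>       BinaryDecoder decoder =
>           DecoderFactory.get().binaryDecoder(event.getBody(), null);
>       return new GenericDatumReader<GenericRecord>(SCHEMA).read(null, decoder);
>     } catch (IOException e) {
>       throw new RuntimeException("Failed to decode Avro event body", e);
>     }
>   }
>
>   public static class Builder implements EventSerializer.Builder {
>     @Override
>     public EventSerializer build(Context context, OutputStream out) {
>       MySerializer serializer = new MySerializer(out);
>       serializer.configure(context);
>       return serializer;
>     }
>   }
> }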
> Any help, docs, or recommendations would be much appreciated, as I am stuck
> on this.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)