Hi fellow flummers,

I have been struggling with Flume for a couple of weeks. I am trying to log events to Amazon S3 so that later I can use Amazon EMR to analyze them.
The architecture I am trying to build is:

The client posts bzip2-compressed data -> an endpoint decompresses the data and attaches extra data (like HTTP headers) -> the endpoint writes the data to a file on the local file system -> a Flume agent tails that file -> the agent sends the events to a Flume collector -> the collector writes the events to S3, bzip2-compressed

After some effort I got this architecture working for small events. The problem is that the events I need to store are large (72 KB expanded), and I have no control over the client (it posts large compressed XML files and I can't change this behavior), so the architecture has to be able to handle events of this size.

So I have been considering two approaches, and I wanted to share them with you and hear what you think:

1. Flume's default maximum event size is 32 KB, but it can accept larger events if the "flume.event.max.size.bytes" property is raised. I tried that, but:
    a. I am worried about the performance impact.
    b. It didn't work well: the events it writes appear to be truncated, and it also keeps re-writing them endlessly.
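For reference, this is the kind of override I tried (assuming the Hadoop-style flume-site.xml configuration; the value is just an example sized for our 72 KB events, not a recommendation):

```xml
<!-- conf/flume-site.xml: raise the per-event size cap -->
<configuration>
  <property>
    <name>flume.event.max.size.bytes</name>
    <value>131072</value> <!-- 128 KB -->
  </property>
</configuration>
```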

2. Shipping the event through Flume still bzip2-compressed (i.e., not decompressing it at the endpoint) to S3, and decompressing it in EMR later. In that case:
    a. In what format should I store the events?
    b. How would I enrich the data with the request headers?
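To make option 2 concrete, one format I was toying with (just a sketch; the field names are made up) is one JSON line per event, with the headers as fields and the still-compressed payload base64-encoded, so the collector ships plain text lines and EMR can decode and decompress each line on its own:

```python
import base64
import bz2
import json

def encode_event(compressed_body: bytes, headers: dict) -> str:
    """Wrap a still-bzip2-compressed payload and its request headers
    into a single JSON line for line-oriented processing."""
    record = {
        "headers": headers,  # enrichment data from the endpoint
        "payload": base64.b64encode(compressed_body).decode("ascii"),
    }
    return json.dumps(record)

def decode_event(line: str) -> tuple:
    """EMR side: recover the original XML and the headers."""
    record = json.loads(line)
    xml = bz2.decompress(base64.b64decode(record["payload"]))
    return xml, record["headers"]
```

This keeps each Flume event as a single line of text, which sidesteps the newline problems of raw binary payloads, at the cost of the ~33% base64 overhead.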


Thanks for your time.



Guy Doulberg
