[ 
https://issues.apache.org/jira/browse/FLUME-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504947#comment-14504947
 ] 

Edward Sargisson commented on FLUME-2390:
-----------------------------------------

[~bfiorini] your analysis is correct but I would disagree with the solution.
We use the [~rore] solution ourselves and generate an ID as early in the 
pipeline as we can manage. This means that our system can use Last Write Wins 
and overwrite any record with that ID.
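
To illustrate why an early ID helps, here is a minimal sketch in Python. It is a toy model, not Flume or Elasticsearch code: `ToyIndex` and `tag_with_id` are hypothetical names, and the store is just a dict keyed by document ID. It replays the same batch three times, as a retrying sink would.

```python
import uuid


class ToyIndex:
    """Hypothetical stand-in for an index that stores documents by ID."""

    def __init__(self):
        self.docs = {}

    def index(self, doc, doc_id=None):
        # Without an explicit ID the store invents one, so a replayed
        # event becomes a brand-new document.
        key = doc_id if doc_id is not None else uuid.uuid4().hex
        self.docs[key] = doc  # same key: the last write wins


def tag_with_id(body):
    """Attach an ID once, as early in the pipeline as possible."""
    return {"id": uuid.uuid4().hex, "body": body}


events = [tag_with_id(b) for b in ("a", "b", "c")]

without_ids = ToyIndex()
with_ids = ToyIndex()

# Deliver the same batch three times, as a retrying sink would.
for _ in range(3):
    for event in events:
        without_ids.index(event["body"])
        with_ids.index(event["body"], doc_id=event["id"])

print(len(without_ids.docs))  # 9: every replay added duplicates
print(len(with_ids.docs))     # 3: replays overwrote the same records
```

The replay still costs work, but it no longer multiplies the stored records.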

As for #2, I think that if you've got a mapping problem then you need to fix 
the mapping problem. Sadly, Flume has the head-of-line blocking problem (aka 
poison pill), so everything blocks. A more general solution is to have a dead 
letter queue.
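
As a rough sketch of the dead letter queue idea (Python, with hypothetical names; the stock sink discussed here does not work this way), an event that fails validation is set aside with the reason for the failure instead of blocking everything behind it. `toy_index` and its empty-message check are assumptions standing in for a mapping rejection.

```python
class MappingError(Exception):
    """Raised by the toy indexer when an event does not fit the mapping."""


def toy_index(event):
    # Hypothetical validation standing in for a mapping rejection.
    if not event.get("message"):
        raise MappingError("empty message")
    return event


def drain(queue, dead_letters):
    """Deliver what we can; park rejected events instead of blocking."""
    delivered = []
    for event in queue:
        try:
            delivered.append(toy_index(event))
        except MappingError as err:
            # Keep the event and the reason so it can be inspected later.
            dead_letters.append({"event": event, "reason": str(err)})
    return delivered


queue = [{"message": "ok-1"}, {"message": ""}, {"message": "ok-2"}]
dead_letters = []
delivered = drain(queue, dead_letters)

print([e["message"] for e in delivered])  # ['ok-1', 'ok-2']
print(dead_letters[0]["reason"])          # empty message
```

The good events behind the bad one still get through, and the bad one is kept for later inspection rather than lost.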

In dev we simply delete the queue. In production-like environments, when we 
want to practice keeping the data, we:
0. back up the file channel directories 
1. set the batch size to 1 and let all the events through until it blocks 
(otherwise you may have hundreds of good events ahead of the poison pill).
2. use the FileRollSink to flush the queue out to a file (remember to turn 
headers on)
3. then we can look at the first event and figure out why it blocked
4. then we can either fix the problem in Elasticsearch's mappings, restore the 
file channel and let it run or, in theory, take the poison pill out and 
re-ingest the events.
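
The reason for step 1 can be shown with a toy model in Python. It assumes a batch is all-or-nothing from the channel's point of view, and it stops at the first failed batch where a real sink would keep retrying. `run_sink` and `make_channel` are hypothetical names, not Flume APIs.

```python
from collections import deque

POISON = "poison"


def run_sink(channel, batch_size):
    """Toy transactional sink: a batch either commits whole or rolls back."""
    delivered = []
    while channel:
        batch = [channel[i] for i in range(min(batch_size, len(channel)))]
        if POISON in batch:
            break  # rollback: nothing in this batch leaves the channel
        for _ in batch:
            delivered.append(channel.popleft())
    return delivered


def make_channel():
    # Five good events, then the poison pill, then one more good event.
    return deque([f"good-{i}" for i in range(5)] + [POISON, "good-5"])


big = make_channel()
delivered_big = run_sink(big, batch_size=100)

small = make_channel()
delivered_small = run_sink(small, batch_size=1)

print(len(delivered_big))    # 0: the good events are stuck in the bad batch
print(len(delivered_small))  # 5: everything ahead of the pill got through
print(small[0])              # poison: the pill is now first in the queue
```

With the batch size at 1, the event at the head of the queue once things block is the one worth inspecting in step 3.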

Yes, all of that is a complete pain!

> Flume-ElasticSearch Data gets posted multiple times when one of the event 
> fail validation at elastic search sink for JSON Data
> ------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLUME-2390
>                 URL: https://issues.apache.org/jira/browse/FLUME-2390
>             Project: Flume
>          Issue Type: Bug
>          Components: Sinks+Sources
>    Affects Versions: v1.4.0
>         Environment: CDH4.5
>            Reporter: Deepak Subhramanian
>
> Hi,
> I am using the Elasticsearch sink to post JSON data. I used the temporary fix 
> mentioned in https://issues.apache.org/jira/browse/FLUME-2126 to get JSON 
> data posted to Elasticsearch. When one of the messages fails validation 
> against the Elasticsearch mapping for JSON data (for example, an empty 
> message), Flume seems to post the entire batch again and again until I 
> restart Flume. Because of that, the number of events went from an average of 
> 100 to an average of 2000 per 10 minutes. As a temporary fix I set a header 
> in my Flume HTTP source for non-valid JSON and used an interceptor to send 
> data to multiple Elasticsearch sinks that have different index names. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
