[
https://issues.apache.org/jira/browse/FLINK-7782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16197028#comment-16197028
]
Ajay commented on FLINK-7782:
-----------------------------
Here is an example of a simple event I am trying to detect. The first and last
line are the interesting points/events. The CEP library is not able to detect
something like that.
4> (96,Sat Sep 30 22:30:25 UTC 2017,complex
event,Low,32.781082,-117.01864,12.0,20.0)
4> (96,Sat Sep 30 22:30:26 UTC 2017,complex
event,High,32.781082,-117.01864,0.0235,20.0)
4> (96,Sat Sep 30 22:30:27 UTC 2017,complex
event,High,32.781082,-117.01864,0.02319611,20.0)
4> (96,Sat Sep 30 22:30:28 UTC 2017,complex
event,Medium,32.781082,-117.01864,0.023357224,20.0)
4> (96,Sat Sep 30 22:30:29 UTC 2017,complex
event,Low,32.781082,-117.01864,0.060904443,20.0)
4> (96,Sat Sep 30 22:30:30 UTC 2017,complex
event,Medium,32.781082,-117.01864,0.100115,20.0)
4> (96,Sat Sep 30 22:30:31 UTC 2017,complex
event,High,32.781082,-117.01864,0.12398389,20.0)
4> (96,Sat Sep 30 22:30:32 UTC 2017,complex
event,Medium,32.781082,-117.01864,0.15611167,20.0)
4> (96,Sat Sep 30 22:30:33 UTC 2017,complex
event,Low,32.781082,-117.01864,0.15817556,20.0)
4> (96,Sat Sep 30 22:30:34 UTC 2017,complex
event,Low,32.781082,-117.01864,0.09934334,20.0)
4> (96,Sat Sep 30 22:30:35 UTC 2017,complex
event,High,32.781082,-117.01864,16.0,20.0)
Notes about this experiment.
1. Only one kafka partition and just one topic
2. Flink env parallelism set to 4 and I am using AscendingTimestampExtractor on
KafkaSource09.
3. In the data above, the first element is the id that I use for keyBy
4. I started 4 Kafka producers in parallel with a random delay between them
5. Each producer sends 10000 rows from a csv at an average of 18 seconds. Of
the data from 4 producers, the events for only one was detected.
6. Looking at the log files, I print on the stream and see all 40000 lines
where each id is associated with one process number. In the above data 96 is
only associated with 4. In this case there is just one partition in Kafka. If I
were to increase the number of partitions each id is spread across multiple
processes.
7. I had ran another test with a different set of 4 ids just before the one
I've presented above and I expected to see 148 events for 4 ids and I saw all
of them being captured. I did not change anything as far as delays in the
producer.
> Flink CEP not recognizing pattern
> ---------------------------------
>
> Key: FLINK-7782
> URL: https://issues.apache.org/jira/browse/FLINK-7782
> Project: Flink
> Issue Type: Bug
> Reporter: Ajay
>
> I am using flink version 1.3.2. Flink has a kafka source. I am using
> KafkaSource9. I am running Flink on a 3 node AWS cluster with 8G of RAM
> running Ubuntu 16.04. From the flink dashboard, I see that I have 2
> Taskmanagers & 4 Task slots
> What I observe is the following. The input to Kafka is a json string and when
> parsed on the flink side, it looks like this
> {code:java}
> (101,Sun Sep 24 23:18:53 UTC 2017,complex
> event,High,37.75142,-122.39458,12.0,20.0)
> {code}
> I use a Tuple8 to capture the parsed data. The first field is home_id. The
> time characteristic is set to EventTime and I have an
> AscendingTimestampExtractor using the timestamp field. I have parallelism for
> the execution environment is set to 4. I have a rather simple event that I am
> trying to capture
> {code:java}
> DataStream<Tuple8<Integer,Date,String,String,Float,Float,Float, Float>>
> cepMapByHomeId = cepMap.keyBy(0);
> //cepMapByHomeId.print();
>
> Pattern<Tuple8<Integer,Date,String,String,Float,Float,Float,Float>, ?> cep1 =
>
> Pattern.<Tuple8<Integer,Date,String,String,Float,Float,Float,Float>>begin("start")
> .where(new OverLowThreshold())
> .followedBy("end")
> .where(new OverHighThreshold());
> PatternStream<Tuple8<Integer, Date, String, String, Float, Float,
> Float, Float>> patternStream = CEP.pattern(cepMapByHomeId, cep1);
> DataStream<Tuple7<Integer, Date, Date, String, String, Float,
> Float>> alerts = patternStream.select(new PackageCapturedEvents());
> {code}
> The pattern checks if the 7th field in the tuple8 goes over 12 and then over
> 16. The output of the pattern is like this
> {code:java}
> (201,Tue Sep 26 14:56:09 UTC 2017,Tue Sep 26 15:11:59 UTC 2017,complex
> event,Non-event,37.75837,-122.41467)
> {code}
> On the Kafka producer side, I am trying send simulated data for around 100
> homes, so the home_id would go from 0-100 and the input is keyed by home_id.
> I have about 10 partitions in kafka. The producer just loops going through a
> csv file with a delay of about 100 ms between 2 rows of the csv file. The
> data is exactly the same for all 100 of the csv files except for home_id and
> the lat & long information. The timestamp is incremented by a step of 1 sec.
> I start multiple processes to simulate data form different homes.
> THE PROBLEM:
> Flink completely misses capturing events for a large subset of the input
> data. I barely see the events for about 4-5 of the home_id values. I do a
> print before applying the pattern and after and I see all home_ids before and
> only a tiny subset after. Since the data is exactly the same, I expect all
> homeid to be captured and written to my sink.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)