Hi,

I have two nearly identical Spark pipeline applications, but I have found a
significant difference in their performance.

Basically, the first application consumes a stream from Kafka, slices it
into one-minute batches, calculates a score for each record using an
already loaded machine learning model, and writes the scored results to a
database.

The second application, however, ends up stopping after about 7 hours of
continuous running, and I observed that each batch job takes longer to
complete than the earlier ones. Besides consuming the same streaming data
as the first application, this second application has a couple of
additional steps related to records aggregation.

I'd like to ask: since this records aggregation is the only difference
between the two applications, can it explain why my second application's
streaming batch jobs take gradually longer to complete?

I'd appreciate any help, clue, or tip that would help me understand what
is going on with this second application.

Thank you,

Saulo
