On Feb 6, 2009, at 7:06 AM, Stefan Podkowinski wrote:

Another scenario I just recognized: what about current/"realtime"
data? E.g. 'select * from logs where date = today()'. Working with
'offset' may turn out to return different results after the table has
been updated and tasks are still pending. Pretty ugly to trace down
this condition, after you found out that sometimes your results are
just not right..

In fairness, this isn't an issue unique to streaming data into Hadoop from an RDBMS. The "today()" routine is non-functional -- it returns a different answer from the same arguments across multiple calls. Whenever you put that behavior into your computing infrastructure, you make it impossible to reproduce task behavior. This is a big issue in a system that uses speculative execution...

Simple answer is to avoid side effects. More complicated answer is to understand what they are, and think about them when you design your data processing flow.
                                                mike

Reply via email to