Hi,
I have spent several hour digging into strange issue with DirectRunner,
that manifested as non-deterministic run of pipeline. The pipeline
contains basically only single stateful ParDo, which adds elements into
state and after some timeout flushes these elements into output. The
issues was, that sometimes (very often) when the timer fired, the state
appeared to be empty, although I actually added something into the
state. I will skip details, but the problem boils down to the fact, that
StateSpecs hash Coder into hashCode - e.g.
@Override
public int hashCode() {
return Objects.hash(getClass(), coder);
}
in ValueStateSpec. Now, when Coder doesn't have hashCode and equals
implemented (and there are some of those in the codebase itself - e.g.
SchemaCoder), it all blows up in a very hard-to-debug manner. So the
proposal is - either to add abstract hashCode and equals to Coder, or
don't hash the Coder into hashCode of StateSpecs (we can generate unique
ID for each StateSpec instance for example).
Any thoughts about which path to follow? Or maybe both? :)
Jan