<snip>
Keith Turner wrote:
> Assuming batches were isolated from each other, and all batch/mutation
> flushes were controlled and done once per batch, is it difficult because
> the writes could be going to different tablet servers? Couldn't we keep
> track of which failed and have a choice of having a configurable internal
> retry (transient errors) or return the subset of mutations which failed and
> leave it up to the caller? This could work for us. We might need some
> guarantees for a given row on the same server though - would have to think
> about that.
The batch writer does retry on network errors (until the timeout is
reached, which defaults to the max long or int value). I think the only
things that percolate up to the user are unexpected exceptions in the
batch writer or tserver, and constraint violations. Are you interested
in knowing which mutations failed because of a timeout? I don't think
that can be done w/o introducing a more expensive multi-step protocol
for writing data. Currently, when the batch writer sends data, it's
possible that the tserver received it and wrote it, but could not
report success to the client. The client may then either time out or
send the data again.
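To make the ambiguity concrete, here's a toy sketch (not Accumulo's actual API or protocol - the class and method names are made up for illustration): the server may apply a write even when the ack never reaches the client, so after a timeout the client can only resend, and the apply step must tolerate duplicates.

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration (hypothetical names, not Accumulo code): a "tserver"
// that applies a write, after which the ack may be lost in transit. The
// client cannot distinguish "lost before apply" from "lost after apply",
// so on timeout it resends the same mutation.
class ToyTabletServer {
    final Map<String, String> store = new HashMap<>();
    int applies = 0;

    // Returns true if the ack reaches the client, false if it is "lost".
    boolean write(String row, String value, boolean ackLost) {
        store.put(row, value); // idempotent: applying the same put twice is harmless
        applies++;
        return !ackLost;
    }
}

public class RetryDemo {
    public static void main(String[] args) {
        ToyTabletServer server = new ToyTabletServer();

        // First attempt: the write is applied, but the ack is lost,
        // so the client sees a timeout.
        boolean acked = server.write("row1", "v1", true);
        if (!acked) {
            // The client's only safe move is to resend the mutation.
            acked = server.write("row1", "v1", false);
        }

        System.out.println(acked);                    // true
        System.out.println(server.applies);           // 2 -- applied twice
        System.out.println(server.store.get("row1")); // v1 -- state still correct
    }
}
```

This is why reporting exactly which mutations "failed" on timeout would need a multi-step protocol: the write may have succeeded even though the client never heard so.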
It's trickier because server-side, we're also doing group commits to the
WAL. Your update session (started by the BatchWriter) will make some
updates to the WAL and block until those are sync'ed. That sync may
cover WAL updates other than your own.
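The group-commit idea can be sketched like this (a toy model, not Accumulo's WAL implementation - all names here are invented for illustration): several sessions append to a shared log, and a single sync makes everything pending durable at once, which is why a sync can't be cleanly attributed to one writer's mutations.

```java
import java.util.ArrayList;
import java.util.List;

// Toy group-commit log (hypothetical, not Accumulo's WAL code): multiple
// sessions append updates, and one sync() persists every pending update,
// regardless of which session appended it.
class GroupCommitLog {
    private final List<String> pending = new ArrayList<>();
    private final List<String> durable = new ArrayList<>();
    int syncs = 0;

    synchronized void append(String update) {
        pending.add(update);
    }

    // One sync covers all pending updates -- a session blocked here is
    // waiting on an fsync that also includes other sessions' writes.
    synchronized void sync() {
        durable.addAll(pending);
        pending.clear();
        syncs++;
    }

    synchronized int durableCount() {
        return durable.size();
    }
}

public class GroupCommitDemo {
    public static void main(String[] args) {
        GroupCommitLog log = new GroupCommitLog();
        log.append("sessionA: put row1"); // your update session
        log.append("sessionB: put row9"); // some other session
        log.sync();                       // one fsync makes both durable
        System.out.println(log.syncs);          // 1
        System.out.println(log.durableCount()); // 2
    }
}
```

The batching is the point: one fsync amortized over many sessions is much cheaper than one per update, but it entangles the error reporting.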
That said, I'm not sure under what conditions Accumulo would
"normally" throw you such an error (i.e., one not related to HDFS being
hosed or something). Maybe the HoldTimeoutException (tserver being too
busy)? I'd have to lock myself in a room and really take a good look at
this stuff again to refresh the cases where Accumulo might actually
apply an update but still send you an error... Maybe this isn't as big
a concern as I'm making it out to be :)