ctubbsii commented on PR #2665:
URL: https://github.com/apache/accumulo/pull/2665#issuecomment-1154733877

   @keith-turner wrote:
   > > It's not the scan execution hints that are modifying the behavior... 
it's the configured dispatcher. And, the scan hints are still not affecting the 
data returned... it's the server that it was dispatched to that is doing that.
   > 
   > If scan hints+config can change the behavior of a scanner from immediate 
to eventual I think this could lead to disaster. Consider something like the 
Accumulo GC algorithm where its correctness relies on only using scanners with 
immediate consistency. Consider the following situation.
   > 
   > * Person A writes a scanner that requires immediate consistency and sets a 
scan hint with intention of changing cache behavior to be opportunistic.
   > * Later Person B changes Accumulo configuration such that it causes the 
scan hints set by person A to now make the scanner coded by person A be 
eventually consistent.
   
   I think it's an exaggeration to call this a disaster. Scan hints controlling 
a specific configured dispatcher's behavior should already be documented in 
that dispatcher's documentation and stable before users can rely on it for 
stable behavior. This is not a problem. If we change the semantics of *any* 
configuration, we can break things users intended with their previous 
configuration. This situation is no different... scan hints are just 
configuration for a dispatcher that are semantically constrained to that 
dispatcher's documented behavior.
   
   Having this as behavior with an explicit API method to configure isn't any 
different. A configured dispatcher could just ignore that configuration and 
dispatch to an eventually consistent ScanServer instead of a TabletServer. 
Hints are just another kind of configuration. Whether that configuration is set 
by an API with a different name, or set by the API that sets hints, we're in 
the same situation... we have to trust the dispatcher that the user has 
configured to do the thing we expect it to, based on whatever configuration is 
set on the scan task, regardless of how it is set.
   
   The main difference here, is that it already logically makes sense to use 
scan hints to modify the dispatcher behavior, because that's what that 
configuration is *for*.
   
   > 
   > If the code in question were the Accumulo GC, this could cause files to be 
deleted when they should not be. Eventual vs immediate consistency is so 
important to some algorithms that it should always be explicitly declared per 
scanner and never be overridden by config that could impact all scanners in an 
indiscriminate manner without consideration of individual circumstances and per 
scanner intent.
   
   I think the discussion of the accumulo-gc is a bit of a red herring. That 
scans the metadata. It is already well documented that all metadata scans are 
always dispatched to an executor named "meta", and should always be immediately 
consistent. Even it if it wasn't, though, I don't think by setting the scan 
configuration via executor hints is substantially different than setting other 
scan configuration via other APIs. It's all configuration, and the dispatcher's 
behavior ultimately has to be documented, known, and relied upon in order to 
get any kind of guarantees about any scan results.
   
   
   @dlmarion wrote:
   > Since immediate/strict consistency is the default, maybe we just need a 
method to disable it for a specific query instead of specifying the value. For 
example, `enableEventualConsistency()`, `relaxReadGuarantees()`, 
`disableConsistentReads()`, `allowStaleReads()`, etc.
   > 
   > I'm also thinking that there should be a table configuration that 
enables/disables this feature. Currently, an admin can spin up some ScanServers 
and an application developer can enable eventual consistency, but do we want 
that on the `metadata` table for example?
   
   I would like to keep configuration simple. I've read so many articles about 
software complexity killing projects, and I think Accumulo is already in that 
risky area, where every new complex feature we add, often for niche use cases, 
adds an obtuse amount of complexity. We already have an overwhelming amount of 
single-purpose configuration elements that micro-manage so many elements of 
Accumulo's operations. We have an opportunity here to keep things *simple*. The 
dispatcher is already one such configurable component. If the dispatcher is 
responsible for deciding which server to use, and we already have a way to pass 
configuration to a dispatcher through the scan hints, then I don't see why we 
need to have additional configuration that add to the bloat. Let's be 
modular... let's let the configurable dispatcher to the work. We can add this 
feature without any additional user facing complexity... if we recognize that 
scan hints are merely dispatcher configuration, and the dispatcher 
 is already a pluggable module, and all of these configurations are already 
per-table or per-scan.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to