ctubbsii commented on PR #2665: URL: https://github.com/apache/accumulo/pull/2665#issuecomment-1154733877
@keith-turner wrote: > > It's not the scan execution hints that are modifying the behavior... it's the configured dispatcher. And, the scan hints are still not affecting the data returned... it's the server that it was dispatched to that is doing that. > > If scan hints+config can change the behavior of a scanner from immediate to eventual I think this could lead to disaster. Consider something like the Accumulo GC algorithm where its correctness relies on only using scanners with immediate consistency. Consider the following situation. > > * Person A writes a scanner that requires immediate consistency and sets a scan hint with intention of changing cache behavior to be opportunistic. > * Later Person B changes Accumulo configuration such that it causes the scan hints set by person A to now make the scanner coded by person A be eventually consistent. I think it's an exaggeration to call this a disaster. Scan hints controlling a specific configured dispatcher's behavior should already be documented in that dispatcher's documentation and stable before users can rely on it for stable behavior. This is not a problem. If we change the semantics of *any* configuration, we can break things users intended with their previous configuration. This situation is no different... scan hints are just configuration for a dispatcher that are semantically constrained to that dispatcher's documented behavior. Having this as behavior with an explicit API method to configure isn't any different. A configured dispatcher could just ignore that configuration and dispatch to an eventually consistent ScanServer instead of a TabletServer. Hints are just another kind of configuration. Whether that configuration is set by an API with a different name, or set by the API that sets hints, we're in the same situation... we have to trust the dispatcher that the user has configured to do the thing we expect it to, based on whatever configuration is set on the scan task, regardless of how it is set. The main difference here, is that it already logically makes sense to use scan hints to modify the dispatcher behavior, because that's what that configuration is *for*. > > If the code in question were the Accumulo GC, this could cause files to be deleted when they should not be. Eventual vs immediate consistency is so important to some algorithms that it should always be explicitly declared per scanner and never be overridden by config that could impact all scanners in an indiscriminate manner without consideration of individual circumstances and per scanner intent. I think the discussion of the accumulo-gc is a bit of a red herring. That scans the metadata. It is already well documented that all metadata scans are always dispatched to an executor named "meta", and should always be immediately consistent. Even it if it wasn't, though, I don't think by setting the scan configuration via executor hints is substantially different than setting other scan configuration via other APIs. It's all configuration, and the dispatcher's behavior ultimately has to be documented, known, and relied upon in order to get any kind of guarantees about any scan results. @dlmarion wrote: > Since immediate/strict consistency is the default, maybe we just need a method to disable it for a specific query instead of specifying the value. For example, `enableEventualConsistency()`, `relaxReadGuarantees()`, `disableConsistentReads()`, `allowStaleReads()`, etc. > > I'm also thinking that there should be a table configuration that enables/disables this feature. Currently, an admin can spin up some ScanServers and an application developer can enable eventual consistency, but do we want that on the `metadata` table for example? I would like to keep configuration simple. I've read so many articles about software complexity killing projects, and I think Accumulo is already in that risky area, where every new complex feature we add, often for niche use cases, adds an obtuse amount of complexity. We already have an overwhelming amount of single-purpose configuration elements that micro-manage so many elements of Accumulo's operations. We have an opportunity here to keep things *simple*. The dispatcher is already one such configurable component. If the dispatcher is responsible for deciding which server to use, and we already have a way to pass configuration to a dispatcher through the scan hints, then I don't see why we need to have additional configuration that add to the bloat. Let's be modular... let's let the configurable dispatcher to the work. We can add this feature without any additional user facing complexity... if we recognize that scan hints are merely dispatcher configuration, and the dispatcher is already a pluggable module, and all of these configurations are already per-table or per-scan. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
