On Wed, Mar 29, 2023 at 5:33 AM Dave Marion <dmario...@gmail.com> wrote:
>
> > I think we should deprecate support for offline table scanning, since it
> > shouldn't be needed with the availability of ScanServers.
>
> Just making sure I understand your suggestion - you mean removing the
> OfflineScanner and the ability to scan over offline tables in the MapReduce
> code, but we should continue our efforts to allow Scan Servers to scan
> offline tables, right?

Yes to removing OfflineScanner. But the rest of that isn't quite what
I was thinking. What I was trying to say is that with elasticity
features, unless immediate consistency is required, the ScanServer's
ability to scan should not depend on the tablets being "hosted" for
live ingest. Using the ScanServer on the table's "unhosted" tablets is
enough to replace the need for the OfflineScanner, I think. So, yes,
we should continue our efforts to allow ScanServers to scan tables
with "unhosted" tablets.

Whether we describe that as the ScanServer scanning an "offline" table
depends on how we define "online" and "offline". Currently, without
elastic features in place, that would only happen if we mark the table
in an "offline" state, but once all the elastic features are in place,
I think this would still be considered "ondemand" or "online, as in
available for use, but not pinned for live ingest / unhosted". A lot
of this is more about how we communicate the state (naming, concepts,
etc.), and depends on the rest of my email, rather than affecting the
actual features we're supporting. We should still plan for ScanServer
to scan "unhosted" tablets, regardless of what state we end up calling
it.

>
> > As for "ondemand" table state, from a user perspective, I'm not sure what
> > it means...
>
> I have been thinking about it as "online" means always hosted, "ondemand"
> means hosted as needed, and "offline" means never hosted.

Rather than have a mapping from what these mean to how they behave, I
think it would be better to have the names directly reflect the user
experience. If we say "online" means "always hosted", then just call
it "hosted". I think we really need the following states to match the
user experience:

(online, live)
(online, live-on-demand)
(online)
(offline / immutable)

But, I think the first three states should really just be considered
one state, with the "live"-ness being configurable.
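
To make the collapsed model concrete, here is a minimal sketch of what I
have in mind. It is only an illustration: the enum names, the `allowed`
method, and the choice to permit nothing while offline are assumptions
made for this sketch, not Accumulo API or a settled design.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical model of the two-state idea; none of these names are Accumulo API. */
public class TableStateSketch {

    /** The only two user-facing states in this sketch. */
    enum TableState { ONLINE, OFFLINE }

    /** Hypothetical per-table "live"-ness setting, consulted only when the table is ONLINE. */
    enum Liveness { LIVE, LIVE_ON_DEMAND, NONE }

    /** Interactions discussed in the thread. */
    enum Operation { SCAN_SERVER_READ, IMMEDIATE_CONSISTENCY_SCAN, LIVE_INGEST }

    /** Returns the operations a user could expect to work for the given state and setting. */
    static Set<Operation> allowed(TableState state, Liveness liveness) {
        // Assumption for this sketch: nothing is permitted while OFFLINE.
        if (state == TableState.OFFLINE) {
            return EnumSet.noneOf(Operation.class);
        }
        // ONLINE always permits ScanServer reads, which do not need hosted tablets.
        EnumSet<Operation> ops = EnumSet.of(Operation.SCAN_SERVER_READ);
        // Either form of "live"-ness adds the operations that need hosted tablets.
        if (liveness != Liveness.NONE) {
            ops.add(Operation.IMMEDIATE_CONSISTENCY_SCAN);
            ops.add(Operation.LIVE_INGEST);
        }
        return ops;
    }

    public static void main(String[] args) {
        for (Liveness l : Liveness.values()) {
            System.out.println("ONLINE/" + l + " -> " + allowed(TableState.ONLINE, l));
        }
        System.out.println("OFFLINE -> " + allowed(TableState.OFFLINE, Liveness.NONE));
    }
}
```

The point of the sketch is that the user only ever reasons about one
switch (online or offline) plus one optional setting, rather than
consulting a matrix of states against operations.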

>
> > is the "on-demand availability" applicable only for live ingest /
> > immediate consistency? Is it still "always available" for bulk import /
> > ScanServers? Or does "on-demand availability" somehow apply to all
> > interactions, including bulk import and ScanServer reads?
>
> We tried to reason about that in
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=247828052

I think that matrix is useful as we iron out the implementation, but I
don't think users should have to consult such a table in order to
understand what they can and cannot do in a given state. That's why I
think it's useful to have just the two states, with pretty much
everything working in the "online" state and nothing working in the
"offline" state, with the configurable "live"-ness that unlocks
additional features (specifically, immediate consistency scanning and
live ingest). Like buying a car, you get standard (online) or nothing
(offline), or you can get standard plus extras by paying the cost of
those extras (in this case, configuration of "live"-ness).

It's a bit weird to be able to do things in an offline state, or to
not be able to do things in an online state. But if it's framed as
"configure the extra features to use them", it's a bit more intuitive
to understand, because "online" still means "ready to go".

If it helps make things even more clear, long before we had the
"online" and "offline" states, I worked on an idea for calling the
tables "enabled" and "disabled". I abandoned that idea (and the
transition states that accompanied them) when these states were
introduced. But, I still think those terms might be better, in that
they don't imply any relationship to "hosted" or "unhosted"... just
whether or not they were usable by the user, which is a better way of
framing things, I think.

If enabled/disabled terms were used in place of online/offline, the
states could be:

(enabled, hosted)
(enabled, on-demand)
(enabled, unhosted)
(disabled)

Or collapsing the first 3 again, just:

(enabled) - hosted status is tunable/configurable
(disabled)

>
> Regarding the rest of your email, I think removing the ondemand state would
> be ok. The ondemand commits added a new property for the user to specify
> which tablet unloader class[1] to use, with the default being [2]. We could
> add a new default implementation that does not unload and users would have
> to opt-in to unloading by setting the property for their online tables.
> However, there is some code surrounding the new ondemand state that we would
> need to address. For example, when a TabletServer is low on memory it
> doesn't call the specified TabletUnloader, it just unloads a Tablet.
>
> [1]
> https://github.com/apache/accumulo/blob/elasticity/core/src/main/java/org/apache/accumulo/core/spi/ondemand/OnDemandTabletUnloader.java
> [2]
> https://github.com/apache/accumulo/blob/elasticity/core/src/main/java/org/apache/accumulo/core/spi/ondemand/DefaultOnDemandTabletUnloader.java
>
> On Tue, Mar 28, 2023 at 10:27 AM Christopher <ctubb...@apache.org> wrote:
>
> > I think we should deprecate support for offline table scanning, since
> > it shouldn't be needed with the availability of ScanServers. Any
> > MapReduce that previously relied on scanning offline tables could be
> > made to use that instead.
> >
> > I agree there is a need to have an immutable table state, for which it
> > is possible to read, but no changes can be made.
> > However, even in that
> > "locked" state, one should still be able to perform surgery on its
> > metadata, or manually / surgically compact files (with the
> > understanding that doing so will interfere with any concurrent export
> > or scan operations that are relying on it being immutable, which I
> > think is a tolerable amount of risk, when actually in a situation
> > where such surgery is needed).
> >
> > As for "ondemand" table state, from a user perspective, I'm not sure
> > what it means... is the "on-demand availability" applicable only for
> > live ingest / immediate consistency? Is it still "always available"
> > for bulk import / ScanServers? Or does "on-demand availability"
> > somehow apply to all interactions, including bulk import and
> > ScanServer reads?
> >
> > I think the "ondemand" state is confusing, because it's exposing
> > internal state through to the user, and in a way that isn't as clear
> > as the simple "online/offline" states used to be. Previously, users
> > didn't need to understand what was going on internally... "online"
> > just meant "I can interact with this table", and "offline" meant "I
> > can't interact with this table". The user wasn't required to
> > understand what a tablet was, or how it was hosted, or anything of
> > that nature. As we started adding support for "offline" features, the
> > lines separating "online and offline" meaning "available and
> > unavailable" became blurred. As we proceed adding elasticity, I think
> > we should work to make things more clear and explicit again... and I
> > think "ondemand" as a table state, makes things even less clear when
> > the concept is exposed to the user as a separate table state.
> >
> > I do think we need some kind of on-demand availability for live-ingest
> > and immediate consistency in order to be more elastic, and from the
> > discussion, it's obvious we need an immutable table state, but I think
> > it's a mistake to expose the on-demand availability for live-ingest
> > and immediate consistency as a new table state. I think that should be
> > left as either some kind of automatic internal behavior, or as a
> > secondary fine-grained control over an online table (like pinned
> > tablets, either permanently pinned or temporally pinned, based on
> > activity).
> >
> > On Tue, Mar 28, 2023 at 9:51 AM Drew Farris <d...@apache.org> wrote:
> > >
> > > On Mon, Mar 27, 2023 at 2:16 PM Keith Turner <ke...@deenlo.com> wrote:
> > >
> > > > One realization that came out of examining the different table states
> > > > is that export table currently relies on the fact that offline tables
> > > > will not delete files. If we enable compactions on offline tables
> > > > then that could cause files to be deleted which would break the
> > > > expectation of export table.
> > > >
> > >
> > > This is a good point. I hadn't considered the potential breakage to
> > > export table. I suspect another concern could be the hadoop input
> > > format that operates over the rfiles in an offline table - and can do
> > > so relatively safely because the table is not expected to change
> > > while it is offline.
> > >
> > > So, it would seem that there is value in having an 'immutable' table
> > > state in the form of an offline table. Perhaps 'ondemand' is the
> > > alternate state that lets us do things like import, split, compact,
> > > merge, etc.