Re: [Gluster-devel] Data classification proposal
> Sounds like a metadata server would fix this!
>
> ( Yes, this is trolling hard.  Ignore. ;> )

Fortunately I'm on vacation now, so my head didn't explode. ;)

_______________________________________________
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Data classification proposal
On 27/06/2014, at 8:39 AM, Xavier Hernandez wrote:
> On Thursday 26 June 2014 12:52:13 Dan Lambright wrote:
>> I don't think brick splitting implemented by LVM would affect directory
>> browsing any more than adding an additional brick would.
>
> Yes, splitting a brick in LVM should be the same as adding a normal brick.
> The main problem I see is that adding normal bricks decreases browsing
> speed, so splitting bricks will also degrade it.
>
> I've seen a configuration with only 14 bricks (7 replica-2 sets) where
> browsing was not possible: directory listings of no more than a few hundred
> files took up to a minute or even more if the directory hadn't been
> accessed for a long time. This is not usable.
>
> This wasn't a hardware problem: the servers had 2 CPUs with 6 cores each
> and hyperthreading (24 cores total), 64 GB of RAM, and an Infiniband
> network. The file system was formatted using XFS.
>
> I fear what can happen if the number of bricks grows considerably through
> splitting without solving this problem first...

Sounds like a metadata server would fix this!

( Yes, this is trolling hard.  Ignore. ;> )

+ Justin

--
GlusterFS - http://www.gluster.org
An open source, distributed file system scaling to several petabytes, and handling thousands of clients.
My personal twitter: twitter.com/realjustinclift
Re: [Gluster-devel] Data classification proposal
On 06/27/2014 12:46 AM, Shyamsundar Ranganathan wrote:
> Wanted to add to the thought process a different angle towards thinking
> about data-classified volumes.
>
> One of the reasons for classifying data (be it tiering or others, like
> mapping high-profile users to high-profile storage backends) is to deal
> with its (i.e. the data's) protection differently. With the current model,
> as we discuss presenting the entire volume for consumption by clients of
> the file system, we should think about clients like backup, where the
> backup policy for one sub-volume could differ from the backup policy for
> another (or, say, geo-replication instead of backup). I would think other
> such use cases/clients would need to view parts of the volume, and not the
> whole, when attempting to perform their function.
>
> For example, in the backup case the fast tier could be backed up daily and
> the slow tier weekly, in which case one would need volume graphs that
> split this view for the client in question.

Agreed. The proposal sent by Joseph Fernandes a couple of days back suggests
something similar. You might want to look at the presentation he sent, with
the subject line "Proposal for Gluster Compliance Feature".

Regards,
Vivek

> Just a thought.
>
> Shyam
>
> ----- Original Message -----
>> From: "Dan Lambright"
>> To: "Jeff Darcy"
>> Cc: "Gluster Devel"
>> Sent: Monday, June 23, 2014 4:48:13 PM
>> Subject: Re: [Gluster-devel] Data classification proposal
>>
>> A frustrating aspect of Linux is the complexity of /etc configuration
>> file formats (rsyslog.conf, logrotate, cron, yum repo files, etc.). In
>> that spirit, I would simplify the "select" in the data classification
>> proposal (copied below) to only accept a list of bricks/sub-tiers with
>> wildcards '*', rather than full-blown regular expressions or key/value
>> pairs. I would drop the "unclaimed" keyword, and not have the keywords
>> "media type" and "rack". It does not seem necessary to introduce new
>> keys for the underlying block device type (SSD vs. disk) any more than
>> we need to express the filesystem (XFS vs. ext4). In other words, I
>> think tiering can be fully expressed in the configuration file while
>> still abstracting the underlying storage. That said, the configuration
>> file could be built up by a CLI or GUI, and richer expressibility could
>> exist at that level.
>>
>> example:
>>
>> brick host1:/brick ssd-group0-1
>> brick host2:/brick ssd-group0-2
>> brick host3:/brick disk-group0-1
>>
>> rule tier-1
>>     select ssd-group0*
>>
>> rule tier-2
>>     select disk-group0
>>
>> rule all
>>     select tier-1
>>     # use repeated "select" to establish order
>>     select tier-2
>>     type features/tiering
>>
>> The filtering option's regular expressions seem hard to avoid. Even if
>> just the name of the file satisfies most use cases (that we know of?), I
>> do not think there is any way to avoid regular expressions in the option
>> for filters. (Down the road, if we were to allow complete flexibility in
>> how files can be distributed across subvolumes, the filtering problems
>> may start to look similar to 90s-era packet classification, with a
>> solution along the lines of the Berkeley packet filter.)
>>
>> There may be different rules by which data is distributed at the
>> "tiering" level. For example, one tiering policy could treat the fast
>> tier (first listed) as a "cache" for the slow tier (second listed). I
>> think the "option" keyword could handle that:
>>
>> rule all
>>     select tier-1
>>     # use repeated "select" to establish order
>>     select tier-2
>>     type features/tiering
>>     option tier-cache, mode=writeback, dirty-watermark=80
>>
>> Another example tiering policy could be based on compliance; when a file
>> needs to become read-only, it moves from the first listed tier to the
>> second.
>>
>> rule all
>>     select tier-1
>>     # use repeated "select" to establish order
>>     select tier-2
>>     type features/tiering
>>     option tier-retention
>>
>> ----- Original Message -----
>> From: "Jeff Darcy"
>> To: "Gluster Devel"
>> Sent: Friday, May 23, 2014 3:30:39 PM
>> Subject: [Gluster-devel] Data classification proposal
>>
>> One of the things holding up our data classification efforts (which
>> include tiering but also other stuff as well) has been the extension of
>> the same conceptual model from the I/O path to the configuration
>> subsystem and ultimately to the user experience. How does an
>> administrator define a tiering policy without tearing their hair out?
>> How does s/he define a mixed replication/erasure-coding setup without
>> wanting to rip *our* hair out? The included Markdown document attempts
>> to remedy this by proposing one out of many possible models and user
>> interfaces. It includes examples for some of the most common use cases,
>> including the "replica 2.5" case we've been discussing recently.
>> Constructive feedback would be greatly appreciated.
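Dan's writeback-cache policy above (demote cold files once the fast tier crosses its dirty watermark) might behave roughly like the sketch below. Everything here - the class names, the LRU eviction choice, the watermark semantics - is invented for illustration and is not GlusterFS code:

```python
from dataclasses import dataclass, field

@dataclass
class Tier:
    name: str
    capacity: int          # bytes the tier can hold
    used: int = 0
    files: dict = field(default_factory=dict)   # filename -> size

class WritebackCachePolicy:
    """Toy model of 'option tier-cache, mode=writeback, dirty-watermark=80':
    writes land on the fast tier; when it is more than 80% full, the
    least-recently-written files are demoted to the slow tier."""

    def __init__(self, fast: Tier, slow: Tier, dirty_watermark: int = 80):
        self.fast, self.slow = fast, slow
        self.watermark = dirty_watermark
        self.lru = []                           # oldest write first

    def write(self, name: str, size: int) -> None:
        self.fast.used += size - self.fast.files.get(name, 0)
        self.fast.files[name] = size
        if name in self.lru:
            self.lru.remove(name)
        self.lru.append(name)
        # demote until usage drops back under the watermark
        while self.fast.used * 100 > self.fast.capacity * self.watermark:
            victim = self.lru.pop(0)
            vsize = self.fast.files.pop(victim)
            self.fast.used -= vsize
            self.slow.files[victim] = vsize
            self.slow.used += vsize

fast = Tier("tier-1", capacity=100)
slow = Tier("tier-2", capacity=1000)
policy = WritebackCachePolicy(fast, slow)
policy.write("a.dat", 50)
policy.write("b.dat", 40)    # tier-1 now 90% full, so "a.dat" is demoted
```

The point of the sketch is only that the whole policy hangs off a single `option` line in the rule, exactly as Dan's proposal keeps the config syntax minimal while the policy logic lives in the translator.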
Re: [Gluster-devel] Data classification proposal
On Thursday 26 June 2014 12:52:13 Dan Lambright wrote:
> I don't think brick splitting implemented by LVM would affect directory
> browsing any more than adding an additional brick would.

Yes, splitting a brick in LVM should be the same as adding a normal brick. The main problem I see is that adding normal bricks decreases browsing speed, so splitting bricks will also degrade it.

I've seen a configuration with only 14 bricks (7 replica-2 sets) where browsing was not possible: directory listings of no more than a few hundred files took up to a minute or even more if the directory hadn't been accessed for a long time. This is not usable.

This wasn't a hardware problem: the servers had 2 CPUs with 6 cores each and hyperthreading (24 cores total), 64 GB of RAM, and an Infiniband network. The file system was formatted using XFS.

I fear what can happen if the number of bricks grows considerably through splitting without solving this problem first...

Xavi
Re: [Gluster-devel] Data classification proposal
Wanted to add to the thought process a different angle towards thinking about data-classified volumes.

One of the reasons for classifying data (be it tiering or others, like mapping high-profile users to high-profile storage backends) is to deal with its (i.e. the data's) protection differently. With the current model, as we discuss presenting the entire volume for consumption by clients of the file system, we should think about clients like backup, where the backup policy for one sub-volume could differ from the backup policy for another (or, say, geo-replication instead of backup). I would think other such use cases/clients would need to view parts of the volume, and not the whole, when attempting to perform their function.

For example, in the backup case the fast tier could be backed up daily and the slow tier weekly, in which case one would need volume graphs that split this view for the client in question.

Just a thought.

Shyam

----- Original Message -----
> From: "Dan Lambright"
> To: "Jeff Darcy"
> Cc: "Gluster Devel"
> Sent: Monday, June 23, 2014 4:48:13 PM
> Subject: Re: [Gluster-devel] Data classification proposal
>
> A frustrating aspect of Linux is the complexity of /etc configuration
> file formats (rsyslog.conf, logrotate, cron, yum repo files, etc.). In
> that spirit, I would simplify the "select" in the data classification
> proposal (copied below) to only accept a list of bricks/sub-tiers with
> wildcards '*', rather than full-blown regular expressions or key/value
> pairs. I would drop the "unclaimed" keyword, and not have the keywords
> "media type" and "rack". It does not seem necessary to introduce new
> keys for the underlying block device type (SSD vs. disk) any more than
> we need to express the filesystem (XFS vs. ext4). In other words, I
> think tiering can be fully expressed in the configuration file while
> still abstracting the underlying storage. That said, the configuration
> file could be built up by a CLI or GUI, and richer expressibility could
> exist at that level.
>
> example:
>
> brick host1:/brick ssd-group0-1
> brick host2:/brick ssd-group0-2
> brick host3:/brick disk-group0-1
>
> rule tier-1
>     select ssd-group0*
>
> rule tier-2
>     select disk-group0
>
> rule all
>     select tier-1
>     # use repeated "select" to establish order
>     select tier-2
>     type features/tiering
>
> The filtering option's regular expressions seem hard to avoid. Even if
> just the name of the file satisfies most use cases (that we know of?), I
> do not think there is any way to avoid regular expressions in the option
> for filters. (Down the road, if we were to allow complete flexibility in
> how files can be distributed across subvolumes, the filtering problems
> may start to look similar to 90s-era packet classification, with a
> solution along the lines of the Berkeley packet filter.)
>
> There may be different rules by which data is distributed at the
> "tiering" level. For example, one tiering policy could treat the fast
> tier (first listed) as a "cache" for the slow tier (second listed). I
> think the "option" keyword could handle that:
>
> rule all
>     select tier-1
>     # use repeated "select" to establish order
>     select tier-2
>     type features/tiering
>     option tier-cache, mode=writeback, dirty-watermark=80
>
> Another example tiering policy could be based on compliance; when a file
> needs to become read-only, it moves from the first listed tier to the
> second.
>
> rule all
>     select tier-1
>     # use repeated "select" to establish order
>     select tier-2
>     type features/tiering
>     option tier-retention
>
> ----- Original Message -----
> From: "Jeff Darcy"
> To: "Gluster Devel"
> Sent: Friday, May 23, 2014 3:30:39 PM
> Subject: [Gluster-devel] Data classification proposal
>
> One of the things holding up our data classification efforts (which
> include tiering but also other stuff as well) has been the extension of
> the same conceptual model from the I/O path to the configuration
> subsystem and ultimately to the user experience. How does an
> administrator define a tiering policy without tearing their hair out?
> How does s/he define a mixed replication/erasure-coding setup without
> wanting to rip *our* hair out? The included Markdown document attempts
> to remedy this by proposing one out of many possible models and user
> interfaces. It includes examples for some of the most common use cases,
> including the "replica 2.5" case we've been discussing recently.
Re: [Gluster-devel] Data classification proposal
I don't think brick splitting implemented by LVM would affect directory browsing any more than adding an additional brick would.

----- Original Message -----
From: "Justin Clift"
To: "Dan Lambright"
Cc: "Shyamsundar Ranganathan", "Gluster Devel"
Sent: Thursday, June 26, 2014 12:01:16 PM
Subject: Re: [Gluster-devel] Data classification proposal

On 26/06/2014, at 4:54 PM, Dan Lambright wrote:
> Implementing brick splitting using LVM would allow you to treat each logical
> volume (split) as an independent brick. Each split would have its own
> .glusterfs subdirectory. I think this would help with taking snapshots as
> well.

Would brick splitting make directory browsing latency even scarier?

+ Justin
Re: [Gluster-devel] Data classification proposal
Implementing brick splitting using LVM would allow you to treat each logical volume (split) as an independent brick. Each split would have its own .glusterfs subdirectory. I think this would help with taking snapshots as well.

----- Original Message -----
From: "Shyamsundar Ranganathan"
To: "Krishnan Parthasarathi"
Cc: "Gluster Devel"
Sent: Thursday, June 26, 2014 11:13:48 AM
Subject: Re: [Gluster-devel] Data classification proposal

>>> For the short-term, wouldn't it be OK to disallow adding bricks that
>>> are not a multiple of group-size?
>>
>> In the *very* short term, yes. However, I think that will quickly
>> become an issue for users who try to deploy erasure coding, because
>> those group sizes will be quite large. As soon as we implement tiering,
>> our very next task - perhaps even before tiering gets into a release -
>> should be to implement automatic brick splitting. That will bring other
>> benefits as well, such as variable replication levels to handle the
>> sanlock case, or overlapping replica sets to spread a failed brick's
>> load over more peers.
>
> OK. Do you have some initial ideas on how we could 'split' bricks? I ask
> this to see if I can work on splitting bricks while the data
> classification format is being ironed out.

I see split bricks as creating a logical space for the new aggregate that the brick belongs to. This may not need data movement etc., but just a logical branching at the root of the brick for its membership. Are there counter-examples to this? Unless this changes the weighting of the brick across its aggregates - for example, size-based weighting for layout assignments, if we are considering schemes of that nature.

So I can see this as follows:

THE_Brick: /data/bricka
Belongs to: aggregate 1 and aggregate 2, so it gets the following structure beneath it:

/data/bricka/agg_1_ID/
/data/bricka/agg_2_ID/

Future splits of the brick add more aggregate-ID parents (not stating where or what this ID is, but assume it is something that distinguishes aggregates), and I would expect the xlator to send requests into its aggregate parent and not the root.

One issue I see with this is that if we wanted to snapshot an aggregate, we would have to snapshot the entire brick. Another is how we distinguish the .glusterfs space across the aggregates.

Shyam
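Shyam's brick-root branching scheme can be made concrete with a small sketch; the `SplitBrick` helper and the `agg_*` directory names are hypothetical, chosen only to mirror the layout described above:

```python
import os
import tempfile

class SplitBrick:
    """Toy model of a brick split by logical branching: each aggregate the
    brick joins gets its own subdirectory at the brick root, and requests
    for an aggregate are routed under that parent rather than the root.
    No data movement is involved in a split."""

    def __init__(self, root: str):
        self.root = root
        self.aggregates = {}

    def split(self, agg_id: str) -> str:
        # a split just creates a new aggregate parent under the brick root
        path = os.path.join(self.root, f"agg_{agg_id}")
        os.makedirs(path, exist_ok=True)
        self.aggregates[agg_id] = path
        return path

    def translate(self, agg_id: str, rel_path: str) -> str:
        # an xlator would send requests into the aggregate parent, not the root
        return os.path.join(self.aggregates[agg_id], rel_path.lstrip("/"))

brick = SplitBrick(tempfile.mkdtemp())
brick.split("1_ID")
brick.split("2_ID")
path = brick.translate("1_ID", "/dir/file")
```

The sketch also makes Shyam's two open issues visible: a filesystem-level snapshot of `root` necessarily captures every aggregate, and nothing here separates the per-aggregate `.glusterfs` metadata.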
Re: [Gluster-devel] Data classification proposal
On 26/06/2014, at 4:54 PM, Dan Lambright wrote:
> Implementing brick splitting using LVM would allow you to treat each logical
> volume (split) as an independent brick. Each split would have its own
> .glusterfs subdirectory. I think this would help with taking snapshots as
> well.

Would brick splitting make directory browsing latency even scarier?

+ Justin
Re: [Gluster-devel] Data classification proposal
>>> For the short-term, wouldn't it be OK to disallow adding bricks that
>>> are not a multiple of group-size?
>>
>> In the *very* short term, yes. However, I think that will quickly
>> become an issue for users who try to deploy erasure coding, because
>> those group sizes will be quite large. As soon as we implement tiering,
>> our very next task - perhaps even before tiering gets into a release -
>> should be to implement automatic brick splitting. That will bring other
>> benefits as well, such as variable replication levels to handle the
>> sanlock case, or overlapping replica sets to spread a failed brick's
>> load over more peers.
>
> OK. Do you have some initial ideas on how we could 'split' bricks? I ask
> this to see if I can work on splitting bricks while the data
> classification format is being ironed out.

I see split bricks as creating a logical space for the new aggregate that the brick belongs to. This may not need data movement etc., but just a logical branching at the root of the brick for its membership. Are there counter-examples to this? Unless this changes the weighting of the brick across its aggregates - for example, size-based weighting for layout assignments, if we are considering schemes of that nature.

So I can see this as follows:

THE_Brick: /data/bricka
Belongs to: aggregate 1 and aggregate 2, so it gets the following structure beneath it:

/data/bricka/agg_1_ID/
/data/bricka/agg_2_ID/

Future splits of the brick add more aggregate-ID parents (not stating where or what this ID is, but assume it is something that distinguishes aggregates), and I would expect the xlator to send requests into its aggregate parent and not the root.

One issue I see with this is that if we wanted to snapshot an aggregate, we would have to snapshot the entire brick. Another is how we distinguish the .glusterfs space across the aggregates.
Shyam
Re: [Gluster-devel] Data classification proposal
On Wednesday 25 June 2014 11:42:10 Jeff Darcy wrote:
>> How will space be allocated to each new sub-brick? Some sort of
>> thin-provisioning, or will it be distributed evenly on each split?
>
> That's left to the user. The latest proposal, based on discussion of
> the first, is here:
>
> https://docs.google.com/presentation/d/1e8tuh9DKNi9eCMrdt5vetppn1D3BiJSmfR7lDW2wRvA/edit?usp=sharing

Thanks. I didn't know about that document.

> That has an example of assigning percentages to the sub-bricks created
> by a rule (i.e. a subvolume in a potentially multi-tiered
> configuration). Other possibilities include relative weights used to
> determine percentages, or total thin provisioning where sub-bricks
> compete freely for available space. It's certainly a fruitful area for
> discussion.
>
>> If using thin-provisioning, it will be hard to determine the real
>> available space. If using a fixed amount, we can get into scenarios
>> where a file cannot be written even if there seems to be enough free
>> space. This can already happen today when using very big files on
>> almost full bricks. I think brick splitting can accentuate this.
>
> Is this really common outside of test environments, given the sizes of
> modern disks and files? Even in cases where it might happen, doesn't
> striping address it?

Considering that SSD sizes are still relatively small and that each brick can be split many times depending on data classification rules, I don't see it as a rare case for some scenarios. Striping can solve the problem at the expense of increasing fault probability, requiring more SSDs to compensate.

> We have a whole bunch of problems in this area. If multiple bricks are
> on the same local file system, their capacity will be double-counted.
> If a second local file system is mounted over part of a brick, the
> additional space won't be counted at all. We do need a general solution
> to this, but I don't think that solution needs to be part of data
> classification unless there's a specific real-world scenario that DC
> makes worse.

Agreed. This is a problem that should be solved independently of data classification.

>> Also, the addition of multiple layered DHT translators, as it's
>> implemented today, could add a lot more latency, especially on
>> directory listings.
>
> With http://review.gluster.org/#/c/7702/ this should be less of a
> problem.

This solves one of the problems. Directory listing is still one of the worst problems I've found with Gluster, and I think it's not solved by this patch.

> Also, lookups across multiple tiers are likely to be rare in
> most use cases. For example, for the name-based filtering (sanlock)
> case, a given file should only *ever* be in one tier, so only that tier
> would need to be searched. For the activity-based tiering case, the
> vast majority of lookups will be for hot files, which are (not
> accidentally) in the first tier.

I think this is true as long as the rules are not modified. But if we allow rules to be modified dynamically once the volume is already running, we will have the same problem as with rebalance, since there will be files not residing in the right tier for some time, and we need to find them nonetheless. This could be alleviated using something similar to the previous patch once the volume reaches a steady state again, though.

> The only real problem is with *failed*
> lookups, e.g. during create. We can address that by adding "stubs"
> (similar to linkfiles) in the upper tier, but I'd still want to wait
> until it's proven necessary. What I would truly resist is any solution
> that involves building tier awareness directly into (one instance of)
> DHT. Besides requiring a much larger development effort in the present,
> it would throw away the benefit of modularity and hamper other efforts
> in the future. We need tiering and brick splitting *now*, especially as
> a complement to erasure coding, which many won't be able to use
> otherwise. As far as I can tell, stacking translators is the fastest
> way to get there.

I agree that it's not good to create specific solutions for a problem when it's possible to make a more generic solution that could be used to add more features. However, I'm not so sure that brick splitting is the best solution. Basically we need to solve two problems right now: tiering and growing a volume brick by brick. Brick splitting is one way to implement that, but I don't think it's the only one.

>> Another problem I see is that splitting bricks will require a
>> rebalance, which is a costly operation. It doesn't seem right to
>> require such an expensive operation every time you add a new condition
>> on an already created volume.
>
> Yes, rebalancing is expensive, but that's no different for split bricks
> than whole ones. Any time you change the definition of what should go
> where, you'll have to move some data into compliance, and that's
> expensive. However, such operations are likely to be very rare.
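The fixed-percentage (or relative-weight) allocation discussed above is essentially integer division by weight. A sketch, with a remainder rule to avoid losing capacity; the helper name and API are invented, not anything GlusterFS exposes:

```python
def split_capacity(total_bytes: int, weights: dict) -> dict:
    """Divide a brick's capacity among sub-bricks according to relative
    weights, giving any integer-division remainder to the last sub-brick
    so no capacity is lost."""
    total_weight = sum(weights.values())
    names = list(weights)
    shares = {}
    allocated = 0
    for name in names[:-1]:
        shares[name] = total_bytes * weights[name] // total_weight
        allocated += shares[name]
    shares[names[-1]] = total_bytes - allocated
    return shares

# 25% of the brick to the fast sub-brick, 75% to the slow one
shares = split_capacity(1_000_000, {"tier-1": 1, "tier-2": 3})
```

The thin-provisioning alternative would simply skip this step and let all sub-bricks report the same free space, which is exactly the accounting ambiguity Xavi raises.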
Re: [Gluster-devel] Data classification proposal
> If I understand correctly the proposed data-classification
> architecture, each server will have a number of bricks that will be
> dynamically modified as needed: as more data-classifying conditions
> are defined, a new layer of translators will be added (a new DHT or
> AFR, or something else) and some or all existing bricks will be split
> to accommodate the new and, maybe, overlapping condition.

Correct.

> How will space be allocated to each new sub-brick? Some sort of
> thin-provisioning, or will it be distributed evenly on each split?

That's left to the user. The latest proposal, based on discussion of the first, is here:

https://docs.google.com/presentation/d/1e8tuh9DKNi9eCMrdt5vetppn1D3BiJSmfR7lDW2wRvA/edit?usp=sharing

That has an example of assigning percentages to the sub-bricks created by a rule (i.e. a subvolume in a potentially multi-tiered configuration). Other possibilities include relative weights used to determine percentages, or total thin provisioning where sub-bricks compete freely for available space. It's certainly a fruitful area for discussion.

> If using thin-provisioning, it will be hard to determine the real
> available space. If using a fixed amount, we can get into scenarios
> where a file cannot be written even if there seems to be enough free
> space. This can already happen today when using very big files on
> almost full bricks. I think brick splitting can accentuate this.

Is this really common outside of test environments, given the sizes of modern disks and files? Even in cases where it might happen, doesn't striping address it?

We have a whole bunch of problems in this area. If multiple bricks are on the same local file system, their capacity will be double-counted. If a second local file system is mounted over part of a brick, the additional space won't be counted at all. We do need a general solution to this, but I don't think that solution needs to be part of data classification unless there's a specific real-world scenario that DC makes worse.

> Also, the addition of multiple layered DHT translators, as it's
> implemented today, could add a lot more latency, especially on
> directory listings.

With http://review.gluster.org/#/c/7702/ this should be less of a problem. Also, lookups across multiple tiers are likely to be rare in most use cases. For example, for the name-based filtering (sanlock) case, a given file should only *ever* be in one tier, so only that tier would need to be searched. For the activity-based tiering case, the vast majority of lookups will be for hot files, which are (not accidentally) in the first tier. The only real problem is with *failed* lookups, e.g. during create. We can address that by adding "stubs" (similar to linkfiles) in the upper tier, but I'd still want to wait until it's proven necessary. What I would truly resist is any solution that involves building tier awareness directly into (one instance of) DHT. Besides requiring a much larger development effort in the present, it would throw away the benefit of modularity and hamper other efforts in the future. We need tiering and brick splitting *now*, especially as a complement to erasure coding, which many won't be able to use otherwise. As far as I can tell, stacking translators is the fastest way to get there.

> Another problem I see is that splitting bricks will require a
> rebalance, which is a costly operation. It doesn't seem right to
> require such an expensive operation every time you add a new condition
> on an already created volume.

Yes, rebalancing is expensive, but that's no different for split bricks than whole ones. Any time you change the definition of what should go where, you'll have to move some data into compliance, and that's expensive. However, such operations are likely to be very rare. It's highly likely that most uses of this feature will consist of a simple two-tier setup defined when the volume is created and never changed thereafter, so the only rebalancing would be within a tier - i.e. the exact same thing we do today in homogeneous volumes (maybe even slightly better). The only use case I can think of that would involve *frequent* tier-config changes is multi-tenancy, but adding a new tenant should only affect new data and not require migration of old data.
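The "stub" idea above - leaving a marker in the upper tier so a demoted file's lookup falls through cleanly instead of failing and retrying - could be modelled like this. Purely illustrative; this is not how the tiering xlator is implemented:

```python
class TieredLookup:
    """Toy two-tier namespace: lookups try the fast tier first; 'stubs'
    (akin to DHT linkfiles) record in the upper tier that a file was
    demoted, so the fall-through to the slow tier is a known hit rather
    than a search of every tier."""

    def __init__(self):
        self.fast = {}       # name -> data
        self.slow = {}
        self.stubs = set()   # names known to live in the slow tier

    def demote(self, name: str) -> None:
        self.slow[name] = self.fast.pop(name)
        self.stubs.add(name)  # leave a stub behind in the upper tier

    def lookup(self, name: str):
        if name in self.fast:
            return ("fast", self.fast[name])
        if name in self.stubs:
            return ("slow", self.slow[name])
        # no entry and no stub: fail fast without scanning lower tiers
        raise FileNotFoundError(name)

tiers = TieredLookup()
tiers.fast["hot.dat"] = b"x"
tiers.demote("hot.dat")
```

Note how the failed-lookup case (e.g. during create) is the only one that gains nothing from the hot-files-are-in-tier-1 assumption, which is exactly why the stubs are proposed.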
Re: [Gluster-devel] Data classification proposal
On Wednesday 25 June 2014 08:35:05 Jeff Darcy wrote:
>> For the short-term, wouldn't it be OK to disallow adding bricks that
>> are not a multiple of group-size?
>
> In the *very* short term, yes. However, I think that will quickly
> become an issue for users who try to deploy erasure coding, because
> those group sizes will be quite large. As soon as we implement tiering,
> our very next task - perhaps even before tiering gets into a release -
> should be to implement automatic brick splitting. That will bring other
> benefits as well, such as variable replication levels to handle the
> sanlock case, or overlapping replica sets to spread a failed brick's
> load over more peers.

If I understand correctly the proposed data-classification architecture, each server will have a number of bricks that will be dynamically modified as needed: as more data-classifying conditions are defined, a new layer of translators will be added (a new DHT or AFR, or something else) and some or all existing bricks will be split to accommodate the new and, maybe, overlapping condition.

How will space be allocated to each new sub-brick? Some sort of thin-provisioning, or will it be distributed evenly on each split?

If using thin-provisioning, it will be hard to determine the real available space. If using a fixed amount, we can get into scenarios where a file cannot be written even if there seems to be enough free space. This can already happen today when using very big files on almost full bricks. I think brick splitting can accentuate this.

Also, the addition of multiple layered DHT translators, as it's implemented today, could add a lot more latency, especially on directory listings.

Another problem I see is that splitting bricks will require a rebalance, which is a costly operation. It doesn't seem right to require such an expensive operation every time you add a new condition on an already created volume.

Maybe I've missed something important?

Thanks,
Xavi
Re: [Gluster-devel] Data classification proposal
----- Original Message -----
>> For the short-term, wouldn't it be OK to disallow adding bricks that
>> are not a multiple of group-size?
>
> In the *very* short term, yes. However, I think that will quickly
> become an issue for users who try to deploy erasure coding, because
> those group sizes will be quite large. As soon as we implement tiering,
> our very next task - perhaps even before tiering gets into a release -
> should be to implement automatic brick splitting. That will bring other
> benefits as well, such as variable replication levels to handle the
> sanlock case, or overlapping replica sets to spread a failed brick's
> load over more peers.

OK. Do you have some initial ideas on how we could 'split' bricks? I ask this to see if I can work on splitting bricks while the data classification format is being ironed out.

thanks,
Krish
Re: [Gluster-devel] Data classification proposal
> For the short-term, wouldn't it be OK to disallow adding bricks that
> are not a multiple of group-size?

In the *very* short term, yes. However, I think that will quickly become an issue for users who try to deploy erasure coding, because those group sizes will be quite large. As soon as we implement tiering, our very next task - perhaps even before tiering gets into a release - should be to implement automatic brick splitting. That will bring other benefits as well, such as variable replication levels to handle the sanlock case, or overlapping replica sets to spread a failed brick's load over more peers.
Re: [Gluster-devel] Data classification proposal
Jeff,

----- Original Message -----
>> Am I right if I understood that the value for media-type is not
>> interpreted beyond the scope of matching rules? That is to say, we
>> don't need/have any notion of media-types that type-check internally
>> for forming (sub)volumes using the rules specified.
>
> Exactly. To us it's just an opaque ID.

OK. That makes sense.

>> Should the no. of bricks or lower-level subvolumes that match the rule
>> be an exact multiple of group-size?
>
> Good question. I think users see the current requirement to add bricks
> in multiples of the replica/stripe size as an annoyance. This will only
> get worse with erasure coding, where the group size is larger. On the
> other hand, we do need to make sure that members of a group are on
> different machines. This is why I think we need to be able to split
> bricks, so that we can use overlapping replica/erasure sets. For
> example, if we have five bricks and two-way replication, we can split
> bricks to get a multiple of two and life's good again. So *long term* I
> think we can/should remove any restriction on users, but there are a
> whole bunch of unsolved issues around brick splitting. I'm not sure
> what to do in the short term.

For the short-term, wouldn't it be OK to disallow adding bricks that are not a multiple of group-size?

>>> Here's a more complex example that adds replication and erasure
>>> coding to the mix.
>>>
>>> # Assume 20 hosts, four fast and sixteen slow (named appropriately).
>>>
>>> rule tier-1
>>>     select *fast*
>>>     group-size 2
>>>     type cluster/afr
>>>
>>> rule tier-2
>>>     # special pattern matching otherwise-unused bricks
>>>     select %{unclaimed}
>>>     group-size 8
>>>     type cluster/ec parity=2
>>>     # i.e. two groups, each six data plus two parity
>>>
>>> rule all
>>>     select tier-1
>>>     select tier-2
>>>     type features/tiering

>> In the above example we would have 2 subvolumes, each containing 2
>> bricks, that would be aggregated by rule tier-1. Let's call those
>> subvolumes tier-1-fast-0 and tier-1-fast-1. Both of these subvolumes
>> are AFR-based two-way replicated subvolumes. Are these instances of
>> tier-1-* composed using cluster/dht by the default semantics?
>
> Yes. Any time we have multiple subvolumes and no other specified way to
> combine them into one, we just slap DHT on top. We do this already at
> the top level; with data classification we might do it at lower levels
> too.

thanks,
Krish
Re: [Gluster-devel] Data classification proposal
> It's possible to express your example using lists if their entries are
> allowed to overlap. I see that you wanted a way to express a matrix
> (overlapping rules) with gluster's tree-like syntax as backdrop.
>
> A polytree may be a better term than matrix (DAG without cycles), i.e.
> when there are overlaps a node in the graph gets multiple in-arcs.
>
> Syntax aside, we seem to part on "where" to solve the problem - config
> file or UX. I prefer the UX have the logic to build the configuration
> file, given how complex it can be. My preference would be for the
> config file to be mostly "read only" with extremely simple syntax.
>
> I'll put some more thought into this and believe this discussion has
> illuminated some good points.
>
>     Brick: host1:/SSD1 SSD1
>     Brick: host1:/SSD2 SSD2
>     Brick: host2:/SSD3 SSD3
>     Brick: host2:/SSD4 SSD4
>     Brick: host1:/DISK1 DISK1
>
>     rule rack4:
>         select SSD1, SSD2, DISK1
>
>     # some files should go on ssds in rack 4
>     rule A:
>         option filter-condition *.lock
>         select SSD1, SSD2
>
>     # some files should go on ssds anywhere
>     rule B:
>         option filter-condition *.out
>         select SSD1, SSD2, SSD3, SSD4
>
>     # some files should go anywhere in rack 4
>     rule C
>         option filter-condition *.c
>         select rack4
>
>     # some files we just don't care
>     rule D
>         option filter-condition *.h
>         select SSD1, SSD2, SSD3, SSD4, DISK1
>
>     volume:
>         option filter-condition A,B,C,D

This seems to leave us with two options. One option is that "select"
supports only explicit enumeration, so that adding a brick means editing
multiple rules that apply to it. The other option is that "select"
supports wildcards. Using a regex to match parts of a name is
effectively the same as matching the explicit tags we started with,
except that expressing complex Boolean conditions using a regex can get
more than a bit messy. As Jamie Zawinski famously said:

> Some people, when confronted with a problem, think "I know, I'll use
> regular expressions." Now they have two problems.
I think it's nice to support regexes instead of plain strings in
lower-level rules, but relying on them alone to express complex
higher-level policies would IMO be a mistake.

Likewise, defining a proper syntax for a config file seems both more
flexible and easier than defining one for a CLI, where the parsing
options are even more limited. What happens when someone wants to use
Puppet (for example) to set this up? Then the user would express their
will in Puppet syntax, which would have to convert it to our CLI syntax,
which would convert it to our config-file syntax. Why not allow them to
skip a step where information might get lost or mangled in translation?
We can still have CLI commands to do the most common kinds of
manipulation, as we do for volfiles, but the final form can be more
extensible. It will still be more comprehensible than Ceph's CRUSH maps.
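The tags-versus-regex point can be illustrated with a small hypothetical
sketch (brick names and keys invented here, not part of the proposal):
selectors written as predicates over key/value tags compose with ordinary
boolean operators, where a single regex over brick names gets messy fast.

```python
# Invented inventory: each brick carries key/value tags.
bricks = {
    "host1:/b1": {"media-type": "ssd",  "rack": "4"},
    "host1:/b2": {"media-type": "disk", "rack": "4"},
    "host2:/b1": {"media-type": "ssd",  "rack": "5"},
}

def select(bricks, pred):
    # Return the names of bricks whose tags satisfy the predicate.
    return sorted(name for name, tags in bricks.items() if pred(tags))

# "ssd AND rack 4" -- awkward as one regex over names, trivial as tags:
fast_rack4 = select(bricks, lambda t: t["media-type"] == "ssd"
                                      and t["rack"] == "4")
```

Negation and disjunction ("ssd OR rack 4, but NOT host1") stay equally
readable in this form, which is the substance of the argument above.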
Re: [Gluster-devel] Data classification proposal
It's possible to express your example using lists if their entries are
allowed to overlap. I see that you wanted a way to express a matrix
(overlapping rules) with gluster's tree-like syntax as backdrop.

A polytree may be a better term than matrix (DAG without cycles), i.e.
when there are overlaps a node in the graph gets multiple in-arcs.

Syntax aside, we seem to part on "where" to solve the problem - config
file or UX. I prefer the UX have the logic to build the configuration
file, given how complex it can be. My preference would be for the config
file to be mostly "read only" with extremely simple syntax.

I'll put some more thought into this and believe this discussion has
illuminated some good points.

    Brick: host1:/SSD1 SSD1
    Brick: host1:/SSD2 SSD2
    Brick: host2:/SSD3 SSD3
    Brick: host2:/SSD4 SSD4
    Brick: host1:/DISK1 DISK1

    rule rack4:
        select SSD1, SSD2, DISK1

    # some files should go on ssds in rack 4
    rule A:
        option filter-condition *.lock
        select SSD1, SSD2

    # some files should go on ssds anywhere
    rule B:
        option filter-condition *.out
        select SSD1, SSD2, SSD3, SSD4

    # some files should go anywhere in rack 4
    rule C
        option filter-condition *.c
        select rack4

    # some files we just don't care
    rule D
        option filter-condition *.h
        select SSD1, SSD2, SSD3, SSD4, DISK1

    volume:
        option filter-condition A,B,C,D

----- Original Message -----
From: "Jeff Darcy"
To: "Dan Lambright"
Cc: "Gluster Devel"
Sent: Monday, June 23, 2014 7:11:44 PM
Subject: Re: [Gluster-devel] Data classification proposal

> Rather than using the keyword "unclaimed", my instinct was to
> explicitly list which bricks have not been "claimed". Perhaps you have
> something more subtle in mind; it is not apparent to me from your
> response. Can you provide an example of why it is necessary and a list
> could not be provided in its place? If the list is somehow "difficult
> to figure out", due to a particularly complex setup or some such, I'd
> prefer that a CLI/GUI build that list rather than having sysadmins
> hand-edit this file.
It's not *difficult* to make sure every brick has been enumerated by
some rule, and that there are no overlaps, but it's certainly tedious
and error prone. Imagine that a user has bricks in four machines, using
names like serv1-b1, serv1-b2, ..., serv4-b6. Accordingly, they've set
up rules to put serv1* into one set and serv[234]* into another set
(which is already more flexibility than I think your proposal gave
them). Now when they add serv5 they need an extra step to add it to the
tiering config, which wouldn't have been necessary if we supported
defaults. What percentage of users would forget that step at least once?
I don't know for sure, but I'd guess it's pretty high.

Having a CLI or GUI create configs just means that we have to add
support for defaults there instead. We'd still have to implement the
same logic, they'd still have to specify the same thing. That just seems
like moving the problem around instead of solving it.

> The key-value piece seems like syntactic sugar - an "alias". If so,
> let the name itself be the alias. No notions of SSD or physical
> location need be inserted. Unless I am missing that it *is* necessary,
> I stand by that value judgement as a philosophy of not putting
> anything into the configuration file that you don't require. Can you
> provide an example of where it is necessary?

OK...

    Brick: SSD1
    Brick: SSD2
    Brick: SSD3
    Brick: SSD4
    Brick: DISK1

    rack4: SSD1, SSD2, DISK1

    filter A: SSD1, SSD2
    filter B: SSD1, SSD2, SSD3, SSD4
    filter C: rack4
    filter D: SSD1, SSD2, SSD3, SSD4, DISK1

    meta-filter: filter A, filter B, filter C, filter D

* some files should go on ssds in rack 4
* some files should go on ssds anywhere
* some files should go anywhere in rack 4
* some files we just don't care

Notice how the rules *overlap*. We can't support that if our syntax only
allows the user to express a list (or list of lists). If the list is
ordered by type, we can't also support location-based rules.
If the list is ordered by location, we lose type-based rules instead.
Brick properties create a matrix, with an unknown number of dimensions
(e.g. security level, tenant ID, and so on as well as type and
location). The logical way to represent such a space for rule-matching
purposes is to let users define however many dimensions (keys) as they
want and as many values for each dimension as they want. Whether the
exact string "type" or "unclaimed" appears anywhere isn't the issue.
What matters is that the *semantics* of assigning properties to a brick
have to be more sophisticated than just assigning each a position in a
list, and we need a syntax that supports those semantics. Otherwise
we'll end up solving the same UX problems again and again each time we
add a feature that involves treating bricks or data differently.
Re: [Gluster-devel] Data classification proposal
> Am I right if I understood that the value for media-type is not
> interpreted beyond the scope of matching rules? That is to say, we
> don't need/have any notion of media-types that type check internally
> for forming (sub)volumes using the rules specified.

Exactly. To us it's just an opaque ID.

> Should the no. of bricks or lower-level subvolumes that match the rule
> be an exact multiple of group-size?

Good question. I think users see the current requirement to add bricks
in multiples of the replica/stripe size as an annoyance. This will only
get worse with erasure coding, where the group size is larger. On the
other hand, we do need to make sure that members of a group are on
different machines. This is why I think we need to be able to split
bricks, so that we can use overlapping replica/erasure sets. For
example, if we have five bricks and two-way replication, we can split
bricks to get a multiple of two and life's good again. So *long term* I
think we can/should remove any restriction on users, but there are a
whole bunch of unsolved issues around brick splitting. I'm not sure
what to do in the short term.

> > Here's a more complex example that adds replication and erasure
> > coding to the mix.
> >
> >     # Assume 20 hosts, four fast and sixteen slow (named
> >     # appropriately).
> >
> >     rule tier-1
> >         select *fast*
> >         group-size 2
> >         type cluster/afr
> >
> >     rule tier-2
> >         # special pattern matching otherwise-unused bricks
> >         select %{unclaimed}
> >         group-size 8
> >         type cluster/ec parity=2
> >         # i.e. two groups, each six data plus two parity
> >
> >     rule all
> >         select tier-1
> >         select tier-2
> >         type features/tiering
>
> In the above example we would have 2 subvolumes each containing 2
> bricks that would be aggregated by rule tier-1. Let's call those
> subvolumes tier-1-fast-0 and tier-1-fast-1. Both of these subvolumes
> are AFR-based two-way replicated subvolumes. Are these instances of
> tier-1-* composed using cluster/dht by the default semantics?

Yes. Any time we have multiple subvolumes and no other specified way to
combine them into one, we just slap DHT on top. We do this already at
the top level; with data classification we might do it at lower levels
too.
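The "slap DHT on top" default can be sketched as a tiny hypothetical
graph-building step (names are illustrative, not GlusterFS internals):
several subvolumes with no explicit combining type get wrapped in
cluster/dht.

```python
def combine(subvols, explicit_type=None):
    # A single subvolume needs no combining translator at all.
    if len(subvols) == 1:
        return subvols[0]
    # Otherwise use the explicit type, defaulting to random placement.
    return {"type": explicit_type or "cluster/dht",
            "subvolumes": subvols}

# Two replica sets produced by a rule, no type given -> DHT on top.
top = combine(["tier-1-0", "tier-1-1"])
```

With an explicit type such as "features/tiering", the same step would
emit that translator instead of the DHT default.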
Re: [Gluster-devel] Data classification proposal
Jeff,

I have a few questions regarding the rules syntax and how they apply. I
think this is different in spirit from the discussion Dan has started,
so I'm keeping it separate. See questions inline.

----- Original Message -----
> One of the things holding up our data classification efforts (which
> include tiering but also other stuff as well) has been the extension
> of the same conceptual model from the I/O path to the configuration
> subsystem and ultimately to the user experience. How does an
> administrator define a tiering policy without tearing their hair out?
> How does s/he define a mixed replication/erasure-coding setup without
> wanting to rip *our* hair out? The included Markdown document attempts
> to remedy this by proposing one out of many possible models and user
> interfaces. It includes examples for some of the most common use
> cases, including the "replica 2.5" case we've been discussing
> recently. Constructive feedback would be greatly appreciated.
>
> # Data Classification Interface
>
> The data classification feature is extremely flexible, to cover use
> cases from SSD/disk tiering to rack-aware placement to security or
> other policies. With this flexibility comes complexity. While this
> complexity does not affect the I/O path much, it does affect both the
> volume-configuration subsystem and the user interface to set placement
> policies. This document describes one possible model and user
> interface.
>
> The model we used is based on two kinds of information: brick
> descriptions and aggregation rules. Both are contained in a
> configuration file (format TBD) which can be associated with a volume
> using a volume option.
>
> ## Brick Descriptions
>
> A brick is described by a series of simple key/value pairs. Predefined
> keys include:
>
> * **media-type**
>   The underlying media type for the brick. In its simplest form this
>   might just be *ssd* or *disk*. More sophisticated users might use
>   something like *15krpm* to represent a faster disk, or *perc-raid5*
>   to represent a brick backed by a RAID controller.

Am I right if I understood that the value for media-type is not
interpreted beyond the scope of matching rules? That is to say, we
don't need/have any notion of media-types that type check internally
for forming (sub)volumes using the rules specified.

> * **rack** (and/or **row**)
>   The physical location of the brick. Some policy rules might be set
>   up to spread data across more than one rack.
>
> User-defined keys are also allowed. For example, some users might use
> a *tenant* or *security-level* tag as the basis for their placement
> policy.
>
> ## Aggregation Rules
>
> Aggregation rules are used to define how bricks should be combined
> into subvolumes, and those potentially combined into higher-level
> subvolumes, and so on until all of the bricks are accounted for. Each
> aggregation rule consists of the following parts:
>
> * **id**
>   The base name of the subvolumes the rule will create. If a rule is
>   applied multiple times this will yield *id-0*, *id-1*, and so on.
>
> * **selector**
>   A "filter" for which bricks or lower-level subvolumes the rule will
>   aggregate. This is an expression similar to a *WHERE* clause in SQL,
>   using brick/subvolume names and properties in lieu of columns. These
>   values are then matched against literal values or regular
>   expressions, using the usual set of boolean operators to arrive at a
>   *yes* or *no* answer to the question of whether this brick/subvolume
>   is affected by this rule.
>
> * **group-size** (optional)
>   The number of original bricks/subvolumes to be combined into each
>   produced subvolume. The special default value zero means to collect
>   all original bricks or subvolumes into one final subvolume. In this
>   case, *id* is used directly instead of having a numeric suffix
>   appended.

Should the no. of bricks or lower-level subvolumes that match the rule
be an exact multiple of group-size?

> * **type** (optional)
>   The type of the generated translator definition(s). Examples might
>   include "AFR" to do replication, "EC" to do erasure coding, and so
>   on. The more general data classification task includes the
>   definition of new translators to do tiering and other kinds of
>   filtering, but those are beyond the scope of this document. If no
>   type is specified, cluster/dht will be used to do random placement
>   among its constituents.
>
> * **tag** and **option** (optional, repeatable)
>   Additional tags and/or options to be applied to each newly created
>   subvolume. See the "replica 2.5" example to see how this can be
>   used.
>
> Since each type might have unique requirements, such as ensuring that
> replication is done across machines or racks whenever possible, it is
> assumed that there will be corresponding type-specific scripts or
> functions to do the a
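The id/selector/group-size semantics quoted above can be sketched in a
few lines. This is a hypothetical illustration (function and brick names
invented): matched bricks are grouped `group_size` at a time into
subvolumes named id-0, id-1, ..., while group-size 0 collects everything
into a single subvolume named id directly.

```python
import fnmatch

def apply_rule(rule_id, selector, bricks, group_size=0):
    # Selector here is a simple glob, standing in for the richer
    # WHERE-clause-style expressions described in the proposal.
    matched = [b for b in bricks if fnmatch.fnmatch(b, selector)]
    if group_size == 0:
        # Default: one subvolume named `id`, holding everything matched.
        return {rule_id: matched}
    # Otherwise: chunks of group_size, named id-0, id-1, ...
    return {"%s-%d" % (rule_id, i): matched[i*group_size:(i+1)*group_size]
            for i in range(len(matched) // group_size)}

subs = apply_rule("tier-1", "*fast*",
                  ["fast1", "fast2", "fast3", "fast4", "slow1"],
                  group_size=2)
```

Note the sketch silently drops a remainder that doesn't fill a group,
which is exactly the "exact multiple of group-size?" question being
asked here.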
Re: [Gluster-devel] Data classification proposal
> Rather than using the keyword "unclaimed", my instinct was to
> explicitly list which bricks have not been "claimed". Perhaps you have
> something more subtle in mind; it is not apparent to me from your
> response. Can you provide an example of why it is necessary and a list
> could not be provided in its place? If the list is somehow "difficult
> to figure out", due to a particularly complex setup or some such, I'd
> prefer that a CLI/GUI build that list rather than having sysadmins
> hand-edit this file.

It's not *difficult* to make sure every brick has been enumerated by
some rule, and that there are no overlaps, but it's certainly tedious
and error prone. Imagine that a user has bricks in four machines, using
names like serv1-b1, serv1-b2, ..., serv4-b6. Accordingly, they've set
up rules to put serv1* into one set and serv[234]* into another set
(which is already more flexibility than I think your proposal gave
them). Now when they add serv5 they need an extra step to add it to the
tiering config, which wouldn't have been necessary if we supported
defaults. What percentage of users would forget that step at least once?
I don't know for sure, but I'd guess it's pretty high.

Having a CLI or GUI create configs just means that we have to add
support for defaults there instead. We'd still have to implement the
same logic, they'd still have to specify the same thing. That just seems
like moving the problem around instead of solving it.

> The key-value piece seems like syntactic sugar - an "alias". If so,
> let the name itself be the alias. No notions of SSD or physical
> location need be inserted. Unless I am missing that it *is* necessary,
> I stand by that value judgement as a philosophy of not putting
> anything into the configuration file that you don't require. Can you
> provide an example of where it is necessary?

OK...
* some files should go on ssds in rack 4
* some files should go on ssds anywhere
* some files should go anywhere in rack 4
* some files we just don't care

Notice how the rules *overlap*. We can't support that if our syntax only
allows the user to express a list (or list of lists). If the list is
ordered by type, we can't also support location-based rules. If the list
is ordered by location, we lose type-based rules instead.

Brick properties create a matrix, with an unknown number of dimensions
(e.g. security level, tenant ID, and so on, as well as type and
location). The logical way to represent such a space for rule-matching
purposes is to let users define however many dimensions (keys) as they
want and as many values for each dimension as they want. Whether the
exact string "type" or "unclaimed" appears anywhere isn't the issue.
What matters is that the *semantics* of assigning properties to a brick
have to be more sophisticated than just assigning each a position in a
list, and we need a syntax that supports those semantics. Otherwise
we'll end up solving the same UX problems again and again each time we
add a feature that involves treating bricks or data differently. Each
time we'll probably do it a little differently and confuse users a
little more, if history is any guide. That's what I'd rather avoid.
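The matrix argument can be made concrete with a hypothetical sketch
(property names invented for illustration): rules match along
independent dimensions and may overlap, so one brick can belong to
several placement groups at once, which no single ordered list can
express.

```python
bricks = {
    "SSD1":  {"type": "ssd",  "rack": 4},
    "SSD2":  {"type": "ssd",  "rack": 4},
    "SSD3":  {"type": "ssd",  "rack": 5},
    "DISK1": {"type": "disk", "rack": 4},
}

rules = {
    "A": lambda t: t["type"] == "ssd" and t["rack"] == 4,  # ssds in rack 4
    "B": lambda t: t["type"] == "ssd",                     # ssds anywhere
    "C": lambda t: t["rack"] == 4,                         # anywhere in rack 4
    "D": lambda t: True,                                   # don't care
}

# Which bricks does each rule claim? Overlap is the point: SSD1
# satisfies all four rules simultaneously.
matches = {r: sorted(b for b, tags in bricks.items() if pred(tags))
           for r, pred in rules.items()}
```

Adding a new dimension (tenant, security level) is just another key on
the brick; no rule has to be renumbered or reordered.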
Re: [Gluster-devel] Data classification proposal
Rather than using the keyword "unclaimed", my instinct was to explicitly
list which bricks have not been "claimed". Perhaps you have something
more subtle in mind; it is not apparent to me from your response. Can
you provide an example of why it is necessary and a list could not be
provided in its place? If the list is somehow "difficult to figure out",
due to a particularly complex setup or some such, I'd prefer that a
CLI/GUI build that list rather than having sysadmins hand-edit this
file.

The key-value piece seems like syntactic sugar - an "alias". If so, let
the name itself be the alias. No notions of SSD or physical location
need be inserted. Unless I am missing that it *is* necessary, I stand by
that value judgement as a philosophy of not putting anything into the
configuration file that you don't require. Can you provide an example of
where it is necessary?

As to your point on filtering (which files go into which tier/group): I
wrote a little further in the email that I do not see a way around
regular expressions within the filter-condition keyword. My
understanding of your proposal is that the select statement did not do
file name filtering; the "filter-condition" option did. I'm OK with
that.

As far as the "user stories" idea goes, that seems like a good next
step.

----- Original Message -----
From: "Jeff Darcy"
To: "Dan Lambright"
Cc: "Gluster Devel"
Sent: Monday, June 23, 2014 5:24:14 PM
Subject: Re: [Gluster-devel] Data classification proposal

> A frustrating aspect of Linux is the complexity of /etc configuration
> files' formats (rsyslog.conf, logrotate, cron, yum repo files, etc).
> In that spirit I would simplify the "select" in the data
> classification proposal (copied below) to only accept a list of
> bricks/sub-tiers with wild-cards '*', rather than full-blown regular
> expressions or key/value pairs.

Then how does *the user* specify which files should go into which
tier/group?
If we don't let them specify that in configuration, then it can only be
done in code and we've taken a choice away from them.

> I would drop the "unclaimed" keyword

Then how do you specify any kind of default rule for files not matched
elsewhere? If certain files can be placed only in certain locations due
to security or compliance considerations, how would they specify the
location(s) for files not subject to any such limitation?

> and not have keywords "media type", and "rack". It does not seem
> necessary to introduce new keys for the underlying block device type
> (SSD vs disk) any more than we need to express the filesystem (XFS vs
> ext4).

The idea is to let users specify whatever criteria matter *to them*;
media type and rack/row are just examples to get them started.

> In other words, I think tiering can be fully expressed in the
> configuration file while still abstracting the underlying storage.

Yes, *tiering* can be expressed using a simpler syntax. I was trying for
something that could also support placement policies other than strict
linear "above" vs. "below" with only the migration policies we've
written into code.

> That said, the configuration file could be built up by a CLI or GUI,
> and richer expressibility could exist at that level.
>
> example:
>
>     brick host1:/brick ssd-group0-1
>     brick host2:/brick ssd-group0-2
>     brick host3:/brick disk-group0-1
>
>     rule tier-1
>         select ssd-group0*
>
>     rule tier-2
>         select disk-group0
>
>     rule all
>         select tier-1
>         # use repeated "select" to establish order
>         select tier-2
>         type features/tiering
>
> The filtering option's regular expressions seem hard to avoid. If just
> the name of the file satisfies most use cases (that we know of?) I do
> not think there is any way to avoid regular expressions in the option
> for filters. (Down the road, if we were to allow complete flexibility
> in how files can be distributed across subvolumes, the filtering
> problems may start to look similar to 90s-era packet classification,
> with a solution along the lines of the Berkeley packet filter.)
>
> There may be different rules by which data is distributed at the
> "tiering" level. For example, one tiering policy could be the fast
> tier (first listed). It would be a "cache" for the slow tier (second
> listed). I think the "option" keyword could handle that.
>
>     rule all
>         select tier-1
>         # use repeated "select" to establish order
>         select tier-2
>         type features/tiering
>         option tier-cache, mode=writeback, dirty-watermark=80
>
> Another example
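The name-based filter-conditions under discussion (the "*.lock" style
patterns) amount to ordered first-match routing. A hypothetical sketch
(tier names invented, glob matching standing in for the regexes debated
above):

```python
import fnmatch

# Ordered (pattern, tier) pairs; the first glob that matches wins.
filters = [("*.lock", "tier-ssd"),
           ("*.out",  "tier-ssd"),
           ("*",      "tier-disk")]   # catch-all default

def route(path, filters):
    for pattern, tier in filters:
        if fnmatch.fnmatch(path, pattern):
            return tier
```

Whether the catch-all "*" entry is written explicitly or provided by an
"unclaimed"-style default is exactly the disagreement in this
subthread.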
Re: [Gluster-devel] Data classification proposal
> A frustrating aspect of Linux is the complexity of /etc configuration
> files' formats (rsyslog.conf, logrotate, cron, yum repo files, etc).
> In that spirit I would simplify the "select" in the data
> classification proposal (copied below) to only accept a list of
> bricks/sub-tiers with wild-cards '*', rather than full-blown regular
> expressions or key/value pairs.

Then how does *the user* specify which files should go into which
tier/group? If we don't let them specify that in configuration, then it
can only be done in code and we've taken a choice away from them.

> I would drop the "unclaimed" keyword

Then how do you specify any kind of default rule for files not matched
elsewhere? If certain files can be placed only in certain locations due
to security or compliance considerations, how would they specify the
location(s) for files not subject to any such limitation?

> and not have keywords "media type", and "rack". It does not seem
> necessary to introduce new keys for the underlying block device type
> (SSD vs disk) any more than we need to express the filesystem (XFS vs
> ext4).

The idea is to let users specify whatever criteria matter *to them*;
media type and rack/row are just examples to get them started.

> In other words, I think tiering can be fully expressed in the
> configuration file while still abstracting the underlying storage.

Yes, *tiering* can be expressed using a simpler syntax. I was trying for
something that could also support placement policies other than strict
linear "above" vs. "below" with only the migration policies we've
written into code.

> That said, the configuration file could be built up by a CLI or GUI,
> and richer expressibility could exist at that level.
>
> example:
>
>     brick host1:/brick ssd-group0-1
>     brick host2:/brick ssd-group0-2
>     brick host3:/brick disk-group0-1
>
>     rule tier-1
>         select ssd-group0*
>
>     rule tier-2
>         select disk-group0
>
>     rule all
>         select tier-1
>         # use repeated "select" to establish order
>         select tier-2
>         type features/tiering
>
> The filtering option's regular expressions seem hard to avoid. If just
> the name of the file satisfies most use cases (that we know of?) I do
> not think there is any way to avoid regular expressions in the option
> for filters. (Down the road, if we were to allow complete flexibility
> in how files can be distributed across subvolumes, the filtering
> problems may start to look similar to 90s-era packet classification,
> with a solution along the lines of the Berkeley packet filter.)
>
> There may be different rules by which data is distributed at the
> "tiering" level. For example, one tiering policy could be the fast
> tier (first listed). It would be a "cache" for the slow tier (second
> listed). I think the "option" keyword could handle that.
>
>     rule all
>         select tier-1
>         # use repeated "select" to establish order
>         select tier-2
>         type features/tiering
>         option tier-cache, mode=writeback, dirty-watermark=80
>
> Another example tiering policy could be based on compliance; when a
> file needs to become read-only, it moves from the first listed tier to
> the second.
>
>     rule all
>         select tier-1
>         # use repeated "select" to establish order
>         select tier-2
>         type features/tiering
>         option tier-retention

OK, good so far. How would you handle the "replica 2.5" sanlock case
with the simplified syntax? Or security-aware placement equivalent to
this?

    rule secure
        select brick-0-*
        option encryption on

    rule insecure
        select brick-1-*
        option encryption off

    rule all
        select secure
        select insecure
        type features/filter
        option filter-condition-1 security-level:high
        option filter-target-1 secure
        option default-subvol insecure

In true agile fashion, maybe we should compile a set of "user stories"
and treat those as test cases for any proposed syntax. That would need
to include at least:

* hot/cold tiering
* HIPAA/EUPD style compliance (file must *always* or *never* be in X)
* security-aware placement
* multi-tenancy
* sanlock case

I'm not trying to create complexity for its own sake. If there's a
simpler syntax that doesn't eliminate some of these cases in favor of
tiering and nothing else, that would be great.

> ----- Original Message -----
> From: "Jeff Darcy"
> To: "Gluster Devel"
> Sent: Friday, May 23, 2014 3:30:39 PM
> Subject: [Gluster-devel] Data classification proposal
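The features/filter semantics in the security example can be sketched
hypothetically (function and key names invented; the proposal only
defines the config syntax, not an implementation): file properties are
tested against the numbered filter-conditions in order, the matching
condition's filter-target wins, and default-subvol catches everything
else.

```python
def place(file_props, conditions, default):
    # conditions: ordered (predicate, target-subvolume) pairs, mirroring
    # filter-condition-N / filter-target-N; `default` mirrors
    # default-subvol.
    for pred, target in conditions:
        if pred(file_props):
            return target
    return default

conditions = [(lambda p: p.get("security-level") == "high", "secure")]
target = place({"security-level": "high"}, conditions, "insecure")
```

This also shows why dropping "unclaimed"/defaults hurts: without the
final `default` argument, files matching no condition would have
nowhere to go.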
Re: [Gluster-devel] Data classification proposal
A frustrating aspect of Linux is the complexity of /etc configuration
files' formats (rsyslog.conf, logrotate, cron, yum repo files, etc). In
that spirit I would simplify the "select" in the data classification
proposal (copied below) to only accept a list of bricks/sub-tiers with
wild-cards '*', rather than full-blown regular expressions or key/value
pairs. I would drop the "unclaimed" keyword, and not have keywords
"media type" and "rack". It does not seem necessary to introduce new
keys for the underlying block device type (SSD vs disk) any more than we
need to express the filesystem (XFS vs ext4). In other words, I think
tiering can be fully expressed in the configuration file while still
abstracting the underlying storage. That said, the configuration file
could be built up by a CLI or GUI, and richer expressibility could exist
at that level.

example:

    brick host1:/brick ssd-group0-1
    brick host2:/brick ssd-group0-2
    brick host3:/brick disk-group0-1

    rule tier-1
        select ssd-group0*

    rule tier-2
        select disk-group0

    rule all
        select tier-1
        # use repeated "select" to establish order
        select tier-2
        type features/tiering

The filtering option's regular expressions seem hard to avoid. If just
the name of the file satisfies most use cases (that we know of?), I do
not think there is any way to avoid regular expressions in the option
for filters. (Down the road, if we were to allow complete flexibility in
how files can be distributed across subvolumes, the filtering problems
may start to look similar to 90s-era packet classification, with a
solution along the lines of the Berkeley packet filter.)

There may be different rules by which data is distributed at the
"tiering" level. For example, one tiering policy could be the fast tier
(first listed). It would be a "cache" for the slow tier (second listed).
I think the "option" keyword could handle that.

    rule all
        select tier-1
        # use repeated "select" to establish order
        select tier-2
        type features/tiering
        option tier-cache, mode=writeback, dirty-watermark=80

Another example tiering policy could be based on compliance; when a file
needs to become read-only, it moves from the first listed tier to the
second.

    rule all
        select tier-1
        # use repeated "select" to establish order
        select tier-2
        type features/tiering
        option tier-retention

----- Original Message -----
From: "Jeff Darcy"
To: "Gluster Devel"
Sent: Friday, May 23, 2014 3:30:39 PM
Subject: [Gluster-devel] Data classification proposal

One of the things holding up our data classification efforts (which
include tiering but also other stuff as well) has been the extension of
the same conceptual model from the I/O path to the configuration
subsystem and ultimately to the user experience. How does an
administrator define a tiering policy without tearing their hair out?
How does s/he define a mixed replication/erasure-coding setup without
wanting to rip *our* hair out? The included Markdown document attempts
to remedy this by proposing one out of many possible models and user
interfaces. It includes examples for some of the most common use cases,
including the "replica 2.5" case we've been discussing recently.
Constructive feedback would be greatly appreciated.

# Data Classification Interface

The data classification feature is extremely flexible, to cover use
cases from SSD/disk tiering to rack-aware placement to security or other
policies. With this flexibility comes complexity. While this complexity
does not affect the I/O path much, it does affect both the
volume-configuration subsystem and the user interface to set placement
policies. This document describes one possible model and user interface.

The model we used is based on two kinds of information: brick
descriptions and aggregation rules. Both are contained in a
configuration file (format TBD) which can be associated with a volume
using a volume option.

## Brick Descriptions

A brick is described by a series of simple key/value pairs. Predefined
keys include:

* **media-type**
  The underlying media type for the brick. In its simplest form this
  might just be *ssd* or *disk*. More sophisticated users might use
  something like *15krpm* to represent a faster disk, or *perc-raid5*
  to represent a brick backed by a RAID controller.

* **rack** (and/or **row**)
  The physical location of the brick. Some policy rules might be set up
  to spread data across more than one rack.

User-defined keys are also allowed. For example, some users might use a
*tenant* or *security-level* tag as the basis for their placement
policy.

## Aggregation Rules

Aggregation rules are used to define how bricks should be combined into
subvo
[Gluster-devel] Data classification proposal
One of the things holding up our data classification efforts (which include tiering but also other stuff as well) has been the extension of the same conceptual model from the I/O path to the configuration subsystem and ultimately to the user experience. How does an administrator define a tiering policy without tearing their hair out? How does s/he define a mixed replication/erasure-coding setup without wanting to rip *our* hair out?

The included Markdown document attempts to remedy this by proposing one out of many possible models and user interfaces. It includes examples for some of the most common use cases, including the "replica 2.5" case we've been discussing recently. Constructive feedback would be greatly appreciated.

# Data Classification Interface

The data classification feature is extremely flexible, to cover use cases from SSD/disk tiering to rack-aware placement to security or other policies. With this flexibility comes complexity. While this complexity does not affect the I/O path much, it does affect both the volume-configuration subsystem and the user interface for setting placement policies. This document describes one possible model and user interface.

The model we use is based on two kinds of information: brick descriptions and aggregation rules. Both are contained in a configuration file (format TBD) which can be associated with a volume using a volume option.

## Brick Descriptions

A brick is described by a series of simple key/value pairs. Predefined keys include:

* **media-type**
  The underlying media type for the brick. In its simplest form this might just be *ssd* or *disk*. More sophisticated users might use something like *15krpm* to represent a faster disk, or *perc-raid5* to represent a brick backed by a RAID controller.

* **rack** (and/or **row**)
  The physical location of the brick. Some policy rules might be set up to spread data across more than one rack.

User-defined keys are also allowed.
For example, some users might use a *tenant* or *security-level* tag as the basis for their placement policy.

## Aggregation Rules

Aggregation rules define how bricks should be combined into subvolumes, and those potentially combined into higher-level subvolumes, and so on until all of the bricks are accounted for. Each aggregation rule consists of the following parts:

* **id**
  The base name of the subvolumes the rule will create. If a rule is applied multiple times this will yield *id-0*, *id-1*, and so on.

* **selector**
  A "filter" for which bricks or lower-level subvolumes the rule will aggregate. This is an expression similar to a *WHERE* clause in SQL, using brick/subvolume names and properties in lieu of columns. These values are then matched against literal values or regular expressions, using the usual set of boolean operators to arrive at a *yes* or *no* answer to the question of whether this brick/subvolume is affected by this rule.

* **group-size** (optional)
  The number of original bricks/subvolumes to be combined into each produced subvolume. The special default value zero means to collect all original bricks or subvolumes into one final subvolume. In this case, *id* is used directly instead of having a numeric suffix appended.

* **type** (optional)
  The type of the generated translator definition(s). Examples might include "AFR" to do replication, "EC" to do erasure coding, and so on. The more general data classification task includes the definition of new translators to do tiering and other kinds of filtering, but those are beyond the scope of this document. If no type is specified, cluster/dht will be used to do random placement among its constituents.

* **tag** and **option** (optional, repeatable)
  Additional tags and/or options to be applied to each newly created subvolume. See the "replica 2.5" example to see how this can be used.
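The selector and group-size semantics described above can be sketched in a few lines of Python. This is only an illustration of the intended behavior, not the proposal's implementation: the config format and matching engine are explicitly TBD, and the helper name `apply_rule` and the dict-based brick model are hypothetical.

```python
import re

def apply_rule(rule_id, bricks, selector, group_size=0):
    """One aggregation pass: filter bricks with a selector predicate,
    then group the matches into named subvolumes."""
    matched = [b for b in bricks if selector(b)]
    if group_size == 0:
        # Default: collect everything into one subvolume named by id.
        return {rule_id: matched}
    # Otherwise produce id-0, id-1, ... with group_size members each.
    groups = {}
    for i in range(0, len(matched), group_size):
        groups[f"{rule_id}-{i // group_size}"] = matched[i:i + group_size]
    return groups

bricks = [
    {"name": "host1:/brick", "media-type": "ssd"},
    {"name": "host2:/brick", "media-type": "disk"},
    {"name": "host3:/brick", "media-type": "disk"},
]

# Selector matching a regular expression over a property, as in the text.
fast = apply_rule("tier-1", bricks,
                  lambda b: re.fullmatch(r"ssd|15krpm", b["media-type"]) is not None)
# group-size of 2 yields numerically suffixed subvolume names.
slow = apply_rule("tier-2", bricks,
                  lambda b: b["media-type"] == "disk", group_size=2)
```

Here `fast` contains a single subvolume named `tier-1` (group-size 0, so no suffix), while `slow` produces `tier-2-0` holding the two disk bricks.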
Since each type might have unique requirements, such as ensuring that replication is done across machines or racks whenever possible, it is assumed that there will be corresponding type-specific scripts or functions to do the actual aggregation. This might even be made pluggable some day (TBD). Once all rule-based aggregation has been done, volume options are applied similarly to how they are now.

Astute readers might have noticed that it's possible for a brick to be aggregated more than once. This is intentional. If a brick is part of multiple aggregates, it will be automatically split into multiple bricks internally, but this will be invisible to the user.

## Examples

Let's start with a simple tiering example. Here's what the data-classification config file might look like.

    brick host1:/brick
        media-type = ssd

    brick host2:/brick
        media-type = disk

    brick host3:/brick
        media-type = disk

    rule tier-1
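Since the config file format is TBD, a minimal parser for brick stanzas like the ones in the example above can only be a sketch. Assuming the simple line-based layout shown (a `brick <path>` line opens a stanza, indented `key = value` lines add properties), it might look like this; `parse_bricks` is a hypothetical helper, not part of the proposal.

```python
def parse_bricks(text):
    """Parse 'brick <path>' stanzas with indented 'key = value' lines
    into a list of property dicts. Illustrative only: the real config
    format is TBD."""
    bricks = []
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue                      # skip blank lines
        if line.startswith("brick "):
            current = {"name": line.split(None, 1)[1]}
            bricks.append(current)
        elif current is not None and "=" in line:
            key, _, value = line.partition("=")
            current[key.strip()] = value.strip()
    return bricks

config = """\
brick host1:/brick
    media-type = ssd
brick host2:/brick
    media-type = disk
"""
parsed = parse_bricks(config)  # two bricks, one property each
```

The resulting list of key/value dicts is exactly the shape a rule's selector would then filter on.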