Re: [Gluster-devel] Data classification proposal

2014-06-27 Thread Jeff Darcy
> Sounds like a metadata server would fix this!
> 
> ( Yes, this is trolling hard.  Ignore. ;> )

Fortunately I'm on vacation now, so my head didn't explode.  ;)


Re: [Gluster-devel] Data classification proposal

2014-06-27 Thread Justin Clift
On 27/06/2014, at 8:39 AM, Xavier Hernandez wrote:
> On Thursday 26 June 2014 12:52:13 Dan Lambright wrote:
>> I don't think brick splitting implemented by LVM would affect directory
>> browsing any more than adding an additional brick would,
>> 
> 
> Yes, splitting a brick in LVM should be the same as adding a normal brick.
> The main problem I see is that adding normal bricks decreases the browsing
> speed, so splitting bricks will also degrade it.
> 
> I've seen a configuration with only 14 bricks (7 replica-2 sets) where
> browsing was not possible: directory listings with no more than a few
> hundred files took up to a minute or even more if the directory wasn't
> accessed for a long time. This is not usable.
> 
> This wasn't a hardware problem: servers had 2 CPUs with 6 cores each and
> hyperthreading (24 cores total), 64 GB of RAM and an InfiniBand network.
> The file system was formatted using XFS.
> 
> I fear what can happen if the number of bricks grows considerably through
> splitting without solving this problem first...


Sounds like a metadata server would fix this!

( Yes, this is trolling hard.  Ignore. ;> )

+ Justin




Re: [Gluster-devel] Data classification proposal

2014-06-27 Thread Vivek Agarwal

On 06/27/2014 12:46 AM, Shyamsundar Ranganathan wrote:

I wanted to add a different angle to the thought process around data-classified
volumes.

One of the reasons for classifying data (be it tiering or other schemes, like
mapping high-profile users to high-profile storage backends) is to handle the
data's protection differently.

With the current model, as discussed, we present the entire volume for
consumption by clients of the file system. We should also think about clients
like backup, where the backup policy for one sub-volume could differ from the
backup policy for another (or substitute geo-replication for backup).

I would think other such use cases/clients would need to view parts of the
volume rather than the whole when performing their function. For example, in
the backup case the fast tier could be backed up daily and the slow tier
weekly, in which case one would need volume graphs that split this view for
the client in question.
Agreed, the proposal sent by Joseph Fernandes a couple of days back 
suggests something similar. You might want to look at the presentation 
sent by him.

Subject line being "Proposal for Gluster Compliance Feature"

Regards,
Vivek

Just a thought.

Shyam

- Original Message -

From: "Dan Lambright" 
To: "Jeff Darcy" 
Cc: "Gluster Devel" 
Sent: Monday, June 23, 2014 4:48:13 PM
Subject: Re: [Gluster-devel] Data classification proposal

A frustrating aspect of Linux is the complexity of the /etc configuration file
formats (rsyslog.conf, logrotate, cron, yum repo files, etc.). In that spirit,
I would simplify the "select" in the data classification proposal (copied
below) to accept only a list of bricks/sub-tiers with wildcards ('*'), rather
than full-blown regular expressions or key/value pairs. I would drop the
"unclaimed" keyword and not have the keywords "media type" and "rack". It does
not seem necessary to introduce new keys for the underlying block device type
(SSD vs. disk) any more than we need to express the filesystem (XFS vs. ext4).
In other words, I think tiering can be fully expressed in the configuration
file while still abstracting the underlying storage. That said, the
configuration file could be built up by a CLI or GUI, and richer
expressibility could exist at that level.

example:

brick host1:/brick ssd-group0-1

brick host2:/brick ssd-group0-2

brick host3:/brick disk-group0-1

rule tier-1
select ssd-group0*

rule tier-2
select disk-group0

rule all
select tier-1
# use repeated "select" to establish order
select tier-2
type features/tiering

The filtering option's regular expressions seem hard to avoid. Even if just the
name of the file satisfies most use cases (that we know of), I do not think
there is any way to avoid regular expressions in the option for filters.
(Down the road, if we were to allow complete flexibility in how files can be
distributed across subvolumes, the filtering problems may start to look
similar to 90s-era packet classification with a solution along the lines of
the Berkeley packet filter.)

There may be different rules by which data is distributed at the "tiering"
level. For example, one tiering policy could be the fast tier (first
listed). It would be a "cache" for the slow tier (second listed). I think
the "option" keyword could handle that.

rule all
select tier-1
 # use repeated "select" to establish order
select tier-2
type features/tiering
option tier-cache, mode=writeback, dirty-watermark=80
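
Purely as an illustration of what that cache mode might mean (the option names
and values here are part of the proposal, not an implemented interface, and
everything below is a hypothetical sketch):

def files_to_demote(fast_tier_files, used_pct, dirty_watermark=80):
    """Pick files to write back to the slow tier once the fast tier is
    more than dirty_watermark percent full, coldest first."""
    if used_pct <= dirty_watermark:
        return []
    by_age = sorted(fast_tier_files, key=lambda f: f[1])  # (path, last_access)
    batch = max(1, len(by_age) // 10)  # arbitrary batch size for the sketch
    return [path for path, _ in by_age[:batch]]

# Example: fast tier at 85% usage, three files with last-access timestamps.
print(files_to_demote([("/a.log", 100), ("/b.db", 900), ("/c.tmp", 50)], 85))
# -> ['/c.tmp']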

Another example tiering policy could be based on compliance: when a file
needs to become read-only, it moves from the first-listed tier to the
second.

rule all
 select tier-1
 # use repeated "select" to establish order
 select tier-2
 type features/tiering
    option tier-retention

- Original Message -
From: "Jeff Darcy" 
To: "Gluster Devel" 
Sent: Friday, May 23, 2014 3:30:39 PM
Subject: [Gluster-devel] Data classification proposal

One of the things holding up our data classification efforts (which include
tiering but also other stuff as well) has been the extension of the same
conceptual model from the I/O path to the configuration subsystem and
ultimately to the user experience.  How does an administrator define a
tiering policy without tearing their hair out?  How does s/he define a mixed
replication/erasure-coding setup without wanting to rip *our* hair out?  The
included Markdown document attempts to remedy this by proposing one out of
many possible models and user interfaces.  It includes examples for some of
the most common use cases, including the "replica 2.5" case we've been
discussing recently.  Constructive feedback would be greatly appreciated.



Re: [Gluster-devel] Data classification proposal

2014-06-27 Thread Xavier Hernandez
On Thursday 26 June 2014 12:52:13 Dan Lambright wrote:
> I don't think brick splitting implemented by LVM would affect directory
> browsing any more than adding an additional brick would,
> 

Yes, splitting a brick in LVM should be the same as adding a normal brick.
The main problem I see is that adding normal bricks decreases the browsing
speed, so splitting bricks will also degrade it.

I've seen a configuration with only 14 bricks (7 replica-2 sets) where
browsing was not possible: directory listings with no more than a few hundred
files took up to a minute or even more if the directory wasn't accessed for
a long time. This is not usable.

This wasn't a hardware problem: servers had 2 CPUs with 6 cores each and
hyperthreading (24 cores total), 64 GB of RAM and an InfiniBand network. The
file system was formatted using XFS.

I fear what can happen if the number of bricks grows considerably through
splitting without solving this problem first...

Xavi


Re: [Gluster-devel] Data classification proposal

2014-06-26 Thread Shyamsundar Ranganathan
I wanted to add a different angle to the thought process around data-classified
volumes.

One of the reasons for classifying data (be it tiering or other schemes, like
mapping high-profile users to high-profile storage backends) is to handle the
data's protection differently.

With the current model, as discussed, we present the entire volume for
consumption by clients of the file system. We should also think about clients
like backup, where the backup policy for one sub-volume could differ from the
backup policy for another (or substitute geo-replication for backup).

I would think other such use cases/clients would need to view parts of the
volume rather than the whole when performing their function. For example, in
the backup case the fast tier could be backed up daily and the slow tier
weekly, in which case one would need volume graphs that split this view for
the client in question.

Just a thought.

Shyam

- Original Message -
> From: "Dan Lambright" 
> To: "Jeff Darcy" 
> Cc: "Gluster Devel" 
> Sent: Monday, June 23, 2014 4:48:13 PM
> Subject: Re: [Gluster-devel] Data classification proposal
> 
> A frustrating aspect of Linux is the complexity of the /etc configuration
> file formats (rsyslog.conf, logrotate, cron, yum repo files, etc.). In that
> spirit, I would simplify the "select" in the data classification proposal
> (copied below) to accept only a list of bricks/sub-tiers with wildcards
> ('*'), rather than full-blown regular expressions or key/value pairs. I
> would drop the "unclaimed" keyword and not have the keywords "media type"
> and "rack". It does not seem necessary to introduce new keys for the
> underlying block device type (SSD vs. disk) any more than we need to express
> the filesystem (XFS vs. ext4). In other words, I think tiering can be fully
> expressed in the configuration file while still abstracting the underlying
> storage. That said, the configuration file could be built up by a CLI or
> GUI, and richer expressibility could exist at that level.
> 
> example:
> 
> brick host1:/brick ssd-group0-1
> 
> brick host2:/brick ssd-group0-2
> 
> brick host3:/brick disk-group0-1
> 
> rule tier-1
>   select ssd-group0*
> 
> rule tier-2
>   select disk-group0
> 
> rule all
>   select tier-1
>   # use repeated "select" to establish order
>   select tier-2
>   type features/tiering
> 
> The filtering option's regular expressions seem hard to avoid. Even if just
> the name of the file satisfies most use cases (that we know of), I do not
> think there is any way to avoid regular expressions in the option for filters.
> (Down the road, if we were to allow complete flexibility in how files can be
> distributed across subvolumes, the filtering problems may start to look
> similar to 90s-era packet classification with a solution along the lines of
> the Berkeley packet filter.)
> 
> There may be different rules by which data is distributed at the "tiering"
> level. For example, one tiering policy could be the fast tier (first
> listed). It would be a "cache" for the slow tier (second listed). I think
> the "option" keyword could handle that.
> 
> rule all
>   select tier-1
># use repeated "select" to establish order
>   select tier-2
>   type features/tiering
>   option tier-cache, mode=writeback, dirty-watermark=80
> 
> Another example tiering policy could be based on compliance: when a file
> needs to become read-only, it moves from the first-listed tier to the
> second.
> 
> rule all
>select tier-1
># use repeated "select" to establish order
>select tier-2
>type features/tiering
>   option tier-retention
> 
> - Original Message -
> From: "Jeff Darcy" 
> To: "Gluster Devel" 
> Sent: Friday, May 23, 2014 3:30:39 PM
> Subject: [Gluster-devel] Data classification proposal
> 
> One of the things holding up our data classification efforts (which include
> tiering but also other stuff as well) has been the extension of the same
> conceptual model from the I/O path to the configuration subsystem and
> ultimately to the user experience.  How does an administrator define a
> tiering policy without tearing their hair out?  How does s/he define a mixed
> replication/erasure-coding setup without wanting to rip *our* hair out?  The
> included Markdown document attempts to remedy this by proposing one out of
> many possible models and user interfaces.  It includes examples for some of
> the most common use cases, including the "replica 2.5" case w

Re: [Gluster-devel] Data classification proposal

2014-06-26 Thread Dan Lambright
I don't think brick splitting implemented by LVM would affect directory 
browsing any more than adding an additional brick would,

- Original Message -
From: "Justin Clift" 
To: "Dan Lambright" 
Cc: "Shyamsundar Ranganathan" , "Gluster Devel" 

Sent: Thursday, June 26, 2014 12:01:16 PM
Subject: Re: [Gluster-devel] Data classification proposal

On 26/06/2014, at 4:54 PM, Dan Lambright wrote:
> Implementing brick splitting using LVM would allow you to treat each logical 
> volume (split) as an independent brick. Each split would have its own 
> .glusterfs subdirectory. I think this would help with taking snapshots as 
> well.


Would brick splitting make directory browsing latency even scarier?

+ Justin




Re: [Gluster-devel] Data classification proposal

2014-06-26 Thread Dan Lambright
Implementing brick splitting using LVM would allow you to treat each logical 
volume (split) as an independent brick. Each split would have its own 
.glusterfs subdirectory. I think this would help with taking snapshots as well.

- Original Message -
From: "Shyamsundar Ranganathan" 
To: "Krishnan Parthasarathi" 
Cc: "Gluster Devel" 
Sent: Thursday, June 26, 2014 11:13:48 AM
Subject: Re: [Gluster-devel] Data classification proposal

> > > For the short-term, wouldn't it be OK to disallow adding bricks that
> > > is not a multiple of group-size?
> > 
> > In the *very* short term, yes.  However, I think that will quickly
> > become an issue for users who try to deploy erasure coding because those
> > group sizes will be quite large.  As soon as we implement tiering, our
> > very next task - perhaps even before tiering gets into a release -
> > should be to implement automatic brick splitting.  That will bring other
> > benefits as well, such as variable replication levels to handle the
> > sanlock case, or overlapping replica sets to spread a failed brick's
> > load over more peers.
> > 
> 
> OK. Do you have some initial ideas on how we could 'split' bricks? I ask this
> to see if I can work on splitting bricks while the data classification format
> is
> being ironed out.

I see split bricks as creating a logical space for the new aggregate that the
brick belongs to. This may not need data movement etc., just a logical
branching at the root of the brick for its membership. Are there
counterexamples to this?

The exception would be if this changes the weightage of the brick across its
aggregates, for example size-based weightage for layout assignments, if we
are considering schemes of that nature.

So I can see this as follows:

THE_Brick: /data/bricka

Belongs to: aggregate 1 and aggregate 2, so it gets the following structure
beneath it:

/data/bricka/agg_1_ID/
/data/bricka/agg_2_ID/

Future splits of the brick add more aggregate-ID parents (not stating where
or what this ID is, but assume it is something that distinguishes
aggregates), and I would expect the xlator to send requests into its
aggregate parent and not the root.

One issue I see with this is that if we wanted to snap an aggregate, we would
snap the entire brick.
Another is how we distinguish the .glusterfs space across the aggregates.

Shyam


Re: [Gluster-devel] Data classification proposal

2014-06-26 Thread Justin Clift
On 26/06/2014, at 4:54 PM, Dan Lambright wrote:
> Implementing brick splitting using LVM would allow you to treat each logical 
> volume (split) as an independent brick. Each split would have its own 
> .glusterfs subdirectory. I think this would help with taking snapshots as 
> well.


Would brick splitting make directory browsing latency even scarier?

+ Justin




Re: [Gluster-devel] Data classification proposal

2014-06-26 Thread Shyamsundar Ranganathan
> > > For the short-term, wouldn't it be OK to disallow adding bricks that
> > > is not a multiple of group-size?
> > 
> > In the *very* short term, yes.  However, I think that will quickly
> > become an issue for users who try to deploy erasure coding because those
> > group sizes will be quite large.  As soon as we implement tiering, our
> > very next task - perhaps even before tiering gets into a release -
> > should be to implement automatic brick splitting.  That will bring other
> > benefits as well, such as variable replication levels to handle the
> > sanlock case, or overlapping replica sets to spread a failed brick's
> > load over more peers.
> > 
> 
> OK. Do you have some initial ideas on how we could 'split' bricks? I ask this
> to see if I can work on splitting bricks while the data classification format
> is
> being ironed out.

I see split bricks as creating a logical space for the new aggregate that the
brick belongs to. This may not need data movement etc., just a logical
branching at the root of the brick for its membership. Are there
counterexamples to this?

The exception would be if this changes the weightage of the brick across its
aggregates, for example size-based weightage for layout assignments, if we
are considering schemes of that nature.

So I can see this as follows:

THE_Brick: /data/bricka

Belongs to: aggregate 1 and aggregate 2, so it gets the following structure
beneath it:

/data/bricka/agg_1_ID/
/data/bricka/agg_2_ID/

Future splits of the brick add more aggregate-ID parents (not stating where
or what this ID is, but assume it is something that distinguishes
aggregates), and I would expect the xlator to send requests into its
aggregate parent and not the root.

One issue I see with this is that if we wanted to snap an aggregate, we would
snap the entire brick.
Another is how we distinguish the .glusterfs space across the aggregates.
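
A minimal sketch of the path handling implied here, assuming each split simply
works under its aggregate-ID directory beneath the brick root (illustrative
only, not existing xlator code; all names are made up):

import os

def split_brick_path(brick_root, aggregate_id, rel_path):
    """Map a volume-relative path into the sub-directory owned by one
    aggregate, so that split's xlator never touches the brick root."""
    return os.path.join(brick_root, aggregate_id, rel_path.lstrip("/"))

print(split_brick_path("/data/bricka", "agg_1_ID", "/dir/file"))
# -> /data/bricka/agg_1_ID/dir/file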

Shyam


Re: [Gluster-devel] Data classification proposal

2014-06-26 Thread Xavier Hernandez
On Wednesday 25 June 2014 11:42:10 Jeff Darcy wrote:
> > How will space be allocated to each new sub-brick? Some sort of
> > thin-provisioning, or will it be distributed evenly on each split?
> 
> That's left to the user.  The latest proposal, based on discussion of
> the first, is here:
> 
> https://docs.google.com/presentation/d/1e8tuh9DKNi9eCMrdt5vetppn1D3BiJSmfR7l
> DW2wRvA/edit?usp=sharing
> 

Thanks. I didn't know that document.

> That has an example of assigning percentages to the sub-bricks created
> by a rule (i.e. a subvolume in a potentially multi-tiered
> configuration).  Other possibilities include relative weights used to
> determine percentages, or total thin provisioning where sub-bricks
> compete freely for available space.  It's certainly a fruitful area for
> discussion.
> 
> > If using thin-provisioning, it will be hard to determine real
> > available space.  If using a fixed amount, we can get to scenarios
> > where a file cannot be written even if there seems to be enough free
> > space. This can already happen today if using very big files on almost
> > full bricks. I think brick splitting can accentuate this.
> 
> Is this really common outside of test environments, given the sizes of
> modern disks and files?  Even in cases where it might happen, doesn't
> striping address it?

Considering that SSD sizes are still relatively small and that each brick can
be split many times depending on data classification rules, I don't see it as
a rare case in some scenarios. Striping can solve the problem at the expense
of increasing fault probability, requiring more SSDs to compensate.

> 
> We have a whole bunch of problems in this area.  If multiple bricks are
> on the same local file system, their capacity will be double-counted.
> If a second local file system is mounted over part of a brick, the
> additional space won't be counted at all.  We do need a general solution
> to this, but I don't think that solution needs to be part of data
> classification unless there's a specific real-world scenario that DC
> makes worse.
> 

Agreed. This is a problem that should be solved independently of data 
classification.

> > Also, the addition of multiple layered DHT translators, as it's
> > implemented today, could add a lot more latency, especially on
> > directory listings.
> 
> With http://review.gluster.org/#/c/7702/ this should be less of a
> problem.

This solves one of the problems. Directory listing is still one of the worst 
problems I've found with gluster and I think it's not solved by this patch.

> Also, lookups across multiple tiers are likely to be rare in
> most use cases.  For example, for the name-based filtering (sanlock)
> case, a given file should only *ever* be in one tier so only that tier
> would need to be searched.  For the activity-based tiering case, the
> vast majority of lookups will be for hot files which are (not
> accidentally) in the first tier.

I think this is true as long as the rules are not modified. But if we allow
the rules to be modified dynamically once the volume is already running, we
will have the same problem as with rebalance, since there will be files not
residing in the right tier for some time, and we need to find them
nonetheless. This could be alleviated, though, by using something similar to
the previous patch once the volume reaches a steady state again.

> The only real problem is with *failed*
> lookups, e.g. during create.  We can address that by adding "stubs"
> (similar to linkfiles) in the upper tier, but I'd still want to wait
> until it's proven necessary.  What I would truly resist is any solution
> that involves building tier awareness directly into (one instance of)
> DHT.  Besides requiring a much larger development effort in the present,
> it would throw away the benefit of modularity and hamper other efforts
> in the future.  We need tiering and brick splitting *now*, especially as
> a complement to erasure coding which many won't be able to use
> otherwise.  As far as I can tell, stacking translators is the fastest
> way to get there.
> 

I agree that it's not good to create specific solutions for a problem when
it's possible to make a more generic solution that could be used to add more
features. However, I'm not so sure that brick splitting is the best solution.
Basically we need to solve two problems right now: tiering and growing a
volume brick by brick. Brick splitting is one way to implement them, but I
don't think it's the only one.

> > Another problem I see is that splitting bricks will require a
> > rebalance, which is a costly operation. It doesn't seem right to
> > require such an expensive operation every time you add a new condition
> > to an already created volume.
> 
> Yes, rebalancing is expensive, but that's no different for split bricks
> than whole ones.  Any time you change the definition of what should go
> where, you'll have to move some data into compliance and that's
> expensive.  However, such operations ar

Re: [Gluster-devel] Data classification proposal

2014-06-25 Thread Jeff Darcy
> If I understand correctly the proposed data-classification
> architecture, each server will have a number of bricks that will be
> dynamically modified as needed: as more data-classifying conditions
> are defined, a new layer of translators will be added (a new DHT or
> AFR, or something else) and some or all existing bricks will be split
> to accommodate the new and, maybe, overlapping condition.

Correct.

> How will space be allocated to each new sub-brick? Some sort of
> thin-provisioning, or will it be distributed evenly on each split?

That's left to the user.  The latest proposal, based on discussion of
the first, is here:

https://docs.google.com/presentation/d/1e8tuh9DKNi9eCMrdt5vetppn1D3BiJSmfR7lDW2wRvA/edit?usp=sharing

That has an example of assigning percentages to the sub-bricks created
by a rule (i.e. a subvolume in a potentially multi-tiered
configuration).  Other possibilities include relative weights used to
determine percentages, or total thin provisioning where sub-bricks
compete freely for available space.  It's certainly a fruitful area for
discussion.

> If using thin-provisioning, it will be hard to determine real
> available space.  If using a fixed amount, we can get to scenarios
> where a file cannot be written even if there seems to be enough free
> space. This can already happen today if using very big files on almost
> full bricks. I think brick splitting can accentuate this.

Is this really common outside of test environments, given the sizes of
modern disks and files?  Even in cases where it might happen, doesn't
striping address it?

We have a whole bunch of problems in this area.  If multiple bricks are
on the same local file system, their capacity will be double-counted.
If a second local file system is mounted over part of a brick, the
additional space won't be counted at all.  We do need a general solution
to this, but I don't think that solution needs to be part of data
classification unless there's a specific real-world scenario that DC
makes worse.

> Also, the addition of multiple layered DHT translators, as it's
> implemented today, could add a lot more latency, especially on
> directory listings.

With http://review.gluster.org/#/c/7702/ this should be less of a
problem.  Also, lookups across multiple tiers are likely to be rare in
most use cases.  For example, for the name-based filtering (sanlock)
case, a given file should only *ever* be in one tier so only that tier
would need to be searched.  For the activity-based tiering case, the
vast majority of lookups will be for hot files which are (not
accidentally) in the first tier.  The only real problem is with *failed*
lookups, e.g. during create.  We can address that by adding "stubs"
(similar to linkfiles) in the upper tier, but I'd still want to wait
until it's proven necessary.  What I would truly resist is any solution
that involves building tier awareness directly into (one instance of)
DHT.  Besides requiring a much larger development effort in the present,
it would throw away the benefit of modularity and hamper other efforts
in the future.  We need tiering and brick splitting *now*, especially as
a complement to erasure coding which many won't be able to use
otherwise.  As far as I can tell, stacking translators is the fastest
way to get there.

> Another problem I see is that splitting bricks will require a
> rebalance, which is a costly operation. It doesn't seem right to
> require such an expensive operation every time you add a new condition
> to an already created volume.

Yes, rebalancing is expensive, but that's no different for split bricks
than whole ones.  Any time you change the definition of what should go
where, you'll have to move some data into compliance and that's
expensive.  However, such operations are likely to be very rare.  It's
highly likely that most uses of this feature will consist of a simple
two-tier setup defined when the volume is created and never changed
thereafter, so the only rebalancing would be within a tier - i.e. the
exact same thing we do today in homogeneous volumes (maybe even slightly
better).  The only use case I can think of that would involve *frequent*
tier-config changes is multi-tenancy, but adding a new tenant should
only affect new data and not require migration of old data.


Re: [Gluster-devel] Data classification proposal

2014-06-25 Thread Xavier Hernandez
On Wednesday 25 June 2014 08:35:05 Jeff Darcy wrote:
> > For the short-term, wouldn't it be OK to disallow adding bricks that
> > is not a multiple of group-size?
> 
> In the *very* short term, yes.  However, I think that will quickly
> become an issue for users who try to deploy erasure coding because those
> group sizes will be quite large.  As soon as we implement tiering, our
> very next task - perhaps even before tiering gets into a release -
> should be to implement automatic brick splitting.  That will bring other
> benefits as well, such as variable replication levels to handle the
> sanlock case, or overlapping replica sets to spread a failed brick's
> load over more peers.

If I understand correctly the proposed data-classification architecture, each 
server will have a number of bricks that will be dynamically modified as 
needed: as more data-classifying conditions are defined, a new layer of 
translators will be added (a new DHT or AFR, or something else) and some or 
all existing bricks will be split to accommodate the new and, maybe, 
overlapping condition.

How will space be allocated to each new sub-brick? Some sort of
thin-provisioning, or will it be distributed evenly on each split?

If using thin-provisioning, it will be hard to determine real available space.
If using a fixed amount, we can get to scenarios where a file cannot be
written even if there seems to be enough free space. This can already happen
today if using very big files on almost full bricks. I think brick splitting
can accentuate this.

Also, the addition of multiple layered DHT translators, as it's implemented
today, could add a lot more latency, especially on directory listings.

Another problem I see is that splitting bricks will require a rebalance, which
is a costly operation. It doesn't seem right to require such an expensive
operation every time you add a new condition to an already created volume.

Maybe I've missed something important?

Thanks,

Xavi


Re: [Gluster-devel] Data classification proposal

2014-06-25 Thread Krishnan Parthasarathi

- Original Message -
> > For the short-term, wouldn't it be OK to disallow adding bricks that
> > is not a multiple of group-size?
> 
> In the *very* short term, yes.  However, I think that will quickly
> become an issue for users who try to deploy erasure coding because those
> group sizes will be quite large.  As soon as we implement tiering, our
> very next task - perhaps even before tiering gets into a release -
> should be to implement automatic brick splitting.  That will bring other
> benefits as well, such as variable replication levels to handle the
> sanlock case, or overlapping replica sets to spread a failed brick's
> load over more peers.
> 

OK. Do you have some initial ideas on how we could 'split' bricks? I ask this
to see if I can work on splitting bricks while the data classification format is
being ironed out.

thanks,
Krish


Re: [Gluster-devel] Data classification proposal

2014-06-25 Thread Jeff Darcy
> For the short-term, wouldn't it be OK to disallow adding bricks that
> is not a multiple of group-size?

In the *very* short term, yes.  However, I think that will quickly
become an issue for users who try to deploy erasure coding because those
group sizes will be quite large.  As soon as we implement tiering, our
very next task - perhaps even before tiering gets into a release -
should be to implement automatic brick splitting.  That will bring other
benefits as well, such as variable replication levels to handle the
sanlock case, or overlapping replica sets to spread a failed brick's
load over more peers.


Re: [Gluster-devel] Data classification proposal

2014-06-24 Thread Krishnan Parthasarathi
Jeff,

- Original Message -
> > Am I right in understanding that the value for media-type is not
> > interpreted beyond the scope of matching rules? That is to say, we
> > don't need/have any notion of media-types that are type-checked internally
> > when forming (sub)volumes using the specified rules.
> 
> Exactly.  To us it's just an opaque ID.

OK. That makes sense.

> 
> > Should the no. of bricks or lower-level subvolumes that match the rule
> > be an exact multiple of group-size?
> 
> Good question.  I think users see the current requirement to add bricks
> in multiples of the replica/stripe size as an annoyance.  This will only
> get worse with erasure coding where the group size is larger.  On the
> other hand, we do need to make sure that members of a group are on
> different machines.  This is why I think we need to be able to split
> bricks, so that we can use overlapping replica/erasure sets.  For
> example, if we have five bricks and two-way replication, we can split
> bricks to get a multiple of two and life's good again.  So *long term* I
> think we can/should remove any restriction on users, but there are a
> whole bunch of unsolved issues around brick splitting.  I'm not sure
> what to do in the short term.

For the short-term, wouldn't it be OK to disallow adding bricks that is not
a multiple of group-size?

> 
> > > Here's a more complex example that adds replication and erasure
> > > coding to the mix.
> > >
> > > # Assume 20 hosts, four fast and sixteen slow (named
> > > appropriately).
> > >
> > > rule tier-1
> > > select *fast*
> > > group-size 2
> > > type cluster/afr
> > >
> > > rule tier-2
> > > # special pattern matching otherwise-unused bricks
> > > select %{unclaimed}
> > > group-size 8
> > > type cluster/ec parity=2
> > > # i.e. two groups, each six data plus two parity
> > >
> > > rule all
> > > select tier-1
> > > select tier-2
> > > type features/tiering
> > >
> >
> > In the above example we would have 2 subvolumes each containing 2
> > bricks that would be aggregated by rule tier-1. Let's call those
> > subvolumes tier-1-fast-0 and tier-1-fast-1.  Both of these subvolumes
> > are AFR-based two-way replicated subvolumes.  Are these instances of
> > tier-1-* composed using cluster/dht by the default semantics?
> 
> Yes.  Any time we have multiple subvolumes and no other specified way to
> combine them into one, we just slap DHT on top.  We do this already at
> the top level; with data classification we might do it at lower levels
> too.
> 

thanks,
Krish


Re: [Gluster-devel] Data classification proposal

2014-06-24 Thread Jeff Darcy
> It's possible to express your example using lists if their entries are
> allowed to overlap. I see that you wanted a way to express a matrix
> (overlapping rules) with gluster's tree-like syntax as a backdrop.
> 
> A polytree may be a better term than matrix (DAG without cycles), i.e. when
> there are overlaps a node in the graph gets multiple in-arcs.
> 
> Syntax aside, we seem to part on "where" to solve the problem: config file or
> UX. I would prefer the UX to have the logic to build the configuration file,
> given how complex it can be. My preference would be for the config file to be
> mostly "read only" with extremely simple syntax.
> 
> I'll put some more thought into this and believe this discussion has
> illuminated some good points.
> 
> Brick: host1:/SSD1  SSD1
> Brick: host1:/SSD2  SSD2
> Brick: host2:/SSD3  SSD3
> Brick: host2:/SSD4  SSD4
> Brick: host1:/DISK1 DISK1
> 
> rule rack4:
>   select SSD1, SSD2, DISK1
> 
> # some files should go on ssds in rack 4
> rule A:
>   option filter-condition *.lock
>   select SSD1, SSD2
> 
> # some files should go on ssds anywhere
> rule B:
>   option filter-condition *.out
>   select SSD1, SSD2, SSD3, SSD4
> 
> # some files should go anywhere in rack 4
> rule C
>   option filter-condition *.c
>   select rack4
> 
> # some files we just don't care
> rule D
>   option filter-condition *.h
>   select SSD1, SSD2, SSD3, SSD4, DISK1
> 
> volume:
>   option filter-condition A,B,C,D

This seems to leave us with two options.  One option is that "select"
supports only explicit enumeration, so that adding a brick means editing
multiple rules that apply to it.  The other option is that "select"
supports wildcards.  Using a regex to match parts of a name is
effectively the same as matching the explicit tags we started with,
except that expressing complex Boolean conditions using a regex can get
more than a bit messy.  As Jamie Zawinski famously said:

> Some people, when confronted with a problem, think "I know, I'll use
> regular expressions." Now they have two problems.

I think it's nice to support regexes instead of plain strings in
lower-level rules, but relying on them alone to express complex
higher-level policies would IMO be a mistake.  Likewise, defining a
proper syntax for a config file seems both more flexible and easier than
defining one for a CLI, where the parsing options are even more limited.
What happens when someone wants to use Puppet (for example) to set this
up?  Then the user would express their will in Puppet syntax, which
would have to convert it to our CLI syntax, which would convert it to
our config-file syntax.  Why not allow them to skip a step where
information might get lost or mangled in translation?  We can still have
CLI commands to do the most common kinds of manipulation, as we do for
volfiles, but the final form can be more extensible.  It will still be
more comprehensible than Ceph's CRUSH maps.
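
To make the selector part concrete, here is a minimal sketch (illustrative
only, not proposed code; brick names and properties are made up) of a rule
"select" evaluated over brick names and key/value properties, where a name
regex and property tests combine with ordinary boolean logic instead of one
large regex:

import re

# Bricks as names plus arbitrary key/value properties, as in the
# brick-description part of the proposal.
bricks = {
    "serv1-b1": {"media-type": "ssd",  "rack": "4"},
    "serv2-b1": {"media-type": "disk", "rack": "4"},
    "serv3-b1": {"media-type": "ssd",  "rack": "7"},
}

def select(name_re=None, props=None):
    """Return bricks whose name matches name_re (if given) and whose
    properties satisfy every key=value test in props (if given)."""
    out = []
    for name, tags in bricks.items():
        if name_re and not re.search(name_re, name):
            continue
        if props and any(tags.get(k) != v for k, v in props.items()):
            continue
        out.append(name)
    return sorted(out)

print(select(name_re=r"^serv[12]-"))                      # name-based, like serv[12]*
print(select(props={"media-type": "ssd", "rack": "4"}))   # property-based AND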


Re: [Gluster-devel] Data classification proposal

2014-06-24 Thread Dan Lambright

It's possible to express your example using lists if their entries are
allowed to overlap. I see that you wanted a way to express a matrix
(overlapping rules) with gluster's tree-like syntax as a backdrop.

A polytree may be a better term than matrix (DAG without cycles), i.e. when
there are overlaps a node in the graph gets multiple in-arcs.

Syntax aside, we seem to part on "where" to solve the problem: config file or
UX. I would prefer the UX to have the logic to build the configuration file,
given how complex it can be. My preference would be for the config file to be
mostly "read only" with extremely simple syntax.

I'll put some more thought into this and believe this discussion has 
illuminated some good points.

Brick: host1:/SSD1  SSD1
Brick: host1:/SSD2  SSD2
Brick: host2:/SSD3  SSD3
Brick: host2:/SSD4  SSD4
Brick: host1:/DISK1 DISK1

rule rack4: 
  select SSD1, SSD2, DISK1

# some files should go on ssds in rack 4
rule A: 
  option filter-condition *.lock
  select SSD1, SSD2

# some files should go on ssds anywhere
rule B: 
  option filter-condition *.out
  select SSD1, SSD2, SSD3, SSD4

# some files should go anywhere in rack 4
rule C 
  option filter-condition *.c
  select rack4

# some files we just don't care
rule D
  option filter-condition *.h
  select SSD1, SSD2, SSD3, SSD4, DISK1

volume:
  option filter-condition A,B,C,D
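
A minimal sketch (illustrative only, not an implementation) of how such
filter-conditions could route a file by name to the bricks its rule selects,
assuming fnmatch-style patterns like the ones in the rules above:

from fnmatch import fnmatch

# Rules as in the example above: (name, filename pattern, bricks selected).
rules = [
    ("A", "*.lock", ["SSD1", "SSD2"]),
    ("B", "*.out",  ["SSD1", "SSD2", "SSD3", "SSD4"]),
    ("C", "*.c",    ["SSD1", "SSD2", "DISK1"]),                  # i.e. rack4
    ("D", "*.h",    ["SSD1", "SSD2", "SSD3", "SSD4", "DISK1"]),
]

def placement(filename):
    """Return the first rule whose filter-condition matches the file name,
    plus the bricks that rule selects."""
    for name, pattern, bricks in rules:
        if fnmatch(filename, pattern):
            return name, bricks
    return None, []

print(placement("app.lock"))   # ('A', ['SSD1', 'SSD2'])
print(placement("main.c"))     # ('C', ['SSD1', 'SSD2', 'DISK1'])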

- Original Message -
From: "Jeff Darcy" 
To: "Dan Lambright" 
Cc: "Gluster Devel" 
Sent: Monday, June 23, 2014 7:11:44 PM
Subject: Re: [Gluster-devel] Data classification proposal

> Rather than using the keyword "unclaimed", my instinct was to
> explicitly list which bricks have not been "claimed".  Perhaps you
> have something more subtle in mind, it is not apparent to me from your
> response. Can you provide an example of why it is necessary and a list
> could not be provided in its place? If the list is somehow "difficult
> to figure out", due to a particularly complex setup or some such, I'd
> prefer a CLI/GUI build that list rather than having sysadmins
> hand-edit this file.

It's not *difficult* to make sure every brick has been enumerated by
some rule, and that there are no overlaps, but it's certainly tedious
and error prone.  Imagine that a user has bricks on four
machines, using names like serv1-b1, serv1-b2, ..., serv4-b6.
Accordingly, they've set up rules to put serv1* into one set and
serv[234]* into another set (which is already more flexibility than I
think your proposal gave them).  Now when they add serv5 they need an
extra step to add it to the tiering config, which wouldn't have been
necessary if we supported defaults.  What percentage of users would
forget that step at least once?  I don't know for sure, but I'd guess
it's pretty high.

Having a CLI or GUI create configs just means that we have to add
support for defaults there instead.  We'd still have to implement the
same logic, they'd still have to specify the same thing.  That just
seems like moving the problem around instead of solving it.

> The key-value piece seems like syntactic sugar - an "alias". If so,
> let the name itself be the alias. No notions of SSD or physical
> location need be inserted. Unless I am missing that it *is* necessary,
> I stand by that value judgement as a philosophy of not putting
> anything into the configuration file that you don't require. Can you
> provide an example of where it is necessary?

OK...
-


Brick: SSD1
Brick: SSD2
Brick: SSD3
Brick: SSD4
Brick: DISK1

rack4: SSD1, SSD2, DISK1

filter A : SSD1, SSD2

filter B : SSD1,SSD2, SSD3, SSD4

filter C: rack4

filter D: SSD1, SSD2, SSD3, SSD4, DISK1

meta-filter: filter A, filter B, filter C, filter D

  * some files should go on ssds in rack 4

  * some files should go on ssds anywhere

  * some files should go anywhere in rack 4

  * some files we just don't care

Notice how the rules *overlap*.  We can't support that if our syntax
only allows the user to express a list (or list of lists).  If the list
is ordered by type, we can't also support location-based rules.  If the
list is ordered by location, we lose type-based rules instead.   Brick
properties create a matrix, with an unknown number of dimensions (e.g.
security level, tenant ID, and so on as well as type and location).  The
logical way to represent such a space for rule-matching purposes is to
let users define however many dimensions (keys) as they want and as many
values for each dimension as they want.

Whether the exact string "type" or "unclaimed" appears anywhere isn't
the issue.  What matters is that the *semantics* of assigning properties
to a brick have to be more sophisticated than just assigning each a
position in a list, and we need a syntax that supports those semantics.
Otherw

Re: [Gluster-devel] Data classification proposal

2014-06-24 Thread Jeff Darcy
> Am I right in understanding that the value for media-type is not
> interpreted beyond the scope of matching rules? That is to say, we
> don't need/have any notion of media-types that are type-checked internally
> when forming (sub)volumes using the specified rules.

Exactly.  To us it's just an opaque ID.

> Should the no. of bricks or lower-level subvolumes that match the rule
> be an exact multiple of group-size?

Good question.  I think users see the current requirement to add bricks
in multiples of the replica/stripe size as an annoyance.  This will only
get worse with erasure coding where the group size is larger.  On the
other hand, we do need to make sure that members of a group are on
different machines.  This is why I think we need to be able to split
bricks, so that we can use overlapping replica/erasure sets.  For
example, if we have five bricks and two-way replication, we can split
bricks to get a multiple of two and life's good again.  So *long term* I
think we can/should remove any restriction on users, but there are a
whole bunch of unsolved issues around brick splitting.  I'm not sure
what to do in the short term.

> > Here's a more complex example that adds replication and erasure
> > coding to the mix.
> >
> > # Assume 20 hosts, four fast and sixteen slow (named
> > appropriately).
> >
> > rule tier-1
> > select *fast*
> > group-size 2
> > type cluster/afr
> >
> > rule tier-2
> > # special pattern matching otherwise-unused bricks
> > select %{unclaimed}
> > group-size 8
> > type cluster/ec parity=2
> > # i.e. two groups, each six data plus two parity
> >
> > rule all
> > select tier-1
> > select tier-2
> > type features/tiering
> >
>
> In the above example we would have 2 subvolumes each containing 2
> bricks that would be aggregated by rule tier-1. Let's call those
> subvolumes tier-1-fast-0 and tier-1-fast-1.  Both of these subvolumes
> are AFR-based two-way replicated subvolumes.  Are these instances of
> tier-1-* composed using cluster/dht by the default semantics?

Yes.  Any time we have multiple subvolumes and no other specified way to
combine them into one, we just slap DHT on top.  We do this already at
the top level; with data classification we might do it at lower levels
too.
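
As a rough sketch of those default semantics (illustrative Python, not actual
glusterd logic; host names are made up): a rule with group-size N chunks its
matched bricks into id-0, id-1, ... subvolumes of the given type, and any
siblings left without an explicit combining type get DHT on top.

def apply_rule(rule_id, matched, group_size, xl_type):
    # group-size 0 means collect everything into one subvolume named rule_id
    if group_size == 0:
        return [(rule_id, xl_type, matched)]
    groups = [matched[i:i + group_size]
              for i in range(0, len(matched), group_size)]
    return [("%s-%d" % (rule_id, n), xl_type, g) for n, g in enumerate(groups)]

fast = ["host1:/fast", "host2:/fast", "host3:/fast", "host4:/fast"]
print(apply_rule("tier-1", fast, group_size=2, xl_type="cluster/afr"))
# -> [('tier-1-0', 'cluster/afr', ['host1:/fast', 'host2:/fast']),
#     ('tier-1-1', 'cluster/afr', ['host3:/fast', 'host4:/fast'])]
# With no explicit combining type, these two siblings would then get a
# cluster/dht subvolume slapped on top, exactly as described above.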


Re: [Gluster-devel] Data classification proposal

2014-06-24 Thread Krishnan Parthasarathi
Jeff,

I have a few questions regarding the rules syntax and how they apply.
I think this is different in spirit from the discussion Dan has started,
so I am keeping it separate. See questions inline.

- Original Message -
> One of the things holding up our data classification efforts (which include
> tiering but also other stuff as well) has been the extension of the same
> conceptual model from the I/O path to the configuration subsystem and
> ultimately to the user experience.  How does an administrator define a
> tiering policy without tearing their hair out?  How does s/he define a mixed
> replication/erasure-coding setup without wanting to rip *our* hair out?  The
> included Markdown document attempts to remedy this by proposing one out of
> many possible models and user interfaces.  It includes examples for some of
> the most common use cases, including the "replica 2.5" case we've been
> discussing recently.  Constructive feedback would be greatly appreciated.
> 
> 
> 
> # Data Classification Interface
> 
> The data classification feature is extremely flexible, to cover use cases
> from
> SSD/disk tiering to rack-aware placement to security or other policies.  With
> this flexibility comes complexity.  While this complexity does not affect the
> I/O path much, it does affect both the volume-configuration subsystem and the
> user interface to set placement policies.  This document describes one
> possible
> model and user interface.
> 
> The model we used is based on two kinds of information: brick descriptions
> and
> aggregation rules.  Both are contained in a configuration file (format TBD)
> which can be associated with a volume using a volume option.
> 
> ## Brick Descriptions
> 
> A brick is described by a series of simple key/value pairs.  Predefined keys
> include:
> 
>  * **media-type**
>The underlying media type for the brick.  In its simplest form this might
>just be *ssd* or *disk*.  More sophisticated users might use something
>like
>*15krpm* to represent a faster disk, or *perc-raid5* to represent a brick
>backed by a RAID controller.

Am I right in understanding that the value for media-type is not interpreted
beyond the scope of matching rules? That is to say, we don't need/have any
notion of media-types that are type-checked internally when forming
(sub)volumes using the specified rules.

> 
>  * **rack** (and/or **row**)
>The physical location of the brick.  Some policy rules might be set up to
>spread data across more than one rack.
> 
> User-defined keys are also allowed.  For example, some users might use a
> *tenant* or *security-level* tag as the basis for their placement policy.
> 
> ## Aggregation Rules
> 
> Aggregation rules are used to define how bricks should be combined into
> subvolumes, and those potentially combined into higher-level subvolumes, and
> so
> on until all of the bricks are accounted for.  Each aggregation rule consists
> of the following parts:
> 
>  * **id**
>The base name of the subvolumes the rule will create.  If a rule is
>applied
>multiple times this will yield *id-0*, *id-1*, and so on.
> 
>  * **selector**
>A "filter" for which bricks or lower-level subvolumes the rule will
>aggregate.  This is an expression similar to a *WHERE* clause in SQL,
>using
>brick/subvolume names and properties in lieu of columns.  These values are
>then matched against literal values or regular expressions, using the
>usual
>set of boolean operators to arrive at a *yes* or *no* answer to the
>question
>of whether this brick/subvolume is affected by this rule.
> 
>  * **group-size** (optional)
>The number of original bricks/subvolumes to be combined into each produced
>subvolume.  The special default value zero means to collect all original
>bricks or subvolumes into one final subvolume.  In this case, *id* is used
>directly instead of having a numeric suffix appended.

Should the no. of bricks or lower-level subvolumes that match the rule be an 
exact
multiple of group-size?

> 
>  * **type** (optional)
>The type of the generated translator definition(s).  Examples might
>include
>"AFR" to do replication, "EC" to do erasure coding, and so on.  The more
>general data classification task includes the definition of new
>translators
>to do tiering and other kinds of filtering, but those are beyond the scope
>of this document.  If no type is specified, cluster/dht will be used to do
>random placement among its constituents.
> 
>  * **tag** and **option** (optional, repeatable)
>Additional tags and/or options to be applied to each newly created
>subvolume.  See the "replica 2.5" example to see how this can be used.
> 
> Since each type might have unique requirements, such as ensuring that
> replication is done across machines or racks whenever possible, it is assumed
> that there will be corresponding type-specific scripts or functions to do the
> a

Re: [Gluster-devel] Data classification proposal

2014-06-23 Thread Jeff Darcy
> Rather than using the keyword "unclaimed", my instinct was to
> explicitly list which bricks have not been "claimed".  Perhaps you
> have something more subtle in mind, it is not apparent to me from your
> response. Can you provide an example of why it is necessary and a list
> could not be provided in its place? If the list is somehow "difficult
> to figure out", due to a particularly complex setup or some such, I'd
> prefer a CLI/GUI build that list rather than having sysadmins
> hand-edit this file.

It's not *difficult* to make sure every brick has been enumerated by
some rule, and that there are no overlaps, but it's certainly tedious
and error prone.  Imagine that a user has bricks on four
machines, using names like serv1-b1, serv1-b2, ..., serv4-b6.
Accordingly, they've set up rules to put serv1* into one set and
serv[234]* into another set (which is already more flexibility than I
think your proposal gave them).  Now when they add serv5 they need an
extra step to add it to the tiering config, which wouldn't have been
necessary if we supported defaults.  What percentage of users would
forget that step at least once?  I don't know for sure, but I'd guess
it's pretty high.

Having a CLI or GUI create configs just means that we have to add
support for defaults there instead.  We'd still have to implement the
same logic, they'd still have to specify the same thing.  That just
seems like moving the problem around instead of solving it.

> The key-value piece seems like syntactic sugar - an "alias". If so,
> let the name itself be the alias. No notions of SSD or physical
> location need be inserted. Unless I am missing that it *is* necessary,
> I stand by that value judgement as a philosophy of not putting
> anything into the configuration file that you don't require. Can you
> provide an example of where it is necessary?

OK...

  * some files should go on ssds in rack 4

  * some files should go on ssds anywhere

  * some files should go anywhere in rack 4

  * some files we just don't care

Notice how the rules *overlap*.  We can't support that if our syntax
only allows the user to express a list (or list of lists).  If the list
is ordered by type, we can't also support location-based rules.  If the
list is ordered by location, we lose type-based rules instead.   Brick
properties create a matrix, with an unknown number of dimensions (e.g.
security level, tenant ID, and so on as well as type and location).  The
logical way to represent such a space for rule-matching purposes is to
let users define however many dimensions (keys) as they want and as many
values for each dimension as they want.

Whether the exact string "type" or "unclaimed" appears anywhere isn't
the issue.  What matters is that the *semantics* of assigning properties
to a brick have to be more sophisticated than just assigning each a
position in a list, and we need a syntax that supports those semantics.
Otherwise we'll end up solving the same UX problems again and again each
time we add a feature that involves treating bricks or data differently.
Each time we'll probably do it a little differently and confuse users a
little more, if history is any guide.  That's what I'd rather avoid.
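
A small illustration of the overlap point above (hypothetical brick names and
properties, just to show the semantics): with two independent dimensions the
same brick legitimately satisfies several rules at once, which no single
ordered list of bricks can express.

# User-defined key/value properties per brick, and one predicate per rule.
bricks = {
    "SSD1":  {"media-type": "ssd",  "rack": "4"},
    "SSD3":  {"media-type": "ssd",  "rack": "7"},
    "DISK1": {"media-type": "disk", "rack": "4"},
}

rules = {
    "ssd-in-rack4": lambda p: p["media-type"] == "ssd" and p["rack"] == "4",
    "ssd-anywhere": lambda p: p["media-type"] == "ssd",
    "rack4":        lambda p: p["rack"] == "4",
    "dont-care":    lambda p: True,
}

for name, props in sorted(bricks.items()):
    print(name, "->", sorted(r for r, pred in rules.items() if pred(props)))
# DISK1 matches rack4 and dont-care; SSD1 matches all four rules;
# SSD3 matches ssd-anywhere and dont-care.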


Re: [Gluster-devel] Data classification proposal

2014-06-23 Thread Dan Lambright
Rather than using the keyword "unclaimed", my instinct was to explicitly list 
which bricks have not been "claimed".  Perhaps you have something more subtle 
in mind, it is not apparent to me from your response. Can you provide an 
example of why it is necessary and a list could not be provided in its place? 
If the list is somehow "difficult to figure out", due to a particularly complex 
setup or some such, I'd prefer a CLI/GUI build that list rather than having 
sysadmins hand-edit this file.

The key-value piece seems like syntactic sugar - an "alias". If so, let the 
name itself be the alias. No notions of SSD or physical location need be 
inserted. Unless I am missing that it *is* necessary, I stand by that value 
judgement as a philosophy of not putting anything into the configuration file 
that you don't require. Can you provide an example of where it is necessary?

As to your point on filtering (which files go into which tier/group): I wrote
a little further down in the email that I do not see a way around regular
expressions within the filter-condition keyword. My understanding of your
proposal is that the select statement did not do file-name filtering; the
"filter-condition" option did. I'm OK with that.

As far as the "user stories" idea goes, that seems like a good next step.

- Original Message -
From: "Jeff Darcy" 
To: "Dan Lambright" 
Cc: "Gluster Devel" 
Sent: Monday, June 23, 2014 5:24:14 PM
Subject: Re: [Gluster-devel] Data classification proposal

> A frustrating aspect of Linux is the complexity of the /etc configuration
> file formats (rsyslog.conf, logrotate, cron, yum repo files, etc.). In that
> spirit, I would simplify the "select" in the data classification proposal
> (copied below) to accept only a list of bricks/sub-tiers with wildcards
> ('*'), rather than full-blown regular expressions or key/value pairs.

Then how does *the user* specify which files should go into which tier/group?
If we don't let them specify that in configuration, then it can only be done
in code and we've taken a choice away from them.

> I would drop the
> "unclaimed" keyword

Then how do you specify any kind of default rule for files not matched
elsewhere?  If certain files can be placed only in certain locations due to
security or compliance considerations, how would they specify the location(s)
for files not subject to any such limitation?

> and not have keywords "media type", and "rack". It does
> not seem necessary to introduce new keys for the underlying block device
> type (SSD vs disk) any more than we need to express the filesystem (XFS vs
> ext4).

The idea is to let users specify whatever criteria matter *to them*; media
type and rack/row are just examples to get them started.

> In other words, I think tiering can be fully expressed in the
> configuration file while still abstracting the underlying storage.

Yes, *tiering* can be expressed using a simpler syntax.  I was trying for
something that could also support placement policies other than strict
linear "above" vs. "below" with only the migration policies we've written
into code.

> That
> said, the configuration file could be built up by a CLI or GUI, and richer
> expressibility could exist at that level.
> 
> example:
> 
> brick host1:/brick ssd-group0-1
> 
> brick host2:/brick ssd-group0-2
> 
> brick host3:/brick disk-group0-1
> 
> rule tier-1
>   select ssd-group0*
> 
> rule tier-2
>   select disk-group0
> 
> rule all
>   select tier-1
>   # use repeated "select" to establish order
>   select tier-2
>   type features/tiering
> 
> The filtering option's regular expressions seem hard to avoid. If just the
> name of the file satisfies most use cases (that we know of?) I do not think
> there is any way to avoid regular expressions in the option for filters.
> (Down the road, if we were to allow complete flexibility in how files can be
> distributed across subvolumes, the filtering problems may start to look
> similar to 90s-era packet classification with a solution along the lines of
> the Berkeley packet filter.)
> 
> There may be different rules by which data is distributed at the "tiering"
> level. For example, one tiering policy could be the fast tier (first
> listed). It would be a "cache" for the slow tier (second listed). I think
> the "option" keyword could handle that.
> 
> rule all
>   select tier-1
># use repeated "select" to establish order
>   select tier-2
>   type features/tiering
>   option tier-cache, mode=writeback, dirty-watermark=80
> 
> Another example tiering policy could be based on compliance; when a file
> needs to become read-only, it moves from the first listed tier to the second.

Re: [Gluster-devel] Data classification proposal

2014-06-23 Thread Jeff Darcy
> A frustrating aspect of Linux is the complexity of /etc configuration file's
> formats (rsyslog.conf, logrotate, cron, yum repo files, etc) In that spirit
> I would simplify the "select" in the data classification proposal (copied
> below) to only accept a list of bricks/sub-tiers with wild-cards '*', rather
> than full-blown regular expressions or key/value pairs.

Then how does *the user* specify which files should go into which tier/group?
If we don't let them specify that in configuration, then it can only be done
in code and we've taken a choice away from them.

> I would drop the
> "unclaimed" keyword

Then how do you specify any kind of default rule for files not matched
elsewhere?  If certain files can be placed only in certain locations due to
security or compliance considerations, how would they specify the location(s)
for files not subject to any such limitation?

> and not have keywords "media type", and "rack". It does
> not seem necessary to introduce new keys for the underlying block device
> type (SSD vs disk) any more than we need to express the filesystem (XFS vs
> ext4).

The idea is to let users specify whatever criteria matter *to them*; media
type and rack/row are just examples to get them started.

> In other words, I think tiering can be fully expressed in the
> configuration file while still abstracting the underlying storage.

Yes, *tiering* can be expressed using a simpler syntax.  I was trying for
something that could also support placement policies other than strict
linear "above" vs. "below" with only the migration policies we've written
into code.

> That
> said, the configuration file could be built up by a CLI or GUI, and richer
> expressibility could exist at that level.
> 
> example:
> 
> brick host1:/brick ssd-group0-1
> 
> brick host2:/brick ssd-group0-2
> 
> brick host3:/brick disk-group0-1
> 
> rule tier-1
>   select ssd-group0*
> 
> rule tier-2
>   select disk-group0
> 
> rule all
>   select tier-1
>   # use repeated "select" to establish order
>   select tier-2
>   type features/tiering
> 
> The filtering option's regular expressions seem hard to avoid. If just the
> name of the file satisfies most use cases (that we know of?) I do not think
> there is any way to avoid regular expressions in the option for filters.
> (Down the road, if we were to allow complete flexibility in how files can be
> distributed across subvolumes, the filtering problems may start to look
> similar to 90s-era packet classification with a solution along the lines of
> the Berkeley packet filter.)
> 
> There may be different rules by which data is distributed at the "tiering"
> level. For example, one tiering policy could be the fast tier (first
> listed). It would be a "cache" for the slow tier (second listed). I think
> the "option" keyword could handle that.
> 
> rule all
>   select tier-1
># use repeated "select" to establish order
>   select tier-2
>   type features/tiering
>   option tier-cache, mode=writeback, dirty-watermark=80
> 
> Another example tiering policy could be based on compliance ; when a file
> needs to become read-only, it moves from the first listed tier to the
> second.
> 
> rule all
>select tier-1
># use repeated "select" to establish order
>select tier-2
>type features/tiering
>   option tier-retention

OK, good so far.  How would you handle the "replica 2.5" sanlock case with
the simplified syntax?  Or security-aware placement equivalent to this?

   rule secure
  select brick-0-*
  option encryption on

   rule insecure
  select brick-1-*
  option encryption off

   rule all
  select secure
  select insecure
  type features/filter
  option filter-condition-1 security-level:high
  option filter-target-1 secure
  option default-subvol insecure

In true agile fashion, maybe we should compile a set of "user stories" and
treat those as test cases for any proposed syntax.  That would need to
include at least

   * hot/cold tiering

   * HIPAA/EUPD style compliance (file must *always* or *never* be in X)

   * security-aware placement

   * multi-tenancy

   * sanlock case

I'm not trying to create complexity for its own sake.  If there's a
simpler syntax that doesn't eliminate some of these cases in favor of
tiering and nothing else, that would be great.
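
To seed that discussion, here is a very rough sketch of the multi-tenancy
story in the syntax I proposed; the exact spellings (the *tenant* tag, the
selector expression, the filter options) are illustrative, not final:

   brick host1:/brick
      tenant = acme

   rule acme-only
      select where tenant = acme

   rule shared
      select unclaimed

   rule all
      select acme-only
      select shared
      type features/filter
      option filter-condition-1 tenant:acme
      option filter-target-1 acme-only
      option default-subvol shared

Files tagged for the tenant stay on that tenant's bricks, and everything else
falls through to the shared pool, which is exactly the kind of default that
"unclaimed" is meant to express.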

> - Original Message -
> From: "Jeff Darcy" 
> To: "Gluster Devel" 
> Sent: Friday, May 23, 2014 3:30:39 PM
Subject: [Gluster-devel] Data classification proposal

Re: [Gluster-devel] Data classification proposal

2014-06-23 Thread Dan Lambright
A frustrating aspect of Linux is the complexity of /etc configuration file 
formats (rsyslog.conf, logrotate, cron, yum repo files, etc.). In that spirit I 
would simplify the "select" in the data classification proposal (copied below) 
to only accept a list of bricks/sub-tiers with wild-cards '*', rather than 
full-blown regular expressions or key/value pairs.

I would drop the "unclaimed" keyword, and not have the keywords "media-type" 
and "rack". It does not seem necessary to introduce new keys for the underlying 
block device type (SSD vs disk) any more than we need to express the filesystem 
(XFS vs ext4). In other words, I think tiering can be fully expressed in the 
configuration file while still abstracting the underlying storage. That said, 
the configuration file could be built up by a CLI or GUI, and richer 
expressibility could exist at that level.

example:

brick host1:/brick ssd-group0-1

brick host2:/brick ssd-group0-2

brick host3:/brick disk-group0-1

rule tier-1
    select ssd-group0*

rule tier-2
    select disk-group0

rule all
    select tier-1
    # use repeated "select" to establish order
    select tier-2
    type features/tiering

The filtering option's regular expressions seem hard to avoid. If just the name 
of the file satisfies most use cases (that we know of?) I do not think there is 
any way to avoid regular expressions in the option for filters. (Down the road, 
if we were to allow complete flexibility in how files can be distributed across 
subvolumes, the filtering problems may start to look similar to 90s-era packet 
classification with a solution along the lines of the Berkeley packet filter.)

There may be different rules by which data is distributed at the "tiering" 
level. For example, one tiering policy could be the fast tier (first listed). 
It would be a "cache" for the slow tier (second listed). I think the "option" 
keyword could handle that.

rule all
    select tier-1
    # use repeated "select" to establish order
    select tier-2
    type features/tiering
    option tier-cache, mode=writeback, dirty-watermark=80

Another example tiering policy could be based on compliance; when a file needs 
to become read-only, it moves from the first listed tier to the second.

rule all
    select tier-1
    # use repeated "select" to establish order
    select tier-2
    type features/tiering
    option tier-retention

- Original Message -
From: "Jeff Darcy" 
To: "Gluster Devel" 
Sent: Friday, May 23, 2014 3:30:39 PM
Subject: [Gluster-devel] Data classification proposal

One of the things holding up our data classification efforts (which include 
tiering but also other stuff as well) has been the extension of the same 
conceptual model from the I/O path to the configuration subsystem and 
ultimately to the user experience.  How does an administrator define a tiering 
policy without tearing their hair out?  How does s/he define a mixed 
replication/erasure-coding setup without wanting to rip *our* hair out?  The 
included Markdown document attempts to remedy this by proposing one out of many 
possible models and user interfaces.  It includes examples for some of the most 
common use cases, including the "replica 2.5" case we've been discussing 
recently.  Constructive feedback would be greatly appreciated.



# Data Classification Interface

The data classification feature is extremely flexible, to cover use cases from
SSD/disk tiering to rack-aware placement to security or other policies.  With
this flexibility comes complexity.  While this complexity does not affect the
I/O path much, it does affect both the volume-configuration subsystem and the
user interface to set placement policies.  This document describes one possible
model and user interface.

The model we used is based on two kinds of information: brick descriptions and
aggregation rules.  Both are contained in a configuration file (format TBD)
which can be associated with a volume using a volume option.

## Brick Descriptions

A brick is described by a series of simple key/value pairs.  Predefined keys
include:

 * **media-type**  
   The underlying media type for the brick.  In its simplest form this might
   just be *ssd* or *disk*.  More sophisticated users might use something like
   *15krpm* to represent a faster disk, or *perc-raid5* to represent a brick
   backed by a RAID controller.

 * **rack** (and/or **row**)  
   The physical location of the brick.  Some policy rules might be set up to
   spread data across more than one rack.

User-defined keys are also allowed.  For example, some users might use a
*tenant* or *security-level* tag as the basis for their placement policy.

## Aggregation Rules

Aggregation rules are used to define how bricks should be combined into
subvolumes, and those potentially combined into higher-level subvolumes, and so
on until all of the bricks are accounted for.

[Gluster-devel] Data classification proposal

2014-05-23 Thread Jeff Darcy
One of the things holding up our data classification efforts (which include 
tiering but also other stuff as well) has been the extension of the same 
conceptual model from the I/O path to the configuration subsystem and 
ultimately to the user experience.  How does an administrator define a tiering 
policy without tearing their hair out?  How does s/he define a mixed 
replication/erasure-coding setup without wanting to rip *our* hair out?  The 
included Markdown document attempts to remedy this by proposing one out of many 
possible models and user interfaces.  It includes examples for some of the most 
common use cases, including the "replica 2.5" case we've been discussing 
recently.  Constructive feedback would be greatly appreciated.



# Data Classification Interface

The data classification feature is extremely flexible, to cover use cases from
SSD/disk tiering to rack-aware placement to security or other policies.  With
this flexibility comes complexity.  While this complexity does not affect the
I/O path much, it does affect both the volume-configuration subsystem and the
user interface to set placement policies.  This document describes one possible
model and user interface.

The model we used is based on two kinds of information: brick descriptions and
aggregation rules.  Both are contained in a configuration file (format TBD)
which can be associated with a volume using a volume option.

## Brick Descriptions

A brick is described by a series of simple key/value pairs.  Predefined keys
include:

 * **media-type**  
   The underlying media type for the brick.  In its simplest form this might
   just be *ssd* or *disk*.  More sophisticated users might use something like
   *15krpm* to represent a faster disk, or *perc-raid5* to represent a brick
   backed by a RAID controller.

 * **rack** (and/or **row**)  
   The physical location of the brick.  Some policy rules might be set up to
   spread data across more than one rack.

User-defined keys are also allowed.  For example, some users might use a
*tenant* or *security-level* tag as the basis for their placement policy.
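
For example (the file format is still TBD, so treat this as a sketch), a
single brick description mixing predefined and user-defined keys might look
like:

    brick host4:/brick
        media-type = 15krpm
        rack = r3
        tenant = acme
        security-level = high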

## Aggregation Rules

Aggregation rules are used to define how bricks should be combined into
subvolumes, and those potentially combined into higher-level subvolumes, and so
on until all of the bricks are accounted for.  Each aggregation rule consists
of the following parts:

 * **id**  
   The base name of the subvolumes the rule will create.  If a rule is applied
   multiple times this will yield *id-0*, *id-1*, and so on.

 * **selector**  
   A "filter" for which bricks or lower-level subvolumes the rule will
   aggregate.  This is an expression similar to a *WHERE* clause in SQL, using
   brick/subvolume names and properties in lieu of columns.  These values are
   then matched against literal values or regular expressions, using the usual
   set of boolean operators to arrive at a *yes* or *no* answer to the question
   of whether this brick/subvolume is affected by this rule.

 * **group-size** (optional)  
   The number of original bricks/subvolumes to be combined into each produced
   subvolume.  The special default value zero means to collect all original
   bricks or subvolumes into one final subvolume.  In this case, *id* is used
   directly instead of having a numeric suffix appended.

 * **type** (optional)  
   The type of the generated translator definition(s).  Examples might include
   "AFR" to do replication, "EC" to do erasure coding, and so on.  The more
   general data classification task includes the definition of new translators
   to do tiering and other kinds of filtering, but those are beyond the scope
   of this document.  If no type is specified, cluster/dht will be used to do
   random placement among its constituents.

 * **tag** and **option** (optional, repeatable)  
   Additional tags and/or options to be applied to each newly created
   subvolume.  See the "replica 2.5" example to see how this can be used.

Since each type might have unique requirements, such as ensuring that
replication is done across machines or racks whenever possible, it is assumed
that there will be corresponding type-specific scripts or functions to do the
actual aggregation.  This might even be made pluggable some day (TBD).  Once
all rule-based aggregation has been done, volume options are applied similarly
to how they are now.
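
As a sketch of how the pieces fit together (the selector and tag syntax shown
here is illustrative, not final), a rule using most of the parts above might
read:

    rule fast
        selector media-type = ssd       # WHERE-like expression over brick keys
        group-size 2                    # pair matching bricks two at a time
        type cluster/afr                # replicate within each pair
        tag tier = hot                  # extra tag visible to later rules

Applied to four matching bricks, this would produce two replica pairs named
*fast-0* and *fast-1*, which a later rule could select either by name or by
the *tier* tag.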

Astute readers might have noticed that it's possible for a brick to be
aggregated more than once.  This is intentional.  If a brick is part of
multiple aggregates, it will be automatically split into multiple bricks
internally but this will be invisible to the user.

## Examples

Let's start with a simple tiering example.  Here's what the data-classification
config file might look like.

brick host1:/brick
media-type = ssd

brick host2:/brick
media-type = disk

brick host3:/brick
media-type = disk

rule tier-1