Bengoechea Bartolomé Enrique (SIES 73) wrote:
Very good points. They closely match the current prototype I have
written...
Starting by working on an interface for such object(s) is probably
the first step toward a unified solution
Agree. Getting a good API is always the most important step.
Dimension-level is what seems to be the most needed...
True, and that was Henrik's original suggestion.
But I find all three
are closely related to the same topic (metadata) and as such deserve
to be worked out together, but if most people agree otherwise, the
direction is clear.
- Object-level, if not linked to any dimension-attribute, is just
saying that one wants to attach anything to any object. That's what
attr() is already doing.
Except that plain attributes are dropped when subsetting. I've found
myself dozens of times creating classes just to create a `[` method
for them that preserves some attributes. This looks like such a
common situation that having a mechanism to avoid the user
programming the same stuff again and again would be handy.
I see. I never faced the issue, but I agree that this can be somewhat
counter-intuitive.
Thinking about it, it seems natural nowadays to consider
attributes-associated objects as a kind of prototype-based programming
(and "[" to keep the attributes - although it does somehow consider
special attributes such as "dim", "names", "dimnames").
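To make the point concrete, here is a minimal sketch of the behaviour described above. The class name "keepattr" and the "units" attribute are made up for illustration; they are not part of any existing package or of the prototype under discussion.

```r
# A plain attribute is lost as soon as the object is subsetted.
x <- structure(1:10, units = "cm")
attributes(x[1:3])  # NULL

# Hypothetical class "keepattr": its `[` method puts the attribute back.
`[.keepattr` <- function(x, ...) {
  out <- NextMethod()                     # default subsetting drops "units"
  attr(out, "units") <- attr(x, "units")  # restore the attribute we care about
  class(out) <- class(x)                  # restore the class as well
  out
}

y <- structure(1:10, units = "cm", class = "keepattr")
attr(y[1:3], "units")
```

This is the boilerplate that has to be rewritten for every such class, which is what a shared mechanism would avoid.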
- Cell-level is maybe out of scope for a first trial (but maybe
I missed the use cases for it)
Although I agree that cell-level is far less common, here are a
couple of use cases I've hit recently:
1) the array represents time series in columns. The original data
comes in a different frequency for each column, with some data
missing. After aligning to a common frequency and interpolating missing
values, I needed a factor array of the same dimension as the data
array identifying whether each observation corresponded to the actual
original series, or had been interpolated, and whether interpolation
was due to missing data or to frequency alignment. Of course, I
needed the factor array to be subsetted together with the array.
In that respect, and as you outline it, this is then like
"stacking"/"putting side-by-side" arrays of identical dimensions. Your
time series data is in one array, the origin of the observation in
another...
I would see that as a separate data structure (that could implement the
metadata interface we are discussing).
2) the array is a table representing data to be formatted by a
reporting system (Sweave, R2HTML, etc), similar to the 'xtable'
class. So I needed to associate formatting information to each
individual "cell" (font, color, borders...), as well as to each
dimension and to the whole table.
Anyway, it's far easier to add "cell-level" metadata on top of the
other features with a new class: for `[` subscripting just call
NextMethod() and then apply the same indexes to the object storing
the cell-level metadata. But I still think it's useful to work out
a data object's metadata at all possible levels with a unified
interface.
I understand the use cases, but I can't stop thinking that this
should be separated from the dimension-associated metadata.
In the examples above, the data structures are two-dimensional and
therefore dimension-associated metadata will be for "rows" and for
"columns"; all the cells in a table/array as a sequence are not mapped
to any *dimension*.
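For reference, the NextMethod() approach described above for cell-level metadata could look roughly like this. It is only a sketch: the class name "cellmeta", its constructor, and the "origin" attribute are hypothetical, and a character matrix stands in for the factor array of the time-series use case.

```r
# Hypothetical class "cellmeta": data plus a same-shaped array of
# per-cell metadata stored in an "origin" attribute.
cellmeta <- function(data, origin) {
  stopifnot(identical(dim(data), dim(origin)))
  structure(data, origin = origin, class = "cellmeta")
}

`[.cellmeta` <- function(x, ...) {
  out <- NextMethod()                            # subset the data as usual
  attr(out, "origin") <- attr(x, "origin")[...]  # same indexes on the metadata
  class(out) <- class(x)
  out
}

d <- matrix(1:9, nrow = 3)
o <- matrix(rep(c("observed", "interpolated", "aligned"), 3), nrow = 3)
x <- cellmeta(d, o)
s <- x[1:2, 2:3]
attr(s, "origin")
```

Nothing here depends on the dimension-level metadata, which fits the view that the two can be kept separate.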
About the subscripting `[` methods, I don't see the need to modify
`[<-` for arrays, as out-of-bound indexes generate errors with arrays
(unlike vectors or data frames), so `[<-` would only replace data and
leave metadata untouched. Am I missing something?
That's what I am thinking.
I bundled "[" with "[<-" to specify that the way indexing is done would
remain the same (for a second I considered that someone thought of
somehow indexing on the names of the dimensions, or on the metadata).
maybe a function called "dimmeta()" (for consistency with
"dimnames()") ?
I'm using 'dimdata' in my current prototype, and Henrik suggested
'dimattr', but I really like your proposal more.
the colour of the bikeshed
Wrappers to the two first elements of 'dimmeta' for 2-dim arrays
could be added in the same vein as 'rownames' and 'colnames':
'rowmeta' and 'colmeta'.
Yes. That's the spirit.
The signature could be dimmeta(x, i), with x the object,
For consistency with 'dimnames', the 'i' argument could be dropped
and use dimmeta(x)[[i]] instead...
I thought about that, but also thought that it could have implications
on the actual storage of those metadata. In the case the metadata are
stored in a list, that interface enforces the building of a list.
(I said to ignore implementation for now, but paradoxically this made me
consider possible implementations).
Let's ignore that and go for consistency first (there will always be
time to come back on that and make backward compatible changes, such as
dimmeta(x, i=NULL) # return the list if i is NULL ).
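A rough sketch of what such accessors might look like, purely to fix ideas. Storing the metadata as a list in a "dimmeta" attribute is an assumption made for this example only; the function names follow the ones proposed in this thread, and none of this exists in R today.

```r
# Hypothetical accessor: returns the whole list when i is NULL,
# otherwise the metadata of dimension i.
dimmeta <- function(x, i = NULL) {
  meta <- attr(x, "dimmeta", exact = TRUE)
  if (is.null(i)) meta else meta[[i]]
}

# Hypothetical replacement function, used as dimmeta(x, i) <- value.
`dimmeta<-` <- function(x, i = NULL, value) {
  if (is.null(i)) {
    attr(x, "dimmeta") <- value
  } else {
    meta <- attr(x, "dimmeta", exact = TRUE)
    if (is.null(meta)) meta <- vector("list", length(dim(x)))
    meta[i] <- list(value)  # list() so that NULL clears the slot, not the element
    attr(x, "dimmeta") <- meta
  }
  x
}

# Wrappers in the same vein as rownames() and colnames().
rowmeta <- function(x) dimmeta(x, 1L)
colmeta <- function(x) dimmeta(x, 2L)

m <- matrix(1:6, nrow = 2)
dimmeta(m, 1) <- data.frame(id = c("a", "b"), stringsAsFactors = FALSE)
dimmeta(m, 2) <- data.frame(unit = c("x", "y", "z"), stringsAsFactors = FALSE)
rowmeta(m)
```

With this shape, dimmeta(x)[[i]] and dimmeta(x, i) give the same result, so both calling styles stay available.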
Other standard generics to be affected would be:
* rbind & cbind for 2-dim arrays/matrices: they should combine the
metadata, and for dimension-sensitive metadata can be modelled upon
what is done with dimnames: use rowmeta (colmeta) of the first object
with them in cbind (rbind), and combine colmeta (rowmeta) of all
objects with them, filling with NAs/NULLs/.. for non
metadata-sensitive objects being combined. An issue of coercing
dimmeta of different classes may arise.
Maybe it is good to be trigger-happy for a first pass ( stop("mismatching
meta data - sorry") )... and mix-and-match use cases might be fewer.
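As an illustration of the strict first pass, something along these lines would do. The helper name cbind_meta() and the "dimmeta" attribute layout (a list with one data.frame per dimension) are assumptions for this sketch, not an agreed design.

```r
# Hypothetical helper: column-bind two matrices carrying a "dimmeta"
# attribute, refusing to guess when the row metadata disagree.
cbind_meta <- function(a, b) {
  ma <- attr(a, "dimmeta", exact = TRUE)
  mb <- attr(b, "dimmeta", exact = TRUE)
  if (!identical(ma[[1]], mb[[1]]))
    stop("mismatching meta data - sorry")
  out <- cbind(unclass(a), unclass(b))   # cbind() drops the extra attributes
  attr(out, "dimmeta") <- list(ma[[1]],                 # rows: shared metadata
                               rbind(ma[[2]], mb[[2]])) # columns: stacked
  out
}

rows <- data.frame(id = 1:2)
a <- structure(matrix(1:4, 2), dimmeta = list(rows, data.frame(unit = c(10, 20))))
b <- structure(matrix(5:8, 2), dimmeta = list(rows, data.frame(unit = c(30, 40))))
ab <- cbind_meta(a, b)
attr(ab, "dimmeta")[[2]]
```

Relaxing the check later (filling with NAs, coercing classes) would be a backward-compatible change.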
* `dim<-`, but this may raise the same problem of coercing dimmeta of
different classes.
Disabling "dim<-" is, I think, choosing sanity for now.
...and I agree with the rest of your comments.
Same for me (about your comments).
This thread seems to be leading to something great.
L.
Best,
Enrique
-----Original Message-----
From: Laurent Gautier [mailto:lgaut...@gmail.com]
Sent: Thursday, July 09, 2009 14:15
Cc: Heinz Tuechler; Bengoechea Bartolomé Enrique (SIES 73); Tony Plate; Henrik Bengtsson; r-devel@r-project.org
Subject: Re: [Rd] Suggestion: Dimension-sensitive attributes
Starting by working on an interface for such object(s) is probably
the first step toward a unified solution, and this before thinking about
if and how R attributes are used.
It would also help to ensure a smooth transition from the existing
classes implementing a similar solution (first the interface is added
to those classes, then after a grace period the classes are
eventually refactored).
Dimension-level is what seems to be the most needed... but I am not
convinced of the practicality of the object-level and cell-level
schemes proposed:
- Object-level, if not linked to any dimension-attribute, is just
saying that one wants to attach anything to any object. That's what
attr() is already doing.
- Cell-level is maybe out of scope for a first trial (but maybe
I missed the use cases for it)
If starting with behaviour, it seems to boil down to having "["/"[<-" and
"dimmeta()"/"dimmeta<-()":
- extract "[" / replace "[<-" :
* keeps working the way it already does
* extracts a subset of the object as well as a subset of the
dimension-associated metadata.
* departing too much from the way "[" is working and adding
behind-the-curtain name matching will only compromise the chances of
adoption.
* forget about the bit about which metadata is kept and which one
isn't when using "[". Make a function "unmeta()" (similar behavior to
"unname()") to drop them all, or work it out with something like
dimmeta(x, 1) <- NULL # drop the metadata associated with dimension 1
- access the dimension-associated metadata:
* maybe a function called "dimmeta()" (for consistency with
"dimnames()") ? The signature could be dimmeta(x, i), with x the
object, and i the dimension requested. A replace function
"dimmeta<-"(x, i, value) would be provided.
In the abstract the "names" associated with a given dimension is just
one of possible metadata, but I'd keep away from meddling with it
for a start.
It would seem natural that metadata associated with one dimension
would be a table-like object (data.frame seems natural in R, and
unfortunately there is no data.frame-like structure in R).
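To illustrate the behaviour outlined above, here is a minimal sketch for the two-dimensional case. The class name "metaarray" and the "dimmeta" attribute (a list holding one data.frame per dimension) are hypothetical, and only numeric or logical indexes are handled.

```r
# Hypothetical class "metaarray": "[" subsets the data and the
# dimension-associated metadata together.
`[.metaarray` <- function(x, i, j) {
  meta <- attr(x, "dimmeta", exact = TRUE)
  data <- unclass(x)
  attr(data, "dimmeta") <- NULL
  if (missing(i)) i <- seq_len(nrow(data))
  if (missing(j)) j <- seq_len(ncol(data))
  out <- data[i, j, drop = FALSE]
  attr(out, "dimmeta") <- list(meta[[1]][i, , drop = FALSE],  # row metadata
                               meta[[2]][j, , drop = FALSE])  # column metadata
  class(out) <- "metaarray"
  out
}

# Hypothetical unmeta(), similar in spirit to unname(): drop all metadata.
unmeta <- function(x) {
  attr(x, "dimmeta") <- NULL
  unclass(x)
}

m <- structure(matrix(1:6, nrow = 2),
               dimmeta = list(data.frame(id = 1:2),
                              data.frame(unit = c(10, 20, 30))),
               class = "metaarray")
s <- m[2, c(1, 3)]
attr(s, "dimmeta")
unmeta(s)
```

The indexing itself is unchanged; the method only carries the matching rows of each metadata table along with the data.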
L.
______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel