Thank you David - it took me awhile to get back to this and dig into
it. It's clever to imitate gtools::mixedorder() as far as possible.
A few comments:
1. It took me a while to understand why you picked 3899 in your
Roman-to-integer table; it's because roman(x) is NA for x > 3899.
(BTW, in 'utils', there's utils:::.roman2numeric() which could be
utilized, but it's currently internal.)
2. I think you forgot D=500 and M=1000.
3. There was a typo in your code; I think you meant rank.roman instead
of rank.numeric in one place.
4. The idea behind nonnumeric() is to identify non-numeric substrings
by is.na(as.numeric()). Unfortunately, for romans that does not work.
Instead, we need to use is.na(numeric(x)) here, i.e.
nonnumeric <- function(x) {
suppressWarnings(ifelse(is.na(numeric(x)), toupper(x), NA))
}
Actually, gtools::mixedorder() could use the same.
5. I undid your ".numeric" to ".roman" to minimize any differences to
gtools::mixedorder().
With the above fixes, we now have:
mixedorderRoman <- function (x)
{
if (length(x) < 1)
return(NULL)
else if (length(x) == 1)
return(1)
if (is.numeric(x))
return(order(x))
delim = "\\$\\@\\$"
# NOTE: Note that as.roman(x) is NA for x > 3899
romanC <- as.character( as.roman(1:3899) )
numeric <- function(x) {
suppressWarnings(match(x, romanC))
}
nonnumeric <- function(x) {
suppressWarnings(ifelse(is.na(numeric(x)), toupper(x),
NA))
}
x <- as.character(x)
which.nas <- which(is.na(x))
which.blanks <- which(x == "")
if (length(which.blanks) > 0)
x[which.blanks] <- -Inf
if (length(which.nas) > 0)
x[which.nas] <- Inf
delimited <- gsub("([IVXCLM]+)",
paste(delim, "\\1", delim, sep = ""), x)
step1 <- strsplit(delimited, delim)
step1 <- lapply(step1, function(x) x[x > ""])
step1.numeric <- lapply(step1, numeric)
step1.character <- lapply(step1, nonnumeric)
maxelem <- max(sapply(step1, length))
step1.numeric.t <- lapply(1:maxelem, function(i) sapply(step1.numeric,
function(x) x[i]))
step1.character.t <- lapply(1:maxelem, function(i) sapply(step1.character,
function(x) x[i]))
rank.numeric <- sapply(step1.numeric.t, rank)
rank.character <- sapply(step1.character.t, function(x)
as.numeric(factor(x)))
rank.numeric[!is.na(rank.character)] <- 0
rank.character <- t(t(rank.character) + apply(matrix(rank.numeric),
2, max, na.rm = TRUE))
rank.overall <- ifelse(is.na(rank.character), rank.numeric,
rank.character)
order.frame <- as.data.frame(rank.overall)
if (length(which.nas) > 0)
order.frame[which.nas, ] <- Inf
retval <- do.call("order", order.frame)
return(retval)
}
The difference to gtools::mixedorder() is minimal:
< romanC <- as.character( as.roman(1:3899) )
21c11
< suppressWarnings(match(x, romanC))
---
> suppressWarnings(as.numeric(x))
24c14
< suppressWarnings(ifelse(is.na(numeric(x)), toupper(x),
---
> suppressWarnings(ifelse(is.na(as.numeric(x)), toupper(x),
34c24
< delimited <- gsub("([IVXCLDM]+)",
---
> delimited <-
> gsub("([+-]{0,1}[0-9]+\\.{0,1}[0-9]*([eE][\\+\\-]{0,1}[0-9]+\\.{0,1}[0-9]*){0,1})",
59,62d48
This difference is so small that the above could now be an option to
mixedorder() with minimal overhead added, e.g. mixedorder(y,
type=c("decimal", "roman")). One could even imagine adding support
for "binary", "octal" and "hexadecimal" (not done).
Greg (maintainer of gtools; cc:ed), is this something you would
consider adding to gtools? I've modified the gtools source code
available on CRAN (that's the only source I found), added package
tests, updated the Rd and verified it passes R CMD check. If
interested, please find the updates at:
https://github.com/HenrikBengtsson/gtools/compare/cran:master...master
Thanks
Henrik
On Tue, Aug 26, 2014 at 6:46 PM, David Winsemius <[email protected]> wrote:
>
> On Aug 26, 2014, at 5:24 PM, Henrik Bengtsson wrote:
>
>> Hi,
>>
>> does anyone know of an implementation/function that sorts strings that
>> *contain* roman numerals (I, II, III, IV, V, ...) which are treated as
>> numbers. In 'gtools' there is mixedsort() which does this for strings
>> that contains (decimal) numbers. I'm looking for a "mixedsortroman()"
>> function that does the same but with roman numbers, e.g.
>
> It's pretty easy to sort something you know to be congruent with the existing
> roman class:
>
> romanC <- as.character( as.roman(1:3899) )
> match(c("I", "II", "III","X","V"), romanC)
> #[1] 1 2 3 10 5
>
> But I guess you already know that, so you want a regex approach to parsing.
> Looking at the path taken by Warnes, it would involve doing something like
> his regex based insertion of a delimiter for "Roman numeral" but simpler
> because he needed to deal with decimal points and signs and exponent
> notation, none of which you appear to need. If you only need to consider
> character and Roman, then this hack of Warnes tools succeeds:
>
> mixedorderRoman <- function (x)
> {
> if (length(x) < 1)
> return(NULL)
> else if (length(x) == 1)
> return(1)
> if (is.numeric(x))
> return(order(x))
> delim = "\\$\\@\\$"
> roman <- function(x) {
> suppressWarnings(match(x, romanC))
> }
> nonnumeric <- function(x) {
> suppressWarnings(ifelse(is.na(as.numeric(x)), toupper(x),
> NA))
> }
> x <- as.character(x)
> which.nas <- which(is.na(x))
> which.blanks <- which(x == "")
> if (length(which.blanks) > 0)
> x[which.blanks] <- -Inf
> if (length(which.nas) > 0)
> x[which.nas] <- Inf
> delimited <- gsub("([IVXCL]+)",
> paste(delim, "\\1", delim, sep = ""), x)
> step1 <- strsplit(delimited, delim)
> step1 <- lapply(step1, function(x) x[x > ""])
> step1.roman <- lapply(step1, roman)
> step1.character <- lapply(step1, nonnumeric)
> maxelem <- max(sapply(step1, length))
> step1.roman.t <- lapply(1:maxelem, function(i) sapply(step1.roman,
> function(x) x[i]))
> step1.character.t <- lapply(1:maxelem, function(i) sapply(step1.character,
> function(x) x[i]))
> rank.roman <- sapply(step1.roman.t, rank)
> rank.character <- sapply(step1.character.t, function(x)
> as.numeric(factor(x)))
> rank.roman[!is.na(rank.character)] <- 0
> rank.character <- t(t(rank.character) + apply(matrix(rank.roman),
> 2, max, na.rm = TRUE))
> rank.overall <- ifelse(is.na(rank.character), rank.numeric,
> rank.character)
> order.frame <- as.data.frame(rank.overall)
> if (length(which.nas) > 0)
> order.frame[which.nas, ] <- Inf
> retval <- do.call("order", order.frame)
> return(retval)
> }
>
> y[mixedorderRoman(y)]
> [1] "chr I" "chr II" "chr III" "chr IV" "chr IX"
> [6] "chr V" "chr VI" "chr VII" "chr VIII" "chr X"
> [11] "chr XI" "chr XII"
>
>
> --
> David.
>>
>> ## DECIMAL NUMBERS
>>> x <- sprintf("chr %d", 12:1)
>>> x
>> [1] "chr 12" "chr 11" "chr 10" "chr 9" "chr 8"
>> [6] "chr 7" "chr 6" "chr 5" "chr 4" "chr 3"
>> [11] "chr 2" "chr 1"
>>
>>> sort(x)
>> [1] "chr 1" "chr 10" "chr 11" "chr 12" "chr 2"
>> [6] "chr 3" "chr 4" "chr 5" "chr 6" "chr 7"
>> [11] "chr 8" "chr 9"
>>
>>> gtools::mixedsort(x)
>> [1] "chr 1" "chr 2" "chr 3" "chr 4" "chr 5"
>> [6] "chr 6" "chr 7" "chr 8" "chr 9" "chr 10"
>> [11] "chr 11" "chr 12"
>>
>>
>> ## ROMAN NUMBERS
>>> y <- sprintf("chr %s", as.roman(12:1))
>>> y
>> [1] "chr XII" "chr XI" "chr X" "chr IX"
>> [5] "chr VIII" "chr VII" "chr VI" "chr V"
>> [9] "chr IV" "chr III" "chr II" "chr I"
>>
>>> sort(y)
>> [1] "chr I" "chr II" "chr III" "chr IV"
>> [5] "chr IX" "chr V" "chr VI" "chr VII"
>> [9] "chr VIII" "chr X" "chr XI" "chr XII"
>>
>>> mixedsortroman(y)
>> [1] "chr I" "chr II" "chr III" "chr IV"
>> [5] "chr V" "chr VI" "chr VII" "chr VIII"
>> [9] "chr IX" "chr X" "chr XI" "chr XII"
>>
>> The latter is what I'm looking for.
>>
>> Before hacking together something myself (e.g. identify roman numerals
>> substrings, translate them to decimal numbers, use gtools::mixedsort()
>> to sort them and then translate them back to roman numbers), I'd like
>> to hear if someone already has this implemented/know of a package that
>> does this.
>>
>> Thanks,
>>
>> Henrik
>>
>> ______________________________________________
>> [email protected] mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
> David Winsemius
> Alameda, CA, USA
>
______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.