Hi Arun,
The vignette is very helpful; the ?data.table help page is so rich and dense
that I end up wandering quite a bit. The vignette does a nice job laying it
out logically. I'm sure it has been a huge effort.
> DT[, .SD[.N], by=month] # ~ since .N contains the number of observations in
> this group
Doh! Now I see it is stated clearly right under my nose: ".N is a special
in-built variable that holds the number of observations in the current group."
I'm not sure why I thought .N was for the original data.table instead of the
grouping.
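For anyone else who trips over the same thing, here is a minimal sketch of the per-group behaviour of .N. It uses a made-up toy table (the column name delay and its values are assumptions for illustration, not the flights data), and it also includes the .I variant from Arun's reply quoted below.

```r
library(data.table)

# Toy data (hypothetical, not the flights data): three groups of unequal size.
DT <- data.table(month = c(1L, 1L, 1L, 2L, 2L, 3L),
                 delay = c(5, 7, 9, 2, 4, 6))

# With by=, .N is the number of rows in the current group ...
counts <- DT[, .N, by = month]

# ... so .SD[.N] picks the last row of each group, like tail(.SD, 1L).
last_by_N    <- DT[, .SD[.N], by = month]
last_by_tail <- DT[, tail(.SD, 1L), by = month]

# Without by=, .N is the number of rows in the whole table.
total <- DT[, .N]

# .I holds row numbers in the original table, so .I[.N] is the
# position of each group's last row; subsetting by it avoids .SD[...].
idx <- DT[, .I[.N], by = month]$V1
last_by_I <- DT[idx]
```

On the real flights table the same three expressions should select the same rows; only the timing is expected to differ.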
I have switched my script to use the above and it is lightning fast now. I'm
going to start wearing a seatbelt and helmet...
Thank you!
Ben
P.S. How did you get download.file to read an https URL for the example?
Here's what I get...
flights <-
fread("https://raw.githubusercontent.com/wiki/arunsrinivasan/flights/NYCflights14/flights14.csv")
Error in download.file(input, tt, mode = "wb") : unsupported URL scheme
So, I downloaded it manually using a browser and used my local copy instead.
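The manual download can also be scripted. What follows is only a sketch under assumptions: the helper name fetch_csv is made up, and whether a given download method handles https depends on the R build and platform ("libcurl" needs an R built with libcurl support, see capabilities("libcurl"); "curl" or "wget" need those command-line tools on the PATH).

```r
library(data.table)

# Hypothetical helper: download a CSV to a temporary file using an explicitly
# chosen download method, then read the local copy with fread().
fetch_csv <- function(url, method = "libcurl") {
  dest <- tempfile(fileext = ".csv")
  on.exit(unlink(dest), add = TRUE)   # remove the temporary copy afterwards
  status <- download.file(url, destfile = dest, method = method,
                          mode = "wb", quiet = TRUE)
  if (status != 0L) stop("download failed with status ", status)
  fread(dest)
}

# Usage (needs network access, so it is left commented out):
# flights <- fetch_csv("https://raw.githubusercontent.com/wiki/arunsrinivasan/flights/NYCflights14/flights14.csv")
```

If no method on the machine supports https, downloading in a browser and calling fread() on the local file, as above, remains the fallback.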
On Jan 22, 2015, at 2:11 PM, Arunkumar Srinivasan <[email protected]> wrote:
> Ben,
>
> Great to hear that you're going through the vignette.
>
> To get the last row, you can similarly do:
>
> DT[, tail(.SD, 1L), by=month] # ~ as you say
> DT[, .SD[.N], by=month] # ~ since .N contains the number of observations in
> this group
> DT[, .SD[(.N-1L):.N], by=month] # ~ last two rows per group
>
> However, `.SD[...]` per group is slightly slower (especially with many groups)
> as it has to go through `[.data.table` once per group (`[` is an S3 generic, so
> each call pays for method dispatch and the method's own overhead, which
> becomes noticeable when there are many groups), and not all cases are optimised.
>
> You can also use `.I` (which is deliberately not mentioned in the vignette to
> keep things smooth and straightforward). Using it you could do:
>
> idx = DT[, .I[1L], by=month][, V1]
> DT[idx]
>
> `.I` contains the row numbers in `x`, the original data.table (they don't
> reset per group). So we can get the row index of the first element of each
> group, and then simply subset. We hope to improve this subset in the future
> (to take care of this optimisation internally).
>
> Similarly:
>
> idx = DT[, .I[.N], by=month][, V1]
> DT[idx]
>
> will get the last element for each group.
>
> Otherwise, how do you find the vignette so far?
>
> HTH,
> Arun
>
>
>
> On Thu, Jan 22, 2015 at 6:11 PM, Ben Tupper <[email protected]> wrote:
> Hello,
>
> I have been learning to use data.table and studying the vignette located
> here...
>
> https://rawgit.com/wiki/Rdatatable/data.table/vignettes/datatable-intro-vignette.html
>
> Section 2f. shows how to subset a data.table to select an arbitrary number of
> rows in each .SD. That's really handy.
>
> 2. Aggregations
> f. Subset .SD for each group: ans <- flights[, head(.SD, 2), by=month]
>
> In a similar way, I can get the last row of the .SD using either tail, nrow
> or dim (I don't think it matters much, but dim seems to be faster*).
>
> ans <- flights[,.SD[dim(.SD)[1]], by=month]
>
> I got to wondering if the number of rows in .SD might be exposed in each
> grouping iteration. Is there an equivalent to .N for the subset data.table,
> .SD? Something like .SDN or the like?
>
> Thanks for data.table!
>
> Ben
>
> * After reading this discussion
> http://r.789695.n4.nabble.com/What-is-the-fastest-way-to-determine-that-data-table-is-empty-td4638348.html#a4638451
> I tried out a couple of methods for getting the last element of a grouping
> using nrow(), tail() and dim().
>
> # using tail
> > microbenchmark( last1 <- flights[, tail(.SD, 1), by=month] )
> Unit: milliseconds
>                                          expr      min       lq     mean   median       uq      max neval
>  last1 <- flights[, tail(.SD, 1), by = month] 16.65898 16.89704 18.26415 17.37007 19.20147 40.12966   100
>
> # using dim
> > microbenchmark( last2 <- flights[,.SD[dim(.SD)[1]], by=month] )
> Unit: milliseconds
>                                              expr      min       lq     mean   median       uq      max neval
>  last2 <- flights[, .SD[dim(.SD)[1]], by = month] 15.51243 15.87788 17.40978 16.19426 17.83308 59.22429   100
>
> # using nrow
> > microbenchmark( last3 <- flights[,.SD[nrow(.SD)], by=month] )
> Unit: milliseconds
>                                            expr      min       lq     mean   median       uq      max neval
>  last3 <- flights[, .SD[nrow(.SD)], by = month] 15.63919 15.92073 17.28836 16.52588 18.33867 24.92624   100
>
> > identical(last1, last2)
> [1] TRUE
> > identical(last1, last3)
> [1] TRUE
>
> Ben Tupper
> Bigelow Laboratory for Ocean Sciences
> 60 Bigelow Drive, P.O. Box 380
> East Boothbay, Maine 04544
> http://www.bigelow.org
>
> _______________________________________________
> datatable-help mailing list
> [email protected]
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>