Re: [datatable-help] number of rows selected in .SD subset

Ben Tupper Thu, 22 Jan 2015 11:36:12 -0800

Hi Arun,

The vignette is very helpful; the ?data.table help page is so rich and dense 
that I end up wandering quite a bit.  The vignette does a nice job laying it 
out logically.  I'm sure it has been a huge effort.


> DT[, .SD[.N], by=month] # ~ since .N contains the number of observations in this group


Doh! Now I see it is stated clearly right under my nose: ".N is a special 
in-built variable that holds the number of observations in the current group."  
I'm not sure why I thought .N was for the original data.table instead of the 
grouping.
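
For anyone else who trips over the same thing, here is a minimal sketch of the difference. It uses a made-up toy table rather than flights14; the `delay` column and all values are invented for illustration.

```r
library(data.table)

# Toy table; the values are invented for illustration only
DT <- data.table(month = c(1L, 1L, 1L, 2L, 2L),
                 delay = c(10, 20, 30, 40, 50))

# Without by=, .N is the row count of the whole table
n_all <- DT[, .N]

# With by=, .N is evaluated once per group and holds that group's row count
n_grp <- DT[, .N, by = month]

# Hence .SD[.N] picks the last row of each group
last <- DT[, .SD[.N], by = month]
```

Here `n_grp` has one row per month, and `last` has the final `delay` value for each month.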

I have switched my script to use the above and it is lightning fast now.  I'm 
going to start wearing a seatbelt and helmet...
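
The `.I` route from the quoted message below should select the same rows. A small sketch on a toy table (again, invented values, not flights14) to check that the two approaches agree:

```r
library(data.table)

# Toy table with interleaved groups; values are invented for illustration
DT <- data.table(month = c(2L, 1L, 2L, 1L, 1L),
                 delay = c(10, 20, 30, 40, 50))

# .I holds row numbers within the whole table (it does not reset per group),
# so .I[.N] is the table-wide index of each group's last row
idx <- DT[, .I[.N], by = month]$V1
via_I <- DT[idx]

# Same selection through .SD, which calls `[.data.table` once per group
via_SD <- DT[, .SD[.N], by = month]
```

Both results list the groups in order of first appearance, so the rows line up one to one.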

Thank you!
Ben

P.S.  How did you get download.file to read an https URL for the example?  
Here's what I get...
flights <- 
fread("https://raw.githubusercontent.com/wiki/arunsrinivasan/flights/NYCflights14/flights14.csv")
Error in download.file(input, tt, mode = "wb") : unsupported URL scheme
So, I downloaded it manually using a browser and used my local copy instead.
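
A scripted version of that workaround might look like the sketch below. It is only a guess at what would help: `fread_via_download` is a hypothetical helper (not part of data.table), and it assumes a `curl` binary is on the PATH so that `download.file(method = "curl")` can handle https.

```r
library(data.table)

# Hypothetical helper, not part of data.table: fetch the file to a temporary
# location with an explicit download method, then fread the local copy.
fread_via_download <- function(url, method = "curl") {
  tmp <- tempfile(fileext = ".csv")
  on.exit(unlink(tmp))  # remove the temporary copy when done
  # method = "curl" shells out to the curl binary (assumed to be installed)
  status <- download.file(url, tmp, method = method, mode = "wb", quiet = TRUE)
  if (status != 0L) stop("download failed with status ", status)
  fread(tmp)
}

# Intended use (not run here, needs network access):
# flights <- fread_via_download("https://raw.githubusercontent.com/wiki/arunsrinivasan/flights/NYCflights14/flights14.csv")
```

Whether this works depends on the platform and on which download methods the local R build supports.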




On Jan 22, 2015, at 2:11 PM, Arunkumar Srinivasan <[email protected]> wrote:

> Ben,
> 
> Great to hear that you're going thro' the vignette..
> 
> To get the last row, you can similarly do:
> 
> DT[, tail(.SD, 1L), by=month] # ~ as you say
> DT[, .SD[.N], by=month] # ~ since .N contains the number of observations in this group
> DT[, .SD[(.N-1L):.N], by=month] # ~ last two rows per group
> 
> However, `.SD[...]` per group is slightly slower (especially with many groups), 
> as each call has to go through `[.data.table` (`[` is an S3 generic, and 
> dispatching to the right method takes time.. which can get noticeable when 
> there are many groups), and not all cases are optimised. 
> 
> You can also use `.I` (which is deliberately not mentioned in the vignette to 
> keep things smooth and straightforward). Using it you could do:
> 
> idx = DT[, .I[1L], by=month][, V1]
> DT[idx]
> 
> `.I` contains the row number in `x` (it doesn't reset per group..). So we can 
> get the row indices for each group for the first element, and then simply 
> subset. We hope to improve this subset in the future (to take care of this 
> optimisation internally).
> 
> Similarly:
> 
> idx = DT[, .I[.N], by=month][, V1]
> DT[idx]
> 
> will get the last element for each group.
> 
> Otherwise, how do you find the vignette so far?
> 
> HTH,
> Arun
> 
> 
> 
> On Thu, Jan 22, 2015 at 6:11 PM, Ben Tupper <[email protected]> wrote:
> Hello,
> 
> I have been learning to use data.table and studying the vignette located 
> here...
> 
> https://rawgit.com/wiki/Rdatatable/data.table/vignettes/datatable-intro-vignette.html
> 
> Section 2f. shows how to subset a data.table to select an arbitrary number of 
> rows in each .SD.  That's really handy.
> 
> 2. Aggregations
>   f. Subset .SD for each group:    ans <- flights[, head(.SD, 2), by=month]
> 
> In a similar way, I can get the last row of the .SD using either tail, nrow 
> or dim (I don't think it matters much, but dim seems to be a bit faster*).
> 
>   ans <- flights[,.SD[dim(.SD)[1]], by=month]
> 
> I got to wondering if the number of rows in .SD might be exposed in each 
> grouping iteration.  Is there an equivalent to .N for the subset data.table, 
> .SD?  Something like .SDN or the like?
> 
> Thanks for data.table!
> 
> Ben
> 
> * After reading this discussion 
> http://r.789695.n4.nabble.com/What-is-the-fastest-way-to-determine-that-data-table-is-empty-td4638348.html#a4638451
>  I tried out a couple of methods for getting the last element of a grouping 
> using nrow(), tail() and dim().
> 
> # using tail
> > microbenchmark( last1 <- flights[, tail(.SD, 1), by=month] )
> Unit: milliseconds
>                                          expr      min       lq     mean   median       uq      max neval
>  last1 <- flights[, tail(.SD, 1), by = month] 16.65898 16.89704 18.26415 17.37007 19.20147 40.12966   100
> 
> # using dim
> >   microbenchmark( last2 <- flights[,.SD[dim(.SD)[1]], by=month] )
> Unit: milliseconds
>                                              expr      min       lq     mean   median       uq      max neval
>  last2 <- flights[, .SD[dim(.SD)[1]], by = month] 15.51243 15.87788 17.40978 16.19426 17.83308 59.22429   100
> 
> # using nrow
> >   microbenchmark( last3 <- flights[,.SD[nrow(.SD)], by=month] )
> Unit: milliseconds
>                                            expr      min       lq     mean   median       uq      max neval
>  last3 <- flights[, .SD[nrow(.SD)], by = month] 15.63919 15.92073 17.28836 16.52588 18.33867 24.92624   100
> 
> >   identical(last1, last2)
> [1] TRUE
> >   identical(last1, last3)
> [1] TRUE
> 
> Ben Tupper
> Bigelow Laboratory for Ocean Sciences
> 60 Bigelow Drive, P.O. Box 380
> East Boothbay, Maine 04544
> http://www.bigelow.org
> 
> _______________________________________________
> datatable-help mailing list
> [email protected]
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> 

Ben Tupper
Bigelow Laboratory for Ocean Sciences
60 Bigelow Drive, P.O. Box 380
East Boothbay, Maine 04544
http://www.bigelow.org

