Also, it makes a difference whether or not the groups happen to be contiguous in the table.
Try creating a large table with large groups, where each group is scattered non-contiguously throughout the table. Time an ad hoc by. Then set a key, remove the key, and time the ad hoc by again. Setting the key sorts the table, so the groups stay contiguous even after the key is removed, and the second ad hoc by should be much faster. Then set the key again and time a keyed by; it should be faster still. Does that illustrate what's going on?

On Thu, 2011-08-25 at 19:18 +0100, Matthew Dowle wrote:
> JJ,
> Yes, Chris is spot on.
> keyed by should be faster when the size of each group is large; e.g., a
> 1 billion row data.table of 1,000 groups. See FAQ 3.3 for why.
> However in your example, ad hoc by does seem more appropriate.
> Matthew
>
> On Thu, 2011-08-25 at 11:17 -0400, Chris Neff wrote:
> > You don't necessarily have to use keys at all. When you aggregate and
> > give the by columns, they don't necessarily have to be keys of the
> > data table. This is called an "ad-hoc by". It is slightly slower, but
> > my intuition says that it isn't really any slower than setting the
> > key.
> >
> > When you add a key you sort by those fields. You incur a time cost
> > for that. If you are consistently doing things with those keys then
> > you may make up for that time cost further on. But for multiple
> > different groupings the ad-hoc by is probably faster. Do some timings
> > to see. Some simple ones I did show that the act of sorting is slower
> > than ad-hoc by.
> >
> > On 25 August 2011 11:05, Jean Jacques Dureau <[email protected]> wrote:
> > > Hi,
> > > I have a data.table (10,000k rows) with 20 (factor) fields and I
> > > need to filter data according to some of them.
> > > I use this data.table inside a function and I don't know "in advance"
> > > which fields I'll use to filter data and to sum.
> > >
> > > So, for example, consider a data.table (named dt_data) with 20 fields,
> > > named f1, f2, ..., f20.
> > >
> > > I use this approach: I set the key on the field I have to use, for
> > > example f2. Then I "filter" the data and use them to do some
> > > computations.
> > >
> > > Subsequently, with these computations, I discover which fields I have
> > > to filter, for example f4 and f5. Now I set the key of dt_data to
> > > (f4,f5), and so on ...
> > >
> > > I use this approach because I don't know if it's possible to set the
> > > key on all fields f1, f2, ..., f20 in advance and then use only some of
> > > them!
> > >
> > > Is there a better way to use data.table?
> > >
> > > thanks
> > >
> > > jj
> > > _______________________________________________
> > > datatable-help mailing list
> > > [email protected]
> > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
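
For anyone who wants to try the experiment described at the top of this message, here is a rough sketch of one way to set it up. The table size, the column names grp and x, and the helper time_by are all made up for illustration; setkey(DT, NULL) is assumed as the way to drop a key (it is in current data.table releases). No timings are shown because they depend heavily on the machine and the data.table version.

```r
library(data.table)

set.seed(1)
n_rows   <- 1e6   # assumed size; increase it to make differences easier to see
n_groups <- 100   # few groups, so each group is large

# Random group ids, so each group is scattered non-contiguously.
DT <- data.table(grp = sample(n_groups, n_rows, replace = TRUE),
                 x   = runif(n_rows))

# Hypothetical helper: elapsed seconds for a sum of x by grp.
time_by <- function(dt) {
  system.time(dt[, sum(x), by = grp])[["elapsed"]]
}

# 1. Ad hoc by, groups scattered.
t_adhoc_scattered <- time_by(DT)

# 2. Set the key (this sorts DT by grp), then remove it.
#    The rows stay sorted, so the groups are now contiguous.
setkey(DT, grp)
setkey(DT, NULL)
t_adhoc_contiguous <- time_by(DT)

# 3. Set the key again and group by the key column.
setkey(DT, grp)
t_keyed <- time_by(DT)

print(c(adhoc_scattered  = t_adhoc_scattered,
        adhoc_contiguous = t_adhoc_contiguous,
        keyed            = t_keyed))
```

A single run is noisy, so it is worth repeating each timing a few times and trying more rows before drawing conclusions.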
