Hi Jon Olav,

On Tuesday 07 July 2009 13:44:30 Jon Olav Vik wrote:
> The problem in brief: Why does it take 20-40 seconds to extract a table
> column of 200000 integers? The code snippet in question is:
>
>     with pt.openFile(filename) as f:
>         vlarrayrow = f.root.gp.cols.vlarrayrow[:]
Quick answer: when your dataset fits in the OS filesystem cache, retrieval is
very fast. If not, the best you can do is your disk speed. For example, on my
8 GB machine, the results of your benchmark are:

INFO:root:0.270664930344 seconds, (nrow,othersize=50000,2000)
INFO:root:0.232949972153 seconds, (nrow,othersize=50000,2000)
INFO:root:0.235331773758 seconds, (nrow,othersize=52000,2000)
INFO:root:0.244709968567 seconds, (nrow,othersize=54000,2000)
INFO:root:0.245009899139 seconds, (nrow,othersize=56000,2000)
INFO:root:0.276184082031 seconds, (nrow,othersize=58000,2000)
INFO:root:0.262171030045 seconds, (nrow,othersize=50000,2200)
INFO:root:0.339207172394 seconds, (nrow,othersize=52000,2200)
INFO:root:0.288460016251 seconds, (nrow,othersize=54000,2200)
INFO:root:0.294430017471 seconds, (nrow,othersize=56000,2200)
INFO:root:0.302571773529 seconds, (nrow,othersize=58000,2200)
INFO:root:0.279940843582 seconds, (nrow,othersize=50000,2400)
INFO:root:0.290556192398 seconds, (nrow,othersize=52000,2400)
INFO:root:0.309056997299 seconds, (nrow,othersize=54000,2400)
INFO:root:0.317222118378 seconds, (nrow,othersize=56000,2400)
INFO:root:0.327784061432 seconds, (nrow,othersize=58000,2400)

Here, the time to read the first table (381 MB) is around 0.23 s, which makes
for a speed of approximately 1.6 GB/s, while the last table (531 MB) takes
0.33 s, which works out to the same 1.6 GB/s. So I can't see the performance
gap at all. The reason is that both sizes fit comfortably in the OS filesystem
cache, so they can be transferred at RAM speed. However, if I try with a much
larger table (19 GB), I get:

INFO:root:110.066845894 seconds, (nrow,othersize=1000000,5000)

which makes for 173 MB/s, i.e. the speed of my disk and 10x less than my
memory subsystem. For your case, my guess is that your table sizes are right
at the limit of the RAM available for OS caching, hence the apparently erratic
read speeds.
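In case it helps, the bandwidth figures above are just size divided by
wall-clock time; here is the arithmetic as a quick sketch (sizes and times
taken from the runs quoted above):

```python
def bandwidth_mb_s(size_mb, seconds):
    """Effective read speed in MB/s: table size over wall-clock time."""
    return size_mb / seconds

# Smallest table (381 MB), served from the OS cache:
print(bandwidth_mb_s(381, 0.233))     # ~1635 MB/s, i.e. ~1.6 GB/s
# Largest cached table (531 MB):
print(bandwidth_mb_s(531, 0.328))     # ~1619 MB/s, again ~1.6 GB/s
# The 19 GB table that does not fit in cache:
print(bandwidth_mb_s(19000, 110.07))  # ~173 MB/s: raw disk speed
```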
As you may have noticed, I've counted the *total* size of the table as being
read, rather than the size of a single column. This is because the table is
organized row-wise on disk, and you need to read *all* the columns in order to
access just one. The only way around this is to implement a column-wise table,
which I'd like to do in the future (but it is not there yet).

At any rate, the fact that table access is so fast when the table fits in the
OS cache is a good indication that PyTables is behaving well and getting the
most out of the underlying hardware :)

Cheers,

-- 
Francesc Alted
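P.S. A small stdlib-only sketch of the row-wise layout point above, in case it
helps (the record layout here is made up, not your actual table schema): the
bytes of one column are interleaved with those of every other column, so
pulling a single column still has to sweep over every record.

```python
import struct

# A toy row-wise record: one int32 field plus two float64 fields,
# packed back-to-back the way a row-wise table stores each row.
rec = struct.Struct('<idd')   # 4 + 8 + 8 = 20 bytes per row
print(rec.size)               # 20

# Serialize 3 rows contiguously, as a row-wise table file would.
buf = b''.join(rec.pack(i, 0.0, 0.0) for i in range(3))

# Extracting just the int column means stepping rec.size bytes
# between values, i.e. visiting every record in the table:
col = [struct.unpack_from('<i', buf, offset=i * rec.size)[0]
       for i in range(3)]
print(col)                    # [0, 1, 2]
```

A column-wise store would instead keep all the int32 values contiguous, so a
column read would touch only those 4 bytes per row.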