On Sep 18, 2013, at 7:46 PM, lochana menikarachchi wrote:
> I need to quickly load 5-10 molecules to a jTable from a large SD file (say 1
> million structures). ... MarvinView can load 5-10 structures from extremely
> large files in few seconds. I wonder how marvin does this?? Any suggestions
> to replicate this functionality with CDK??

I was curious about what the absolute fastest indexing time could be. If you
can assume that the string "$$$$" is only found at the end of the record,
and not in oddities like:

    > <price>
    $$$$

or the extreme edge-case "S SKP" section, then you might be able to
get an indexing time of about 3 seconds.

    % ls -lh chembl_14.sdf
    -rw-r--r--  1 dalke  admin  2.6G Sep 18 22:47 chembl_14.sdf
    % time fgrep -c '$$$$' chembl_14.sdf
    1212539
    2.629u 0.397s 0:03.05 98.6% 0+0k 8+1io 0pf+0w

This depends a lot on your system: before I restarted my computer this
took around 25 seconds because I had very little free memory left.
This timing is also from the second time I ran the test, so the disk
cache was hot.

However, that grep is a bad solution for a general-purpose SD record
tokenizer because some valid SD records will break such a simple scanner.
A few months ago I tried writing a correct one in C. The best I
could do takes 14 seconds for this case.

I don't see how MarvinView can be much faster than this. Since you
say "a few seconds", I wonder if it has an indexing thread in the
background, which scans the rest of the file while you are looking at
the first 10 or so records.

I do think that a fast low-level indexer, which reports the record
id and start/end byte positions but does not perceive chemistry, is
a very useful tool to have. It sounds like CDK doesn't have such a thing,
and the other toolkits I know of (Open Babel, RDKit, and OEChem) don't
have one either.

On Sep 18, 2013, at 9:57 PM, John May wrote:
> In terms of reading sections of a file - if it's uncompressed it would be
> nice to have a utility to do something with memory mapping
> (http://javarevisited.blogspot.co.uk/2012/01/memorymapped-file-and-io-in-java.html).

Egon reported times of:

    real 9m58.781s
    user 9m14.000s
    sys 0m8.528s

Nearly all of the elapsed time is user CPU time, so this doesn't suggest
that there's an I/O bottleneck that could be improved by memory mapping.

> For the faster basic reader I've been hacking on/off at a reimplementation
> for the last year or so. ...
> There may be a faster implementation in future
> versions but as Joos says this requires some significant effort.

I've had the same experience.

Cheers,

    Andrew
    da...@dalkescientific.com

_______________________________________________
Cdk-user mailing list
Cdk-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/cdk-user
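
For illustration, the kind of low-level indexer described in the message above (record id plus start/end byte positions, no chemistry perception) could be sketched roughly as follows. This is only a Python sketch, not anything taken from CDK, MarvinView, or the C tokenizer mentioned above. The names index_sd_records and read_record are made up for the example, the first line of each record is assumed to serve as its id, and the code relies on a slightly stricter form of the fgrep assumption: a line consisting only of "$$$$" never occurs inside a record.

```python
import io


def index_sd_records(stream):
    """Yield (title, start, end) for each record in a binary SD stream.

    ASSUMPTION: a line consisting only of "$$$$" appears solely as a record
    terminator. A data value of "$$$$" would be split incorrectly.
    """
    start = 0      # byte offset where the current record begins
    pos = 0        # byte offset just past the last line read
    title = None   # first line of the current record, used as its id
    for line in stream:            # binary streams split lines on b"\n"
        if title is None:
            title = line.rstrip(b"\r\n").decode("latin-1")
        pos += len(line)
        if line.rstrip(b"\r\n") == b"$$$$":
            yield title, start, pos
            start, title = pos, None
    # Any trailing bytes after the last "$$$$" are ignored in this sketch.


def read_record(stream, start, end):
    """Return the raw bytes of one record, given offsets from the index."""
    stream.seek(start)
    return stream.read(end - start)
```

With a real file you would open it in binary mode, build the index once (or in a background thread), and then call read_record only for the 5-10 records you want to display, handing those bytes to the toolkit's normal SD parser.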