Re: [Cdk-user] Reading large SD Files

Andrew Dalke Wed, 18 Sep 2013 17:08:32 -0700

On Sep 18, 2013, at 7:46 PM, lochana menikarachchi wrote:
> I need to quickly load 5-10 molecules to a jTable from a large SD file(say 1 
> million structures). ... MarvinView can load 5-10 structures from extremely 
> large files in few seconds. I wonder how marvin does this?? Any suggestions 
> to replicate this functionality with CDK??

I was curious about what the absolute fastest indexing time could be. If you can 
assume that the string "$$$$" is only found at the end of the record, and not 
in oddities like:

> <price>
$$$$

or the extreme edge-case "S  SKP" section, then you might be able to get an 
indexing time of about 3 seconds.

% ls -lh chembl_14.sdf
-rw-r--r--  1 dalke  admin   2.6G Sep 18 22:47 chembl_14.sdf
% time fgrep -c '$$$$' chembl_14.sdf
1212539
2.629u 0.397s 0:03.05 98.6%     0+0k 8+1io 0pf+0w

This depends heavily on your system: before I restarted my computer this took 
around 25 seconds because I had very little free memory left. The timing above 
is also from the second run of the test, so the disk cache was hot.

However, that grep is a bad solution for a general-purpose SD record tokenizer 
because valid SD records will break that simple scanner.
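
To make the failure concrete, here is a minimal Python sketch. The helper name 
and the sample record are hypothetical, and the function only approximates what 
"fgrep -c" reports (the number of lines containing the string). A single record 
whose data item value is "$$$$" is counted twice:

```python
def count_delimiter_lines(data: bytes) -> int:
    """Count lines containing "$$$$", roughly what `fgrep -c '$$$$'` reports."""
    return sum(1 for line in data.splitlines() if b"$$$$" in line)


# Hypothetical single record whose "price" data item has the value "$$$$".
ONE_RECORD = (
    b"example\n"
    b"  hypothetical\n"
    b"\n"
    b"  0  0  0  0  0  0  0  0  0  0999 V2000\n"
    b"M  END\n"
    b"> <price>\n"
    b"$$$$\n"
    b"\n"
    b"$$$$\n"
)

# One record, but two lines match the delimiter string.
print(count_delimiter_lines(ONE_RECORD))
```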

A few months ago I tried writing a correct one in C. The best I could do takes 
14 seconds for this case.

I don't see how MarvinView can be much faster than this. Since you say "a few 
seconds", I wonder if it has an indexing thread in the background, which scans 
the rest of the file while you are looking at the first 10 or so records.
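
As an illustration of that guess (not a description of how MarvinView actually 
works), a background indexer could look roughly like the Python sketch below. 
The class and method names are hypothetical, and the scan is the naive one that 
treats any line equal to "$$$$" as a record terminator:

```python
import threading
from typing import BinaryIO, List


class BackgroundIndex:
    """Collect record end offsets on a worker thread (naive "$$$$" line scan)."""

    def __init__(self, stream: BinaryIO) -> None:
        self._stream = stream          # read only by the worker thread
        self._ends: List[int] = []     # byte offset just past each "$$$$" line
        self._finished = False
        self._cond = threading.Condition()
        self._thread = threading.Thread(target=self._scan, daemon=True)
        self._thread.start()

    def _scan(self) -> None:
        pos = 0
        for raw in self._stream:
            pos += len(raw)
            if raw.rstrip(b"\r\n") == b"$$$$":
                with self._cond:
                    self._ends.append(pos)
                    self._cond.notify_all()
        with self._cond:
            self._finished = True
            self._cond.notify_all()

    def wait_for(self, n: int) -> List[int]:
        """Block until n records are indexed or the scan has finished."""
        with self._cond:
            self._cond.wait_for(lambda: len(self._ends) >= n or self._finished)
            return self._ends[:n]

    def join(self) -> List[int]:
        """Wait for the whole scan and return every record end offset."""
        self._thread.join()
        with self._cond:
            return list(self._ends)
```

A viewer could call wait_for(10) to display the first records right away and 
call join() later for the full index. The records themselves should be read 
through a separate file handle, since the worker thread owns the stream it 
scans.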

I do think that a fast low-level indexer, which reports the record id and 
start/end byte positions but does not perceive chemistry, is a very useful tool 
to have. It sounds like CDK doesn't have such a thing, and the other toolkits I 
know of (Open Babel, RDKit, and OEChem) don't have one either.
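
A rough Python sketch of such an indexer follows. It is not the C tokenizer 
mentioned above, and all names in it are hypothetical. It assumes the first 
line of each record serves as the record id, that every data item value is 
terminated by a blank line, and it ignores the "S  SKP" and atom alias edge 
cases. A trailing record without a "$$$$" line is not reported.

```python
from typing import BinaryIO, Iterator, NamedTuple


class RecordSpan(NamedTuple):
    title: str  # first line of the record, used here as the record id
    start: int  # byte offset of the first byte of the record
    end: int    # byte offset one past the record's "$$$$" line


def index_sd_records(stream: BinaryIO) -> Iterator[RecordSpan]:
    """Yield (title, start, end) per record without perceiving any chemistry."""
    pos = 0                # byte offset just past the current line
    start = 0              # byte offset where the current record begins
    line_no = 0            # 1-based line number within the current record
    title = ""
    in_data_value = False  # True while inside the value of a "> <tag>" item
    for raw in stream:
        pos += len(raw)
        line = raw.rstrip(b"\r\n")
        line_no += 1
        if line_no <= 3:
            # The three header lines are free text, so never treat them
            # as a delimiter or as a data item header.
            if line_no == 1:
                title = line.decode("latin-1")
            continue
        if in_data_value:
            # A data value runs until the next blank line; a "$$$$" line
            # inside it is data, not a record terminator.
            if not line.strip():
                in_data_value = False
            continue
        if line.startswith(b">"):
            in_data_value = True
        elif line == b"$$$$":
            yield RecordSpan(title, start, pos)
            start, line_no = pos, 0
```

With the spans in hand, fetching record N is a seek to its start offset 
followed by a read of (end - start) bytes, which can then be handed to the 
toolkit's normal parser.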

On Sep 18, 2013, at 9:57 PM, John May wrote:
> In terms of reading sections of a file - if it's uncompressed it would be 
> nice to have a utility to do something with memory mapping 
> (http://javarevisited.blogspot.co.uk/2012/01/memorymapped-file-and-io-in-java.html).

Egon reported times of:

real    9m58.781s
user    9m14.000s
sys     0m8.528s

User plus system time accounts for about 94% of the wall-clock time, so the run 
is CPU-bound. This doesn't suggest that there's an I/O bottleneck that could be 
improved by memory mapping.

> For the faster basic reader I've been hacking on/off at a reimplementation 
> for the last year or so. ... There may be a faster implementation in future 
> versions but as Joos says this requires some significant effort.

I've had the same experience.

Cheers,

                                Andrew
                                da...@dalkescientific.com
