So the IteratingSDFReader isn't as slow as other parts of the CDK but it could 
certainly be improved. 

Here are a couple of general thoughts:

The IteratingSDFReader is flexible: it reads the record once, then again to 
determine whether it's MDL, MDLV2000 or MDLV3000, then again to actually 
parse. These aren't disk reads, but it's a fair bit of redundant String 
traversal and buffering. Maybe someone will give it a file with a mixture of 
V2000 and V3000, but I think the check could probably be made optional so 
that you can say via an option "I only have V2000 - please don't check this 
and read it straight away".

In Java 1.6 the MDLV2000Reader has some slow points. Substrings are cheap, but 
take for example this line 
(https://github.com/johnmay/cdk/blob/master/src/main/org/openscience/cdk/io/MDLV2000Reader.java#L472):
the substring is just a reference to the 'line' in the file (even after the 
trim), so it keeps the entire line around (not good).

In Java 1.7 substring now copies the string data. However, MDLV2000Reader 
uses tonnes of substring operations and this could cause a slowdown.
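
To illustrate the first and last points together, here is a rough sketch of 
checking the version tag in place, without creating any substrings. It is not 
CDK API: the class and method names are made up for the example, and it 
assumes the counts line (4th line of the record) is padded as the CTfile 
layout describes, with the tag starting at character index 34.

```java
/** Hypothetical helper, not part of the CDK. */
final class CtabVersionSniffer {

    enum Version { V2000, V3000, UNKNOWN }

    // Assumption: in a correctly padded counts line the eleven 3-character
    // fields occupy indices 0..32 and the tag " V2000" / " V3000" follows,
    // so the 'V' sits at index 34.
    private static final int TAG_OFFSET = 34;

    /** Reports which version tag the counts line carries. */
    static Version sniff(String countsLine) {
        // regionMatches compares characters in place: no substring, no trim.
        // It simply returns false when the line is too short.
        if (countsLine.regionMatches(TAG_OFFSET, "V2000", 0, 5)) return Version.V2000;
        if (countsLine.regionMatches(TAG_OFFSET, "V3000", 0, 5)) return Version.V3000;

        // Fallback for lines that are not padded to the expected width.
        // Lines with trailing whitespace still end up as UNKNOWN here.
        if (countsLine.endsWith("V2000")) return Version.V2000;
        if (countsLine.endsWith("V3000")) return Version.V3000;
        return Version.UNKNOWN;
    }
}
```

A reader with an "I only have V2000" option would skip even this check; the 
sketch only shows that the check itself does not need a second pass over the 
whole record.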

In terms of reading sections of a file - if it's uncompressed it would be nice 
to have a utility to do something with memory mapping 
(http://javarevisited.blogspot.co.uk/2012/01/memorymapped-file-and-io-in-java.html).
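
A minimal sketch of what such a utility might look like, assuming Java 7+ 
(java.nio.file) and an uncompressed file small enough for a single mapping. 
It maps the file and records the byte offset where each record starts by 
looking for "$$$$" delimiter lines. The class and method names are invented 
for the example; nothing like this exists in the CDK as far as this thread 
says.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical utility, not part of the CDK. */
final class SdfOffsetIndex {

    /** Returns the byte offset at which each SD record starts. */
    static List<Long> recordOffsets(Path sdf) throws IOException {
        List<Long> offsets = new ArrayList<>();
        try (FileChannel ch = FileChannel.open(sdf, StandardOpenOption.READ)) {
            long size = ch.size();
            if (size == 0) return offsets;
            // One mapping cannot exceed Integer.MAX_VALUE bytes; a real
            // utility would map larger files in several windows.
            if (size > Integer.MAX_VALUE)
                throw new IOException("file too large for a single mapping");

            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, size);
            int n = (int) size;
            offsets.add(0L);              // first record starts at byte 0
            int lineStart = 0;
            for (int i = 0; i < n; i++) {
                if (buf.get(i) != '\n') continue;
                // A delimiter line followed by more bytes starts a new record.
                // Note: trailing blank lines after the last "$$$$" would
                // produce one extra offset in this simple version.
                if (isDelimiter(buf, lineStart, i) && i + 1 < n)
                    offsets.add((long) (i + 1));
                lineStart = i + 1;
            }
        }
        return offsets;
    }

    /** True if the line [start, end) is exactly "$$$$", ignoring a trailing CR. */
    private static boolean isDelimiter(MappedByteBuffer buf, int start, int end) {
        int len = end - start;
        if (len > 0 && buf.get(end - 1) == '\r') len--;
        if (len != 4) return false;
        for (int k = 0; k < 4; k++)
            if (buf.get(start + k) != '$') return false;
        return true;
    }
}
```

With the offsets in hand, record n can be read by positioning at 
offsets.get(n) and parsing only that record. How fast the scan is on a 
million-structure file would need measuring.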

For the faster basic reader I've been hacking on/off at a reimplementation for 
the last year or so. I don't have time at the moment though and I really want 
to get other stuff finished so we can release 1.6. There may be a faster 
implementation in future versions but as Joos says this requires some 
significant effort.

If you can guarantee the format is correct (i.e. digits padded correctly) you 
can write a very fast parser of the atom block line.

https://github.com/johnmay/cdk/blob/82507e981b8acb5ac3e7b94b829eb7f242b38d48/src/main/org/openscience/cdk/io/MDLV2000AtomBlock.java

I think as this stands it will read most of what the current reader does. 
Other parts of the parser need adapting but speeding up the atom block parsing 
is certainly a start.
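
To give a flavour of what "fast if the padding is guaranteed" means, here is a 
small sketch. It is not the code from the linked MDLV2000AtomBlock; the class 
and method names are invented, and it assumes the usual V2000 atom line layout 
where x, y and z are 10-character fields at indices 0, 10 and 20 and the 
element symbol occupies indices 31..33. The numbers are read character by 
character, so no substring or trim is needed for the coordinates.

```java
/** Hypothetical sketch, not the CDK implementation. */
final class FixedWidthAtomLine {

    private static final double[] POW10 = {
        1, 10, 100, 1000, 10000, 100000, 1000000, 10000000, 100000000, 1000000000
    };

    /** Parses the 10-character coordinate field starting at offset, e.g. "   -1.2345". */
    static double readCoordinate(String line, int offset) {
        int end = offset + 10;
        int i = offset;
        while (i < end && line.charAt(i) == ' ') i++;      // skip left padding
        boolean negative = false;
        if (i < end && line.charAt(i) == '-') { negative = true; i++; }

        long digits = 0;      // all digits, ignoring the decimal point
        int decimals = -1;    // digits seen after '.', -1 until '.' is seen
        for (; i < end; i++) {
            char c = line.charAt(i);
            if (c == '.' && decimals < 0) {
                decimals = 0;
            } else if (c >= '0' && c <= '9') {
                digits = digits * 10 + (c - '0');
                if (decimals >= 0) decimals++;
            } else {
                // Only valid when the input really is correctly padded.
                throw new IllegalArgumentException("bad coordinate field at " + i);
            }
        }
        // A single division keeps the result correctly rounded.
        double value = digits / POW10[Math.max(decimals, 0)];
        return negative ? -value : value;
    }

    /** Reads the element symbol from indices 31..33 with one short substring. */
    static String readSymbol(String line) {
        if (line.length() < 32)
            throw new IllegalArgumentException("atom line too short");
        int start = 31;
        int end = Math.min(34, line.length());
        while (start < end && line.charAt(start) == ' ') start++;
        while (end > start && line.charAt(end - 1) == ' ') end--;
        return line.substring(start, end);
    }
}
```

For a line such as "   -1.2345    0.5000   12.0000 Cl ...", the calls 
readCoordinate(line, 0), readCoordinate(line, 10) and readCoordinate(line, 20) 
give the three coordinates and readSymbol(line) gives "Cl". Anything that is 
not correctly padded is rejected rather than guessed at, which is the 
trade-off for the speed.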

J

On 18 Sep 2013, at 19:34, Joos Kiener <j...@sunrise.ch> wrote:

> Hi Lochana,
> 
> I think you sure will need Multi-threading to keep the UI responsive and also 
> some form of cache, meaning if you display 10 compounds in the "viewport" of 
> the table, keep a lot more in memory, maybe 100 both up or down and when the 
> user scrolls, adjust the cache accordingly but the scrolling will be fluid 
> and fast.
> 
> Use this with the RandomAccessSDFReader. I don't know its performance and I 
> guess the indexing phase can take pretty long. Hence that should be run in 
> the background. Maybe you can extend it to use the above-mentioned cache 
> because it still uses file access and that is always very, very slow 
> compared to memory.
> 
> Or create your own reader that can quickly jump to the desired record but 
> does not need any indexing. eg. if you display record 100 to 110, and have as 
> example records 50 to 150 in cache and user scrolls up 1 page read records 
> 40-49 from the file into the cache. Hint: record 40 (assuming index 0-based) 
> starts at the 40th occurrence  of $$$$. So maybe just indexing all $$$$ 
> positions could suffice (no idea how fast that is in a large file). With this 
> I would probably cache more like in the 1000 record range and not adjust 
> cache for every single page change.
> 
> Anyway I don't think this can be made fast without some significant effort. 
> 
> Best Regards,
> 
> Joos
> 
> Am 18.09.2013 19:46, schrieb lochana menikarachchi:
>> Hi Everyone,
>> 
>> I need to quickly load 5-10 molecules to a jTable from a large SD file (say 
>> 1 million structures). The table needs to be updated by loading only 5-10 
>> structures as the user scrolls down the table. I tried various SDF readers 
>> in CDK (the iterating reader, RandomAccessSDFReader) but they are extremely 
>> slow compared to what MarvinView (written in Java) has. MarvinView can load 
>> 5-10 structures from extremely large files in a few seconds. I wonder how 
>> Marvin does this? Any suggestions to replicate this functionality with CDK?
>> 
>> Thanks.
>> 
>> Lochana
>> 
>> 
>> _______________________________________________
>> Cdk-user mailing list
>> Cdk-user@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/cdk-user
