Hi all,

the issue with RandomAccessSDFReader, or more specifically the underlying RandomAccessReader, is that it uses RandomAccessFile's readLine() method. That method performs very, very badly (because it's badly written), hence the indexing takes very long. A solution would be to rewrite the index method without using readLine().
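To illustrate what "without readLine()" can look like, here is a minimal sketch that scans a file in 8 KB blocks and inspects the bytes in memory. It assumes the cost of readLine() comes from RandomAccessFile being unbuffered, so that every line is assembled from many tiny reads; that matches the JDK sources I know of, but treat it as an assumption. The class and method names (BlockLineScanner, countLines) are made up for this example and are not part of CDK.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

/** Hypothetical helper, not part of CDK: scans a file block by block instead of calling readLine(). */
class BlockLineScanner {

    /** Counts '\n' bytes using 8 KB block reads; this covers both \n and \r\n line endings. */
    static long countLines(RandomAccessFile raf) throws IOException {
        byte[] buffer = new byte[8192];
        long lines = 0;
        int bytesRead;
        raf.seek(0); // always scan from the start of the file
        while ((bytesRead = raf.read(buffer)) != -1) {
            // only the first bytesRead bytes of the buffer are valid in this pass
            for (int i = 0; i < bytesRead; i++) {
                if (buffer[i] == '\n') {
                    lines++;
                }
            }
        }
        return lines;
    }
}
```

The same loop shape (one read() per block, all searching done on the in-memory buffer) is what any faster indexer would build on.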
I have not looked at the existing indexing method or what exactly it does, but here is a way to index the start (as a byte offset) of every record in an SD file into a Map<Integer,Long>; see below. Indexing takes 3-4 seconds (on an SSD...) for a 540 MB SD file (from ZINC) containing approx. 131'000 records. Hence that would be about 30 seconds for 1 million compounds.

    private void index() throws IOException {
        sdfIndex.put(0, 0L); // first record starts at byte 0
        byte[] buffer = new byte[8192];
        int recordIndex = 1; // index 0 is already taken by the first record
        int newLineOffset = 1; // 1 or 2 depending if it is \n or \r\n
        boolean separatorChecked = false;
        // tail of the previous chunk, so that a delimiter split
        // across two reads is still found
        String carry = "";
        int bytesRead;
        while ((bytesRead = raf.read(buffer)) != -1) {
            // decode only the bytes actually read in this pass
            String data = carry + new String(buffer, 0, bytesRead, "US-ASCII");
            // file offset of the first character of data
            long dataOffset = raf.getFilePointer() - bytesRead - carry.length();
            // determine the line delimiter once;
            // it can be \n for SD files also on Windows if
            // they were generated on Linux or by certain toolkits
            // (getLineSeparator is a helper not shown here)
            if (!separatorChecked) {
                if (getLineSeparator(data).equals("\r\n")) {
                    newLineOffset = 2;
                }
                separatorChecked = true;
            }
            int index = data.indexOf(DELIMITER);
            while (index >= 0) {
                // we want to start reading after the delimiter,
                // on the next line
                //   aaaaaa
                //   $$$$
                //   bbbbbb <- get the offset where this line starts
                long recordOffset = dataOffset + index
                        + DELIMITER.length() + newLineOffset;
                sdfIndex.put(recordIndex, recordOffset);
                recordIndex++;
                index = data.indexOf(DELIMITER, index + DELIMITER.length());
            }
            carry = data.substring(
                    Math.max(0, data.length() - (DELIMITER.length() - 1)));
        }
        // SD files terminate with DELIMITER,
        // hence the last entry in the index must be removed as no
        // record will be there.
        sdfIndex.remove(recordIndex - 1);
    }

Attached is a full implementation of the idea above. Note that it returns text data only, no chemistry. (It works, but is not really tested; use at your own risk!)

2013/9/19 lochana menikarachchi <locha...@yahoo.com>

>
> Joos,
>
> You are right. I should use a local database instead of SD files...
>
> Lochana
> ------------------------------
> *From:* Joos Kiener <j...@sunrise.ch>
> *To:* lochana menikarachchi <locha...@yahoo.com>
> *Cc:* "cdk-user@lists.sourceforge.net" <cdk-user@lists.sourceforge.net>
> *Sent:* Thursday, September 19, 2013 9:37 AM
> *Subject:* Re: [Cdk-user] Reading large SD Files
>
> I played around a bit and came up with a crude solution as I mentioned in
> my initial response.
>
> index all occurrences of "$$$$" -> takes 3-4 seconds for a file with
> 131'000 records
>
> use a separate thread to index to increase performance, but the current
> implementation requires that the index is fully built. This is an issue as
> you need to have 2 access mechanisms, index based and not-index based.
>
> Use BufferedReader to go to the indexed line, e.g.
>
> for (int i = 0; i < linesToRead; i++) {
>     bufferedReader.readLine();
> }
>
> yeah, not ideal but it actually is faster than I expected.
>
> add caching to it.
>
> But a question remains:
>
> What is your actual goal? Why can't you use Marvin, for commercial use?
> 1 million is a lot. Using a real database comes to mind.
>
> 2013/9/19 lochana menikarachchi <locha...@yahoo.com>
>
> Hi Nina,
>
> I did try the RandomAccessSDFReader. It took a few minutes to build the
> index for an SD file with 50,000 structures. What I am saying is that
> whatever MarvinView does to build its index (if it is using an index) is
> much faster. I am wondering how it does that.
>
> Lochana
>
> _______________________________________________
> Cdk-user mailing list
> Cdk-user@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/cdk-user
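As a follow-up to the byte-offset index built by index() earlier in this message: once the offsets are known, a record can be fetched with a single seek instead of skipping lines with readLine(). The sketch below is only an illustration under assumptions: IndexedRecordReader and readRecord are hypothetical names, the index maps record number to start offset as described above, and each record is assumed to be small enough to fit in one byte array.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.Map;

/** Hypothetical helper: fetches one record's text via a record-number -> byte-offset index. */
class IndexedRecordReader {

    static String readRecord(RandomAccessFile raf, Map<Integer, Long> sdfIndex, int record)
            throws IOException {
        Long start = sdfIndex.get(record);
        if (start == null) {
            throw new IllegalArgumentException("no such record: " + record);
        }
        // a record ends where the next one starts, or at end of file for the last one
        Long next = sdfIndex.get(record + 1);
        long end = (next != null) ? next : raf.length();
        byte[] bytes = new byte[(int) (end - start)]; // assumes a record is far below 2 GB
        raf.seek(start);
        raf.readFully(bytes);
        // the returned text still includes the trailing "$$$$" delimiter line
        return new String(bytes, StandardCharsets.US_ASCII);
    }
}
```

Like the attached implementation, this returns text only; parsing the text into chemistry would be a separate step.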
ScrollingSdfReader.7z
Description: Binary data
_______________________________________________ Cdk-user mailing list Cdk-user@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/cdk-user