Re: [Cdk-user] Reading large SD Files

Joos Kiener Fri, 20 Sep 2013 05:09:10 -0700

Hi all,

the issue with RandomAccessSDFReader, more specifically the underlying
RandomAccessReader, is that it uses RandomAccessFile's readLine() method.
That method performs very poorly because it is unbuffered and reads the
file one byte at a time. Hence the indexing takes very long. A solution
would be to rewrite the index method without using readLine().



I have not looked at the indexing method and what exactly it does, but here
is a way to index the start (as a byte offset) of every record in an
SD file into a Map<Integer,Long>; see below.
Indexing takes 3-4 seconds (on an SSD...) for a 540 MB SD file (from
ZINC) containing approx. 131'000 records.
Assuming the time scales linearly, that would be roughly 25-30 seconds
for 1 million compounds.


private void index() throws IOException {

        sdfIndex.put(0, 0L); // first record starts at byte 0
        byte[] buffer = new byte[8192];
        int recordIndex = 1; // key 0 is already used by the first record
        int newLineOffset = 1; // 1 or 2 depending on whether it is \n or \r\n
        boolean firstChunk = true;
        int bytesRead;

        // Limitation: a DELIMITER split across two reads is not detected.
        while ((bytesRead = raf.read(buffer)) != -1) {

            // decode only the bytes just read; the rest of the buffer
            // may still hold data from the previous read
            String data = new String(buffer, 0, bytesRead, "US-ASCII");

            // determine the new line delimiter once
            // can be \n for sd-files also on Windows if
            // they were generated on linux or by certain toolkits
            if (firstChunk) {
                if (getLineSeparator(data).equals("\r\n")) {
                    newLineOffset = 2;
                }
                firstChunk = false;
            }

            ArrayList<Integer> recordEnds = new ArrayList<>();
            int index = data.indexOf(DELIMITER);
            while (index >= 0) {
                recordEnds.add(index);
                index = data.indexOf(DELIMITER, index + 1);
            }
            long offsetBeforeRead = raf.getFilePointer() - bytesRead;
            for (int position : recordEnds) {
                // we want to start reading after the delimiter
                // on the next new line
                // aaaaaa
                // $$$$
                // bbbbbb <- get offset where this line starts
                long recordOffset = offsetBeforeRead + position
                        + DELIMITER.length() + newLineOffset;
                sdfIndex.put(recordIndex, recordOffset);
                recordIndex++;
            }
        }
        // sd-files terminate with DELIMITER
        // hence the last entry in index must be removed as no
        // record will be there.
        sdfIndex.remove(recordIndex - 1);
    }
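
One weak spot of the chunked approach is a "$$$$" that straddles two
reads. Here is a minimal sketch of one way around that: scan the raw bytes
and keep the per-line state across chunks. The class and method names
(SdfOffsetIndexer, indexRecords) are made up for illustration and are not
part of CDK. It assumes the delimiter is a line consisting only of "$$$$",
which is stricter than the indexOf() search above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of CDK.
public class SdfOffsetIndexer {

    /** Returns the byte offset at which each record starts. */
    public static List<Long> indexRecords(Path sdf) throws IOException {
        List<Long> offsets = new ArrayList<>();
        byte[] buffer = new byte[8192];
        long pos = 0;                // absolute offset of the current byte
        int dollars = 0;             // '$' characters seen on the current line
        boolean otherChars = false;  // line holds something besides '$' or '\r'
        boolean pendingStart = true; // the next byte, if any, starts a record

        try (InputStream in = Files.newInputStream(sdf)) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                for (int i = 0; i < n; i++, pos++) {
                    byte b = buffer[i];
                    if (pendingStart) {
                        // only recorded when a byte follows the delimiter,
                        // so a file ending in "$$$$\n" adds no empty record
                        offsets.add(pos);
                        pendingStart = false;
                    }
                    if (b == '\n') {
                        if (dollars == 4 && !otherChars) {
                            pendingStart = true;
                        }
                        dollars = 0;
                        otherChars = false;
                    } else if (b == '$') {
                        dollars++;
                    } else if (b != '\r') {
                        // '\r' is ignored so \n and \r\n files both work
                        otherChars = true;
                    }
                }
            }
        }
        return offsets;
    }
}
```

Because the state lives outside the read loop, the buffer size no longer
matters for correctness. Trailing blank lines after the last "$$$$" would
still be indexed as one extra record.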


Attached is a full implementation of the above idea. Note that it returns
text data only, no chemistry. (It works, but is not really tested; use at
your own risk!)
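
To illustrate how such an index might be used (this is only a sketch, not
the attached code): seek to the stored offset and read lines through a
buffered reader until the next "$$$$". SdfRecordReader and readRecord are
hypothetical names. The buffered reader sits on top of the file's channel,
so RandomAccessFile.readLine() is never called.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.RandomAccessFile;
import java.nio.channels.Channels;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of CDK and not the attached implementation.
public class SdfRecordReader {

    /** Returns the text of the record starting at the given byte offset. */
    public static String readRecord(RandomAccessFile raf, long offset)
            throws IOException {
        // moving the file pointer also moves the channel position
        raf.seek(offset);
        // the reader is deliberately not closed: closing it would close
        // the channel and with it the RandomAccessFile
        BufferedReader reader = new BufferedReader(new InputStreamReader(
                Channels.newInputStream(raf.getChannel()),
                StandardCharsets.US_ASCII));
        StringBuilder record = new StringBuilder();
        String line;
        // stop at the record delimiter or at end of file
        while ((line = reader.readLine()) != null && !line.equals("$$$$")) {
            record.append(line).append('\n');
        }
        return record.toString();
    }
}
```

The reader buffers ahead, so the file pointer ends up somewhere past the
record; that is harmless here because every call seeks first. Line endings
come back normalized to \n.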



2013/9/19 lochana menikarachchi <locha...@yahoo.com>

>
> Joos,
>
> You are right. I should use a local database instead of SD files...
>
> Lochana
>   ------------------------------
>  *From:* Joos Kiener <j...@sunrise.ch>
> *To:* lochana menikarachchi <locha...@yahoo.com>
> *Cc:* "cdk-user@lists.sourceforge.net" <cdk-user@lists.sourceforge.net>
> *Sent:* Thursday, September 19, 2013 9:37 AM
> *Subject:* Re: [Cdk-user] Reading large SD Files
>
> I played around a bit and came up with a crude solution, as I mentioned in
> my initial response.
>
> index all occurrences of "$$$$" -> takes 3-4 seconds for a file with
> 131'000 records
>
> use separate thread to index to increase performance but current
> implementation requires that index is fully built. This is an issue as you
> need to have 2 access mechanisms, index based and not-index based.
>
>
> Use BufferedReader to go to the indexed line, e.g.
>
> for (int i = 0; i < linesToRead; i++) {
>     bufferedReader.readLine();
> }
>
> Yeah, not ideal, but it actually is faster than I expected.
>
> add caching to it.
>
>
> But a question remains:
>
> What is your actual goal? Why can't you use Marvin, for commercial use?
> 1 million is a lot. Using a real database comes to mind.
>
>
>
> 2013/9/19 lochana menikarachchi <locha...@yahoo.com>
>
> Hi Nina,
>
> I did try the RandomAccessSDFReader. It took a few minutes to build the
> index for an SD file with 50,000 structures. What I am saying is that
> whatever MarvinView does to build its index (if it is using an index) is
> much faster. I am wondering how it does that.
>
> Lochana
>
>
> ------------------------------------------------------------------------------
> LIMITED TIME SALE - Full Year of Microsoft Training For Just $49.99!
> 1,500+ hours of tutorials including VisualStudio 2012, Windows 8,
> SharePoint
> 2013, SQL 2012, MVC 4, more. BEST VALUE: New Multi-Library Power Pack
> includes
> Mobile, Cloud, Java, and UX Design. Lowest price ever! Ends 9/20/13.
> http://pubads.g.doubleclick.net/gampad/clk?id=58041151&iu=/4140/ostg.clktrk
> _______________________________________________
> Cdk-user mailing list
> Cdk-user@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/cdk-user

ScrollingSdfReader.7z
Description: Binary data

