Re: [Cdk-user] Reading large SD Files

Joos Kiener Fri, 20 Sep 2013 06:51:15 -0700

Hi Nina,

yes, I agree it's non-trivial, since Java's file I/O is not a very good
API.


getLineSeparator(String data) in my code is a private method that checks
whether the passed-in string contains \r\n; if so, that is chosen as the
line separator, otherwise \n.
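
A minimal sketch of what such a helper could look like, assuming it only
needs to tell \r\n apart from \n (the class name and the main method are
made up for illustration and are not part of the attached code):

```java
final class LineSeparatorSniffer {

    // Returns "\r\n" if the sample contains a Windows line ending, else "\n".
    // A lone "\r" or mixed line endings are not detected by this check.
    static String getLineSeparator(String data) {
        return data.contains("\r\n") ? "\r\n" : "\n";
    }

    public static void main(String[] args) {
        System.out.println(getLineSeparator("a\r\n$$$$\r\n").length()); // prints 2
        System.out.println(getLineSeparator("a\n$$$$\n").length());     // prints 1
    }
}
```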

So it works for all files that have a consistent separator. But anyway,
it's just an example and not meant for any production use (there are
other issues I'm aware of).
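
To cope with the mixed line separators Nina describes in the quoted
message below, one option is to scan the raw bytes and treat \n, \r\n
and a lone \r all as line ends. The following is only a sketch under
that assumption, not the attached implementation and not CDK code: the
class SdfOffsetIndexer and its methods index and readRecord are
hypothetical names, and a record is assumed to end at a line that is
exactly $$$$.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

final class SdfOffsetIndexer {

    /** Returns the byte offset at which each record starts. */
    static List<Long> index(Path sdf) throws IOException {
        List<Long> starts = new ArrayList<>();
        try (InputStream in = new BufferedInputStream(Files.newInputStream(sdf), 1 << 16)) {
            long pos = 0;                // offset of the byte currently examined
            boolean expectRecord = true; // the next line begins a new record
            boolean prevWasCr = false;   // previous byte was '\r'
            int lineLen = 0;             // bytes on the current line, terminator excluded
            int dollars = 0;             // how many of those bytes were '$'
            int b;
            while ((b = in.read()) != -1) {
                if (b == '\n' && prevWasCr) {
                    // second half of \r\n: the line was already closed at the \r
                    prevWasCr = false;
                } else {
                    if (expectRecord) {
                        starts.add(pos);
                        expectRecord = false;
                    }
                    if (b == '\n' || b == '\r') {
                        // a line consisting solely of "$$$$" closes a record
                        if (lineLen == 4 && dollars == 4) {
                            expectRecord = true;
                        }
                        lineLen = 0;
                        dollars = 0;
                    } else {
                        lineLen++;
                        if (b == '$') {
                            dollars++;
                        }
                    }
                    prevWasCr = (b == '\r');
                }
                pos++;
            }
        }
        return starts;
    }

    /** Reads record i as text by seeking directly to its indexed offset. */
    static String readRecord(Path sdf, List<Long> starts, int i) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(sdf.toFile(), "r")) {
            long start = starts.get(i);
            long end = (i + 1 < starts.size()) ? starts.get(i + 1) : raf.length();
            byte[] bytes = new byte[(int) (end - start)];
            raf.seek(start);
            raf.readFully(bytes);
            return new String(bytes, StandardCharsets.US_ASCII);
        }
    }

    public static void main(String[] args) throws IOException {
        // three tiny fake records, each closed with a different line ending
        Path tmp = Files.createTempFile("mixed", ".sdf");
        Files.write(tmp, "mol1\r\n$$$$\r\nmol2\n$$$$\rmol3\n$$$$\n"
                .getBytes(StandardCharsets.US_ASCII));
        List<Long> starts = index(tmp);
        System.out.println(starts); // prints [0, 12, 22]
        Files.delete(tmp);
    }
}
```

Reading one byte at a time through a BufferedInputStream is not the
fastest possible approach, but it avoids RandomAccessFile.readLine() and
keeps all the separator handling in one place. A trailing $$$$ at the
end of the file does not produce an empty last entry, because an offset
is only recorded once another byte follows.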

Best Regards,

Joos

On 20.09.2013 14:23, Nina Jeliazkova wrote:
> Hi Joos, All,
>
>
> On 20 September 2013 15:08, Joos Kiener <j...@sunrise.ch
> <mailto:j...@sunrise.ch>> wrote:
>
>     Hi all,
>
>     the issue with RandomAccessSDFReader, more specifically the
>     underlying RandomAccessReader, is that it uses RandomAccessFile's
>     readLine() method. That method performs very poorly because it
>     reads the file one byte at a time without any buffering. Hence the
>     indexing takes very long. A solution would be to rewrite the index
>     method without using readLine().
>
>
>
> True. With a caveat (as I've already pointed out in a personal reply to
> Andrew Dalke, who commented with the same reasoning, i.e. readLine is bad).
>
> Java readLine readers are slow, but they provide transparency with
> respect to the different line separators (e.g. \n, \r\n, \r) that
> originate from different operating systems.
>
> Specifically, it is not correct to search only for \n$$$$\n, nor only
> for the "\r\n" variant.
>
> In SD files there can be all combinations of line separators, not
> only in one file, but within one record. I can only guess how these
> files were constructed, but they do exist in the wild. And the
> reason for using the Java line reader is exactly this. Of course it
> could be rewritten to match bytes explicitly without relying on
> existing Java classes.
>
> Unrelated to performance - using getLineSeparator() in this context is
> not quite right, as this method will return the OS specific line
> separator, while the SD file being read may have been generated on a
> different OS.
>  
>
>     I have not looked at the indexing method and what exactly it does,
>     but here is a way to index the start (as a byte offset) of every
>     record in an SD file into a Map<Integer,Long>; see below.
>     Indexing takes 3-4 seconds (on an SSD...) for a 540 MB
>     SD file (from ZINC) containing approx. 131'000 records,
>     hence that would be about 30 seconds for 1 million compounds.
>
>
>     // Requires java.io.IOException, java.nio.charset.StandardCharsets,
>     // java.util.ArrayList and java.util.List.
>     private void index() throws IOException {
>
>             sdfIndex.put(0, 0L); // first record starts at byte 0
>             byte[] buffer = new byte[8192];
>             int recordIndex = 0;
>             int newLineOffset = 1; // 1 or 2 depending on whether it is \n or \r\n
>             int bytesRead;
>
>             while ((bytesRead = raf.read(buffer)) != -1) {
>
>                 long offsetBeforeRead = raf.getFilePointer() - bytesRead;
>                 // decode only the bytes actually read; the tail of the
>                 // buffer may still hold data from the previous read
>                 String data = new String(buffer, 0, bytesRead,
>                         StandardCharsets.US_ASCII);
>
>                 // determine new line delimiter once
>                 // can be \n for sd-files also on Windows if
>                 // they were generated on linux or by certain toolkits
>                 if (recordIndex == 0) {
>                     if (getLineSeparator(data).equals("\r\n")) {
>                         newLineOffset = 2;
>                     }
>                 }
>
>                 List<Integer> recordEnds = new ArrayList<>();
>                 int index = data.indexOf(DELIMITER);
>                 while (index >= 0) {
>                     recordEnds.add(index);
>                     index = data.indexOf(DELIMITER, index + DELIMITER.length());
>                 }
>                 for (int position : recordEnds) {
>                     // we want to start reading after the delimiter
>                     // on the next new line
>                     // aaaaaa
>                     // $$$$
>                     // bbbbbb <- get offset where this line starts
>                     long recordOffset = offsetBeforeRead + position
>                             + DELIMITER.length() + newLineOffset;
>                     // increment first so that entry 0 is not overwritten
>                     sdfIndex.put(++recordIndex, recordOffset);
>                 }
>                 // step back a little so that a delimiter split across
>                 // two reads is seen whole in the next buffer
>                 if (bytesRead == buffer.length) {
>                     raf.seek(raf.getFilePointer() - (DELIMITER.length() - 1));
>                 }
>             }
>             // sd-files terminate with DELIMITER
>             // hence the last entry in index must be removed as no
>             // record will be there.
>             if (recordIndex > 0) {
>                 sdfIndex.remove(recordIndex);
>             }
>         }
>
>
>     See the attached full implementation of the above idea. Note that
>     it returns text data only, no chemistry. (It works, but is not
>     really tested; use at your own risk!)
>
>
>
> It is just great that you took the time to improve that code. It's
> quite old already (written ~2007 and originally intended for indexing
> a file of about 40K compounds, so never tested on 1 million...). And
> in fact it is not really used anymore, at least by the original author :)
>
>
>  
>
>
>     2013/9/19 lochana menikarachchi <locha...@yahoo.com
>     <mailto:locha...@yahoo.com>>
>
>
>         Joos,
>
>         You are right. I should use a local database instead of SD
>         files...
>
>
>
> Indeed. This number of records is good for testing performance, but
> for real use I would vote for a database. 
>
> Best regards,
> Nina
>
>  
>
>
>         Lochana
>         
> ------------------------------------------------------------------------
>         *From:* Joos Kiener <j...@sunrise.ch <mailto:j...@sunrise.ch>>
>         *To:* lochana menikarachchi <locha...@yahoo.com
>         <mailto:locha...@yahoo.com>>
>         *Cc:* "cdk-user@lists.sourceforge.net
>         <mailto:cdk-user@lists.sourceforge.net>"
>         <cdk-user@lists.sourceforge.net
>         <mailto:cdk-user@lists.sourceforge.net>>
>         *Sent:* Thursday, September 19, 2013 9:37 AM
>         *Subject:* Re: [Cdk-user] Reading large SD Files
>
>         I played around a bit and came up with a crude solution, as I
>         mentioned in my initial response.
>
>         index all occurrences of "$$$$" -> takes 3-4 seconds for a
>         file with 131'000 records
>
>         use separate thread to index to increase performance but
>         current implementation requires that index is fully built.
>         This is an issue as you need to have 2 access mechanisms,
>         index based and not-index based.
>
>
>         Use BufferedReader to go to the indexed line, e.g.
>
>         for (int i = 0; i < linesToRead; i++) {
>                         bufferedReader.readLine();
>                     }
>
>         yeah, not ideal but it actually is faster than I expected.
>
>         add caching to it.
>
>
>         But a question remains:
>
>         What is your actual goal? Why can't you use Marvin, for
>         commercial use?
>         1 million is a lot. Using a real database comes to mind.
>
>
>
>         2013/9/19 lochana menikarachchi <locha...@yahoo.com
>         <mailto:locha...@yahoo.com>>
>
>             Hi Nina,
>
>             I did try the RandomAccessSDFReader. It took a few minutes
>             to build the index for an SD file with 50,000 structures.
>             What I am saying is that whatever MarvinView does to build
>             its index (if it is using an index) is much faster. I am
>             wondering how it does that.
>
>             Lochana
>
>             
> ------------------------------------------------------------------------------
>             LIMITED TIME SALE - Full Year of Microsoft Training For
>             Just $49.99!
>             1,500+ hours of tutorials including VisualStudio 2012,
>             Windows 8, SharePoint
>             2013, SQL 2012, MVC 4, more. BEST VALUE: New Multi-Library
>             Power Pack includes
>             Mobile, Cloud, Java, and UX Design. Lowest price ever!
>             Ends 9/20/13.
>             
> http://pubads.g.doubleclick.net/gampad/clk?id=58041151&iu=/4140/ostg.clktrk
>             _______________________________________________
>             Cdk-user mailing list
>             Cdk-user@lists.sourceforge.net
>             <mailto:Cdk-user@lists.sourceforge.net>
>             https://lists.sourceforge.net/lists/listinfo/cdk-user
>
>
>
>
>
>         
>
>
>
>     
>
>

