Hi Joos, All,

On 20 September 2013 15:08, Joos Kiener <j...@sunrise.ch> wrote:

> Hi all,
>
> the issue with RandomAccessSDFReader, more specifically the underlying
> RandomAccessReader, is that it uses RandomAccessFile's readLine() method.
> That method is very, very bad in terms of performance (because it's badly
> written). Hence the indexing takes very long. A solution would be to
> rewrite the index method without using the readLine() method.
>
>
>
True, with a caveat (as I've already pointed out in a personal reply to
Andrew Dalke, who commented with the same reasoning: readLine is bad).

Java readLine-based readers are slow, but they handle the different line
separators (e.g. \n, \r\n, \r) that originate from different operating
systems transparently.

Specifically, it is not correct to search only for \n$$$$\n, nor only for
its \r\n counterpart.

SD files can contain every combination of line separators, not only within
one file but even within one record. I can only guess how such files were
constructed, but they do exist in the wild. That is exactly the reason for
using the Java line reader. Of course, it could be rewritten to match bytes
explicitly without relying on existing Java classes.
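
As a rough illustration of what such byte-level matching could look like,
here is a minimal sketch (not the CDK implementation). It assumes a record
ends with a line consisting of exactly "$$$$" and treats \n, \r\n and \r as
equivalent line ends, so mixed separators within one file or record are
tolerated. The class name SdfOffsetScanner and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: collects the byte offset at which each record starts. */
final class SdfOffsetScanner {
    private final List<Long> offsets = new ArrayList<>();
    private long pos = 0;                    // offset of the byte being processed
    private boolean newRecordPending = true; // the next line starts a new record
    private boolean prevWasCr = false;       // previous byte was '\r'
    private int lineLength = 0;              // bytes seen so far on the current line
    private int leadingDollars = 0;          // '$' bytes at the start of the current line

    /** Processes the first len bytes of buf; may be called chunk by chunk. */
    void feed(byte[] buf, int len) {
        for (int i = 0; i < len; i++) {
            step(buf[i]);
        }
    }

    private void step(byte b) {
        if (b == '\n' && prevWasCr) {
            // Second half of a \r\n pair: the line already ended at '\r'.
            prevWasCr = false;
        } else {
            // First byte of a line that follows a "$$$$" line (or the file start).
            if (lineLength == 0 && newRecordPending) {
                offsets.add(pos);
                newRecordPending = false;
            }
            if (b == '\n' || b == '\r') {
                // End of line: was the line exactly "$$$$"?
                if (lineLength == 4 && leadingDollars == 4) {
                    newRecordPending = true;
                }
                lineLength = 0;
                leadingDollars = 0;
            } else {
                if (b == '$' && leadingDollars == lineLength) {
                    leadingDollars++;
                }
                lineLength++;
            }
            prevWasCr = (b == '\r');
        }
        pos++;
    }

    /** Record start offsets found so far, in file order. */
    List<Long> offsets() {
        return offsets;
    }
}
```

Because the state is kept between calls, the scanner could be fed from a
plain read loop, e.g. while ((n = raf.read(buffer)) != -1) scanner.feed(buffer, n);
and a delimiter that straddles two reads is still recognised. A trailing
"$$$$" at the end of the file does not produce an extra entry, since an
offset is only recorded once a following byte is actually seen.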

Unrelated to performance - using getLineSeparator() in this context is not
quite right, as this method will return the OS-specific line separator,
while the SD file being read may have been generated on a different OS.
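
To make the distinction concrete: System.lineSeparator() reports the
separator of the platform the JVM runs on, while the separator a file
actually uses can only be learned from its content. The sketch below shows
content-based detection; the helper name firstSeparator is hypothetical and
is not the getLineSeparator() from the quoted code, whose implementation is
not shown here.

```java
/** Hypothetical helper for inspecting the separators used in file content. */
final class LineSeparators {

    /** Returns the first line separator found in data, or null if there is none. */
    static String firstSeparator(String data) {
        for (int i = 0; i < data.length(); i++) {
            char c = data.charAt(i);
            if (c == '\n') {
                return "\n";
            }
            if (c == '\r') {
                // A '\r' directly followed by '\n' is one Windows-style separator.
                boolean crlf = i + 1 < data.length() && data.charAt(i + 1) == '\n';
                return crlf ? "\r\n" : "\r";
            }
        }
        return null;
    }
}
```

Note that this only reports the first separator. For files that mix
separators, a single detected value is still not enough, and every line end
has to be examined on its own.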


> I have not looked at the indexing method and what exactly it does, but here
> is a way to index the start (as a byte offset) of every record in an
> SD file into a Map<Integer,Long>; see below.
> Indexing takes 3-4 seconds (on an SSD...) for a 540 MB SD file (from
> ZINC) containing approx. 131'000 records.
> Hence that would be about 30 seconds for 1 million compounds.
>
>
> private void index() throws IOException {
>
>         sdfIndex.put(0, 0L); // first record starts at byte 0
>         byte[] buffer = new byte[8192];
>         int recordIndex = 0; // index of the last record registered so far
>         int newLineOffset = 1; // 1 or 2 depending if it is \n or \r\n
>         int bytesRead;
>
>         while ((bytesRead = raf.read(buffer)) != -1) {
>
>             // decode only the bytes actually read; the rest of the buffer
>             // may still hold data from the previous read
>             String data = new String(buffer, 0, bytesRead, "US-ASCII");
>
>             // determine new line delimiter once
>             // can be \n for sd-files also on Windows if
>             // they were generated on linux or by certain toolkits
>             if (recordIndex == 0) {
>                 if (getLineSeparator(data).equals("\r\n")) {
>                     newLineOffset = 2;
>                 }
>             }
>
>             ArrayList<Integer> recordEnds = new ArrayList<>();
>             int index = data.indexOf(DELIMITER);
>             while (index >= 0) {
>                 recordEnds.add(index);
>                 index = data.indexOf(DELIMITER, index + 1);
>             }
>             long offsetBeforeRead = raf.getFilePointer() - bytesRead;
>             for (int position : recordEnds) {
>                 // we want to start reading after the delimiter
>                 // on the next new line
>                 // aaaaaa
>                 // $$$$
>                 // bbbbbb <- get offset where this line starts
>                 long recordOffset = offsetBeforeRead + position
>                         + DELIMITER.length() + newLineOffset;
>                 // increment first so that entry 0 (set above) is not overwritten
>                 recordIndex++;
>                 sdfIndex.put(recordIndex, recordOffset);
>             }
>         }
>         // NOTE: a DELIMITER that straddles two reads is not detected here
>         // sd-files terminate with DELIMITER
>         // hence the last entry in index must be removed as no
>         // record will be there.
>         if (recordIndex > 0) {
>             sdfIndex.remove(recordIndex);
>         }
>     }
>
>
> See the appended full implementation of the above idea. Note that it returns
> text data only, no chemistry. (It works, but is not really tested; use at
> your own risk!!!)
>
>
>
It is great that you took the time to improve that code. It is quite old
already (written around 2007 and originally intended for indexing files of
about 40K compounds, so never tested on 1 million...). In fact, it is not
really used anymore, at least by the original author :)
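
For context on how a byte-offset index like the quoted one is used once it
is built: a record can be fetched with a single seek instead of reading the
file line by line. The following is a minimal sketch, assuming the start
offset of the record and of the following record (or the file length for
the last record) are known; SdfRecordReader and readRecord are hypothetical
names, and like the quoted code it returns text only, no chemistry.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: reads one record given its byte range. */
final class SdfRecordReader {

    /** Returns the bytes in [start, end) decoded as US-ASCII text. */
    static String readRecord(RandomAccessFile raf, long start, long end)
            throws IOException {
        byte[] bytes = new byte[(int) (end - start)];
        raf.seek(start);      // jump directly to the record
        raf.readFully(bytes); // read exactly the record's bytes
        return new String(bytes, StandardCharsets.US_ASCII);
    }
}
```

The returned text still includes the trailing "$$$$" line, which a caller
could strip or hand to a parser as is.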




>
> 2013/9/19 lochana menikarachchi <locha...@yahoo.com>
>
>>
>> Joos,
>>
>> You are right. I should use a local database instead of SD files...
>>
>

Indeed. This number of records is good for testing performance, but for
real use I would vote for a database.

Best regards,
Nina



>
>> Lochana
>>   ------------------------------
>>  *From:* Joos Kiener <j...@sunrise.ch>
>> *To:* lochana menikarachchi <locha...@yahoo.com>
>> *Cc:* "cdk-user@lists.sourceforge.net" <cdk-user@lists.sourceforge.net>
>> *Sent:* Thursday, September 19, 2013 9:37 AM
>> *Subject:* Re: [Cdk-user] Reading large SD Files
>>
>> I played around a bit and came up with a crude solution, as I mentioned
>> in my initial response.
>>
>> Index all occurrences of "$$$$" -> takes 3-4 seconds for a file with
>> 131'000 records.
>>
>> Use a separate thread for indexing to increase performance, but the current
>> implementation requires that the index is fully built. This is an issue, as
>> you need to have 2 access mechanisms, index-based and non-index-based.
>>
>>
>> Use BufferedReader to go to the indexed line, e.g.
>>
>> for (int i = 0; i < linesToRead; i++) {
>>     bufferedReader.readLine();
>> }
>>
>> Yeah, not ideal, but it actually is faster than I expected.
>>
>> Add caching to it.
>>
>>
>> But a question remains:
>>
>> What is your actual goal? Why can't you use Marvin, for commercial use?
>> 1 million is a lot. Using a real database comes to mind.
>>
>>
>>
>> 2013/9/19 lochana menikarachchi <locha...@yahoo.com>
>>
>> Hi Nina,
>>
>> I did try the RandomAccessSDFReader. It took a few minutes to build the
>> index for an SD file with 50,000 structures. What I am saying is that
>> whatever MarvinView does to build its index (if it is using an index) is
>> much faster. I am wondering how it does that.
>>
>> Lochana
>>
>>
>> ------------------------------------------------------------------------------
>> LIMITED TIME SALE - Full Year of Microsoft Training For Just $49.99!
>> 1,500+ hours of tutorials including VisualStudio 2012, Windows 8,
>> SharePoint
>> 2013, SQL 2012, MVC 4, more. BEST VALUE: New Multi-Library Power Pack
>> includes
>> Mobile, Cloud, Java, and UX Design. Lowest price ever! Ends 9/20/13.
>>
>> http://pubads.g.doubleclick.net/gampad/clk?id=58041151&iu=/4140/ostg.clktrk
>> _______________________________________________
>> Cdk-user mailing list
>> Cdk-user@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/cdk-user