Hi Nina,
yes, I agree it's non-trivial, because Java's file I/O is not a very good
API.
getLineSeparator(String data) in my code is a private method that checks
whether the passed-in string contains \r\n; if so, that is chosen as the
line separator, otherwise \n.
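
A minimal sketch of such a helper, based only on the description above. The
wrapper class name LineSeparators is hypothetical, and the method is made
static and non-private here so the snippet stands alone:

```java
// Hypothetical wrapper class, for illustration only.
final class LineSeparators {
    private LineSeparators() {}

    /**
     * Returns "\r\n" if the sample contains that sequence, otherwise "\n".
     * Only meaningful when the whole file uses one separator consistently.
     */
    static String getLineSeparator(String data) {
        return data.contains("\r\n") ? "\r\n" : "\n";
    }
}
```
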
So it works for all files that have a consistent separator. But anyway,
it's just an example, not meant for any production use (there are other
issues I'm aware of).
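
The mixed-separator case Nina raises below (\n, \r\n and \r in one file, even
within one record) is not covered by that approach. A rough sketch of an
indexer that scans bytes directly and treats all three as line ends could
look like this. It is an illustration under assumptions, not the attached
implementation: SdfOffsetIndexer and indexRecordStarts are hypothetical
names, a delimiter line is assumed to be exactly "$$$$" with nothing else on
it, and the code has not been benchmarked.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical class, for illustration only.
final class SdfOffsetIndexer {
    private static final byte[] DELIMITER = {'$', '$', '$', '$'};

    private SdfOffsetIndexer() {}

    /** Returns the byte offset at which each record starts. */
    static List<Long> indexRecordStarts(InputStream in) throws IOException {
        List<Long> starts = new ArrayList<>();
        InputStream src = new BufferedInputStream(in, 1 << 16);
        long pos = 0;                // offset of the byte being examined
        int lineLen = 0;             // bytes seen so far on the current line
        boolean isDelimiter = true;  // current line is still a prefix of "$$$$"
        boolean pendingStart = true; // the next line begins a new record
        boolean afterCr = false;     // previous byte was \r
        int b;
        while ((b = src.read()) != -1) {
            if (afterCr) {
                afterCr = false;
                if (b == '\n') {     // \r\n counts as a single line end
                    pos++;
                    continue;
                }
            }
            if (pendingStart) {      // first byte of the line after "$$$$"
                starts.add(pos);
                pendingStart = false;
            }
            if (b == '\n' || b == '\r') {
                // a line just ended; was it exactly the delimiter?
                pendingStart = isDelimiter && lineLen == DELIMITER.length;
                lineLen = 0;
                isDelimiter = true;
                afterCr = (b == '\r');
            } else {
                isDelimiter = isDelimiter && lineLen < DELIMITER.length
                        && b == DELIMITER[lineLen];
                lineLen++;
            }
            pos++;
        }
        return starts;
    }
}
```

Because the scan keeps its state across reads, a delimiter that falls on a
buffer boundary is handled as well, and no offset is recorded after the final
"$$$$" line when nothing follows it.
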
Best Regards,
Joos
On 20.09.2013 14:23, Nina Jeliazkova wrote:
> Hi Joos, All,
>
>
> On 20 September 2013 15:08, Joos Kiener <j...@sunrise.ch
> <mailto:j...@sunrise.ch>> wrote:
>
> Hi all,
>
> the issue with RandomAccessSDFReader, more specifically the
> underlying RandomAccessReader, is that it uses RandomAccessFile's
> readLine() method. That method performs very badly (it reads the
> file one byte at a time, without buffering). Hence the indexing takes
> very long. A solution would be to rewrite the index method without
> using the readLine() method.
>
>
>
> True. With a caveat (as I've already pointed out in a personal reply to
> Andrew Dalke, who commented with the same reasoning == readline is bad).
>
> Java readline readers are slow, but they provide transparency with
> respect to the different line separators (e.g. \n, \r\n, \r) that
> originate from different operating systems.
>
> Specifically, it is not correct to search only for \n$$$$\n, nor only
> for "\r\n".
>
> In SD files there can be all combinations of line separators, not
> only in one file, but within one record. I can only guess how these
> files were constructed, but they do exist in the wild. And this is
> exactly the reason for using the Java line reader. Of course it
> could be rewritten to match bytes explicitly without relying on
> existing Java classes.
>
> Unrelated to performance - using getLineSeparator() in this context is
> not quite right, as this method will return the OS specific line
> separator, while the SD file being read may have been generated on a
> different OS.
>
>
> I have not looked at the indexing method and what exactly it does,
> but here is a way to index the start (as a byte offset) of every
> record in an SD file into a Map<Integer,Long>; see below.
> Indexing takes 3-4 seconds (on an SSD...) for a 540 MB
> SD file (from ZINC) containing approx. 131'000 records.
> Hence that would be about 30 seconds for 1 million compounds.
>
>
>     private void index() throws IOException {
>
>         sdfIndex.put(0, 0L); // first record starts at byte 0
>         byte[] buffer = new byte[8192];
>         int recordIndex = 0; // key of the last record start stored so far
>         int newLineOffset = 1; // 1 or 2 depending on whether it is \n or \r\n
>         int bytesRead;
>
>         // note: a DELIMITER that straddles two reads is not detected
>         while ((bytesRead = raf.read(buffer)) != -1) {
>
>             // decode only the bytes just read; the rest of the buffer
>             // may still hold data from the previous read
>             String data = new String(buffer, 0, bytesRead, "US-ASCII");
>
>             // determine the new line delimiter once
>             // can be \n for sd-files also on Windows if
>             // they were generated on linux or by certain toolkits
>             if (recordIndex == 0) {
>                 if (getLineSeparator(data).equals("\r\n")) {
>                     newLineOffset = 2;
>                 }
>             }
>
>             ArrayList<Integer> recordEnds = new ArrayList<>();
>             int index = data.indexOf(DELIMITER);
>             while (index >= 0) {
>                 recordEnds.add(index);
>                 index = data.indexOf(DELIMITER, index + 1);
>             }
>             long offsetBeforeRead = raf.getFilePointer() - bytesRead;
>             for (int position : recordEnds) {
>                 // we want to start reading after the delimiter
>                 // on the next new line
>                 // aaaaaa
>                 // $$$$
>                 // bbbbbb <- get offset where this line starts
>                 long recordOffset = offsetBeforeRead + position +
>                         DELIMITER.length() + newLineOffset;
>                 // increment first so that key 0 (set above) is not overwritten
>                 recordIndex++;
>                 sdfIndex.put(recordIndex, recordOffset);
>             }
>         }
>         // sd-files terminate with DELIMITER
>         // hence the last entry in index must be removed as no
>         // record will be there.
>         if (recordIndex > 0) {
>             sdfIndex.remove(recordIndex);
>         }
>     }
>
>
> A full implementation of the above idea is attached. Note that it
> returns text data only, no chemistry. (It works, but is not really
> tested; use at your own risk!)
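
With such an offset index, fetching a single record as text comes down to a
seek and a bounded read. The following is only a sketch under assumptions:
SdfRecordReader and readRecord are hypothetical names, the index is assumed
to map consecutive record numbers to start offsets, and the last record is
assumed to run to the end of the file.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Hypothetical class, for illustration only.
final class SdfRecordReader {
    private SdfRecordReader() {}

    /** Reads record i as text, including its trailing "$$$$" line. */
    static String readRecord(RandomAccessFile raf, Map<Integer, Long> index, int i)
            throws IOException {
        Long start = index.get(i);
        if (start == null) {
            throw new IndexOutOfBoundsException("no record " + i);
        }
        // the record ends where the next one starts, or at end of file
        Long next = index.get(i + 1);
        long end = (next != null) ? next : raf.length();
        byte[] bytes = new byte[(int) (end - start)];
        raf.seek(start);
        raf.readFully(bytes);
        return new String(bytes, StandardCharsets.US_ASCII);
    }
}
```

The returned text still has to be handed to a molfile parser to get any
chemistry out of it.
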
>
>
>
> It is just great that you took the time to improve that code. It's
> quite old already (written ~2007 and originally intended for indexing
> a file of about 40K compounds, so never tested on 1 million...). And
> in fact it is not really used anymore, at least by the original author :)
>
>
>
>
>
> 2013/9/19 lochana menikarachchi <locha...@yahoo.com
> <mailto:locha...@yahoo.com>>
>
>
> Joos,
>
> You are right. I should use a local database instead of SD
> files...
>
>
>
> Indeed. This number of records is good for testing performance, but
> for real use I would vote for a database.
>
> Best regards,
> Nina
>
>
>
>
> Lochana
>
> ------------------------------------------------------------------------
> *From:* Joos Kiener <j...@sunrise.ch <mailto:j...@sunrise.ch>>
> *To:* lochana menikarachchi <locha...@yahoo.com
> <mailto:locha...@yahoo.com>>
> *Cc:* "cdk-user@lists.sourceforge.net
> <mailto:cdk-user@lists.sourceforge.net>"
> <cdk-user@lists.sourceforge.net
> <mailto:cdk-user@lists.sourceforge.net>>
> *Sent:* Thursday, September 19, 2013 9:37 AM
> *Subject:* Re: [Cdk-user] Reading large SD Files
>
> I played around a bit and came up with a crude solution, as I
> mentioned in my initial response:
>
> - Index all occurrences of "$$$$" -> takes 3-4 seconds for a
>   file with 131'000 records.
>
> - Use a separate thread for indexing to increase performance. The
>   current implementation requires that the index is fully built,
>   though. This is an issue as you need to have 2 access mechanisms,
>   index based and not-index based.
>
> - Use BufferedReader to go to the indexed line, e.g.
>
>       for (int i = 0; i < linesToRead; i++) {
>           bufferedReader.readLine();
>       }
>
>   Yeah, not ideal, but it actually is faster than I expected.
>
> - Add caching to it.
>
>
> But a question remains:
>
> What is your actual goal? Why can't you use Marvin, for
> commercial use?
> 1 million is a lot. Using a real database comes to mind.
>
>
>
> 2013/9/19 lochana menikarachchi <locha...@yahoo.com
> <mailto:locha...@yahoo.com>>
>
> Hi Nina,
>
> I did try the RandomAccessSDFReader. It took a few minutes
> to build the index for an SD file with 50,000 structures.
> What I am saying is that whatever MarvinView does to build
> its index (if it is using an index) is much faster. I am
> wondering how it does that.
>
> Lochana
>
>
> ------------------------------------------------------------------------------
> LIMITED TIME SALE - Full Year of Microsoft Training For
> Just $49.99!
> 1,500+ hours of tutorials including VisualStudio 2012,
> Windows 8, SharePoint
> 2013, SQL 2012, MVC 4, more. BEST VALUE: New Multi-Library
> Power Pack includes
> Mobile, Cloud, Java, and UX Design. Lowest price ever!
> Ends 9/20/13.
>
> http://pubads.g.doubleclick.net/gampad/clk?id=58041151&iu=/4140/ostg.clktrk
> _______________________________________________
> Cdk-user mailing list
> Cdk-user@lists.sourceforge.net
> <mailto:Cdk-user@lists.sourceforge.net>
> https://lists.sourceforge.net/lists/listinfo/cdk-user