Re: [polyml] Reducing the size of on-disk saved state

Matthew Fernandez Wed, 24 Feb 2016 17:25:07 -0800

On 18/02/16 05:34, David Matthews wrote:

On 17/02/2016 01:14, Matthew Fernandez wrote:

On 16/02/16 00:12, David Matthews wrote:
 > From a quick look at the code the main effect that child states have
 > is that StateLoader::LoadFile needs to seek within the saved state file to
 > get the name of the parent file.  That has to be loaded before the child
 > because the child may, almost certainly will, overwrite some of the parent
 > data.  That may affect how you compact the data.  How well do the
 > compression libraries cope with seeking within the file?


Admittedly this is not something I had thought to look for, and now that
I do I note there are seeks performed during state *saving* as well
where Poly/ML overwrites data at the start of the stream. A cursory
glance at the LZO API makes it seem like a requirement to seek the
stream may well be a deal breaker... Rafal Kolanski, can you comment any
more on this?


It may be possible to rework the code to avoid the seeks.  Perhaps it would be 
easier to compact each section of the
data separately rather than process the file as a whole, if that's possible.  
The seeks are just to move between sections.

Of more concern is that LZO is licensed under GPL rather than LGPL. Poly/ML is 
licensed under LGPL and that means that
it cannot include or even link to LZO without coming under GPL.  That doesn't 
preclude experimenting with it but for
distribution I'd prefer a library that didn't have these problems.

David


I think the salient points at this stage are the following:

1. Poly/ML performs seeks on the save path to rewind the FD and update metadata including byte offsets of other sections in the file. Here I'm referring to SaveRequest::Perform.

2. LZO is GPL v2, while Poly/ML is LGPL v2.1. Thanks David and Rob for correcting me; I had misread the licence.

3. LZO streams do not appear to be seekable. Gzip streams seem seekable only for reading, and this is acknowledged to be slow.
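
To make point 3 concrete, here is a minimal sketch using Python's standard gzip module rather than Poly/ML's C++ code. The function names are made up for illustration. As far as I can tell the module behaves the same way as zlib's gzseek here: a backward seek is refused when writing, and a seek when reading works but only by decompressing up to the requested offset.

```python
import gzip
import io


def backward_seek_allowed_when_writing() -> bool:
    """Return True if a gzip stream opened for writing accepts a backward seek."""
    sink = io.BytesIO()
    with gzip.GzipFile(fileobj=sink, mode="wb") as stream:
        stream.write(b"header placeholder")
        stream.write(b"section data")
        try:
            # This is the kind of rewind the save path would need
            # in order to patch the header after the fact.
            stream.seek(0)
        except OSError:
            return False
    return True


def read_after_seek(payload: bytes, offset: int, length: int) -> bytes:
    """Compress payload, then seek to offset in the stream and read length bytes."""
    compressed = gzip.compress(payload)
    with gzip.GzipFile(fileobj=io.BytesIO(compressed), mode="rb") as stream:
        # Allowed, but implemented by decompressing everything before offset.
        stream.seek(offset)
        return stream.read(length)
```

The read-side seek is what makes the load path the lesser problem: it is slow but it works, whereas the write-side rewind is simply not available.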

The naïve ways I can see of working around (1) are either (a) construct the entire state in memory first, then stream it out to a compressed file, (b) effectively run the save state logic twice to predict the offset values in the first pass, so the second pass that does the actual writing can run linearly uninterrupted, or (c) write the state out, then compress it to a second file and delete the first. None of these are particularly palatable to me. David, you mentioned that it might be possible to avoid the seeks. Did you have a different idea?
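
One possible reading of the suggestion to compress each section separately can be sketched as follows. This is an illustration in Python, not Poly/ML's C++ and not its real saved state format: the MAGIC signature, the header layout, and the names save_sections and load_section are all hypothetical. The idea is that once every section has been compressed on its own, all sizes are known, so the offset table can be written first and the file emitted with forward writes only.

```python
import struct
import zlib

MAGIC = b"SKETCH01"  # hypothetical signature, not Poly/ML's


def save_sections(out, sections):
    """Write independently compressed sections to out without seeking.

    Layout (an assumption for this sketch):
      MAGIC | uint32 count | count * (uint32 offset, uint32 size) | blobs
    """
    # Compress each section on its own so its final size is known up front.
    blobs = [zlib.compress(section) for section in sections]

    header_size = len(MAGIC) + 4 + 8 * len(blobs)
    out.write(MAGIC)
    out.write(struct.pack("<I", len(blobs)))

    # The offsets follow from the compressed sizes, so no rewind is needed.
    offset = header_size
    for blob in blobs:
        out.write(struct.pack("<II", offset, len(blob)))
        offset += len(blob)

    for blob in blobs:
        out.write(blob)


def load_section(data, index):
    """Decompress one section from a byte string produced by save_sections."""
    (count,) = struct.unpack_from("<I", data, len(MAGIC))
    if not 0 <= index < count:
        raise IndexError("no such section")
    offset, size = struct.unpack_from("<II", data, len(MAGIC) + 4 + 8 * index)
    return zlib.decompress(data[offset:offset + size])
```

This is really a milder form of (a): it buffers the compressed sections rather than the raw state, so the RAM cost depends on how well the data compresses. Because each section is a separate compressed blob, the load path can jump straight to the section holding the parent file name without decompressing anything else.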

As you've noted, there are also seeks on the load path, but to me this is a lesser hurdle to overcome than the seeks on the save path.

As for (2), the licensing issue... this appears to be a show stopper for using LZO. As I've said, I'm not wed to any particular compression algorithm, so I'm happy to revert to Gzip or another suggestion if there's one. For my own use case, my precious resources are RAM and disk space. Runtime is not a concern to me as this operation is already dwarfed by other things on my critical path. I suspect this is not the case for others, so it may make sense to implement this as an opt-in feature. As always, any and all comments welcome.

_______________________________________________
polyml mailing list
[email protected]
http://lists.inf.ed.ac.uk/mailman/listinfo/polyml
