On 01/02/11 16:36, Stephen Allen wrote:
Andy,
I have started implementing the serializer (SinkBindingOutput) by using
org.openjena.riot.SinkQuadOutput as a guide and using OutputLangUtils to
print out the variable/values. I created the deserializer (LangBindings) by
extending org.openjena.riot.lang.LangNTuple. I'm using the paired var/value
format you described below. For now I'll start with a straightforward
implementation with no compression, but I like your ideas in this area. I'll
try to do some measurements to see if any other compression is beneficial.
Sounds good.
I did not define an org.openjena.riot.Lang enum for the deserializer
(because it isn't an RDF language) but I was planning on putting the
LangBindings class in the org.openjena.riot.lang package.
As good a place as any at the moment.
I'm just digging out some code that does tuple I/O from an
experimental system from a while ago (a clustered query engine ..).
For determining when to spill bindings to disk, there are a few options (in
order of least difficulty):
1) Store binding objects in a list, and then spill them to disk once the
list size passes a threshold
2) Start serializing bindings immediately into something like
DeferredFileOutputStream [1] that will retain the data in memory until it
passes a memory threshold
3) Do 1), but try to calculate the size of the bindings in memory and use a
memory threshold instead of a number of bindings threshold
I think 1) should be sufficient if we come up with a reasonable guess for
the threshold. Option 2) lets you get much better control over the memory
management, but I think the cost of unnecessarily serializing/deserializing
small queries may be too high.
Personally, I'd encapsulate this in a policy object and have different
implementations. Well, maybe just one implementation - case 1 with a
settable threshold for testing. (3) then becomes a smarter policy
object to be done later, if needed.
I share your concern on (2) about the serialization to memory costs.
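A minimal sketch of what such a policy object might look like, in Java.
All of the names here (SpillPolicy, CountThresholdPolicy, SpillingBuffer)
are hypothetical and are not existing ARQ or RIOT classes. The spill step
is a placeholder that only counts batches, and "passes a threshold" is
read as "reaches the threshold", which is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical policy interface: decides when buffered items should be spilled. */
interface SpillPolicy<T> {
    /** Record that an item was added to the in-memory buffer. */
    void increment(T item);

    /** True once the in-memory buffer should be written out. */
    boolean shouldSpill();

    /** Called after the buffer has been written out and cleared. */
    void reset();
}

/** Case 1: spill after a fixed number of items; the threshold is settable for testing. */
final class CountThresholdPolicy<T> implements SpillPolicy<T> {
    private long threshold;
    private long count = 0;

    CountThresholdPolicy(long threshold) {
        setThreshold(threshold);
    }

    void setThreshold(long threshold) {
        if (threshold < 1) {
            throw new IllegalArgumentException("threshold must be >= 1");
        }
        this.threshold = threshold;
    }

    @Override public void increment(T item) { count++; }

    // Assumption: spill as soon as the count reaches the threshold.
    @Override public boolean shouldSpill() { return count >= threshold; }

    @Override public void reset() { count = 0; }
}

/** Buffer that consults the policy on every add. */
final class SpillingBuffer<T> {
    private final SpillPolicy<T> policy;
    private final List<T> inMemory = new ArrayList<>();
    private int spillCount = 0;

    SpillingBuffer(SpillPolicy<T> policy) {
        this.policy = policy;
    }

    void add(T item) {
        inMemory.add(item);
        policy.increment(item);
        if (policy.shouldSpill()) {
            // Placeholder: a real implementation would serialize
            // inMemory to a temporary file here before clearing it.
            spillCount++;
            inMemory.clear();
            policy.reset();
        }
    }

    int spillCount() { return spillCount; }

    int inMemorySize() { return inMemory.size(); }
}
```

A smarter policy for (3) would implement the same interface and use the
item passed to increment() to estimate memory use, so the buffer code
would not need to change.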
Andy