I'm sorry I didn't look at this earlier. There is one serious architectural 
problem preventing us from merging this (which is otherwise great stuff that 
will save us a significant amount of disk I/O and disk space even when not 
filtering):

WE MUST NOT ACCESS THE DATABASE ON ANY OTHER THREAD THAN THE DATABASE THREAD!

This is a basic rule of the current client layer. Changing it would be 
massively non-trivial given activation issues etc., and since db4o uses 
separate object caches for each parallel client, it might not boost 
performance as much as you might think.

In the SingleFileFetcher case we can deal with this easily enough: just call 
onSuccess(result.asBucket().getInputStream()).

In the SplitFileFetcher case this is way more complex. We have two options:

1. Reverse the flow: run writeDecodedData() on the database thread, passing it 
in as a Generator object with a writeTo(OutputStream, container, context) 
method. Create a pair of pipes, feed the input half to the decompression 
thread manager, and call writeTo() on the database thread to write to the 
output. This implies that hash checking, filtering and in particular writing 
the final data to a bucket occur on a separate thread at the end of the 
decompression chain, so we would need another pair of pipes on the other end 
of the decompression thread manager.
2. Don't reverse the flow: the simplest implementation would be to create a 
shadow bucket (getShadow(), a non-persistent bucket with the same contents) 
for every block to be copied, and then dump them out on a separate thread as 
now. The problem is that this could take a lot of memory, so for larger files 
we would have to schedule a database job to grab the next group of shadow 
buckets every so often (see the sketch just below).
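
As a rough sketch of option 2 (not a definitive design), assuming the existing 
Bucket API (getShadow(), getInputStream(), free()): the batch size, class and 
method names below are invented for illustration.

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.util.ArrayList;
    import java.util.List;

    interface Bucket { // stand-in mirroring the relevant parts of the real API
        Bucket getShadow();
        InputStream getInputStream() throws IOException;
        void free();
    }

    class ShadowBatchCopier {
        static final int BATCH_SIZE = 32; // blocks per database job; a guess

        // Runs ON the database thread: take non-persistent shadows of the
        // next few blocks, so the copy thread never touches the database.
        static List<Bucket> grabShadows(List<Bucket> blocks, int from) {
            List<Bucket> shadows = new ArrayList<Bucket>();
            int to = Math.min(from + BATCH_SIZE, blocks.size());
            for (int i = from; i < to; i++)
                shadows.add(blocks.get(i).getShadow());
            return shadows;
        }

        // Runs OFF the database thread: dump the batch out as now, then
        // schedule another database job for the next batch.
        static void writeBatch(List<Bucket> shadows, OutputStream os)
                throws IOException {
            byte[] buf = new byte[32768];
            for (Bucket b : shadows) {
                InputStream is = b.getInputStream();
                try {
                    int read;
                    while ((read = is.read(buf)) > 0)
                        os.write(buf, 0, read);
                } finally {
                    is.close();
                    b.free(); // shadow is temporary; free it promptly
                }
            }
        }
    }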

IMHO the first strategy is probably better: create a pair of pipes at the 
front, and a pair of pipes at the back. The back end runs hashing, filtering 
and writing to the final bucket (which can be persistent; you can pass an 
OutputStream for a persistent bucket to another thread as long as you ensure 
it's not deactivated in the meantime). The front end is a writeTo() called on 
the database thread. The front end feeds into the middle (the decompressors), 
which then feeds into the back end. Once writeTo() returns we wait for the 
whole thing to finish.
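
To make that concrete, here is a minimal sketch of the layout using java.io 
pipes. Generator is the interface proposed in option 1; startDecompressors() 
and hashFilterAndWrite() are hypothetical stand-ins for the decompression 
thread manager and the back-end stage:

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.io.PipedInputStream;
    import java.io.PipedOutputStream;

    // The proposed interface: writeTo() runs ON the database thread only.
    interface Generator {
        void writeTo(OutputStream os /*, container, context */) throws IOException;
    }

    class ReversedFlow {

        void run(Generator generator) throws IOException {
            // Front pair: database thread -> decompression chain.
            PipedOutputStream frontOut = new PipedOutputStream();
            PipedInputStream frontIn = new PipedInputStream(frontOut);

            // Back pair: decompression chain -> hash/filter/final bucket.
            PipedOutputStream backOut = new PipedOutputStream();
            final PipedInputStream backIn = new PipedInputStream(backOut);

            // Middle: decompressors run on their own threads (not shown).
            startDecompressors(frontIn, backOut);

            // Back end: hashing, filtering and the final write happen on a
            // separate thread; the OutputStream for a persistent bucket is
            // safe here as long as the bucket is not deactivated meanwhile.
            Thread backEnd = new Thread(new Runnable() {
                public void run() {
                    hashFilterAndWrite(backIn);
                }
            });
            backEnd.start();

            // Front end: this call is the only part on the database thread.
            try {
                generator.writeTo(frontOut);
            } finally {
                frontOut.close();
            }
            // Once writeTo() returns, wait for the rest of the chain.
            try {
                backEnd.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }

        void startDecompressors(InputStream in, OutputStream out) { /* ... */ }
        void hashFilterAndWrite(InputStream in) { /* ... */ }
    }

Joining at the end matches "once writeTo() returns we wait for the whole thing 
to finish", though a real implementation might prefer a completion callback so 
as not to tie up the database thread while the tail of the chain drains.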

My summary of the changes so far:

Reviewed up to 2c18269615cf51c24f1826ee8ee1e59caca730af

sajack/ContentFilterDecompressor:
- Pass a List of Compressors into ClientGetter.onSuccess(). Do the 
decompression there.
- DecompressorThreadManager: pipe everything (decompression, possibly with 
multiple decompressors) on separate threads, and return an InputStream (see 
the sketch after this list). Hence chain final data bucket -> decompressor -> 
any more decompressors -> filtering -> hash.
- Remove the decompress(bucket, bucketfactory, ...) API. Always use the 
stream-to-stream API.
- Simplify unit tests slightly.
- Use max length when creating final bucket.
- Cleanups.
- Pass an InputStream directly into ClientGetter and thus chain the final 
segment reassembly to everything else.
- Remove some unnecessary checks, let JPEG filter throw EOF if it needs to.
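
As promised above, a minimal sketch of the pipe-per-decompressor chaining; the 
Decompressor interface here is a stand-in for the real compressor API:

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.io.PipedInputStream;
    import java.io.PipedOutputStream;
    import java.util.List;

    interface Decompressor { // stand-in for the real compressor API
        void decompress(InputStream in, OutputStream out) throws IOException;
    }

    class PipeChain {
        // Wire each decompressor onto its own thread via a pipe; the caller
        // reads the returned stream and sees fully decompressed data, so it
        // can chain straight into filtering and then hashing.
        static InputStream chain(InputStream source, List<Decompressor> ds)
                throws IOException {
            InputStream current = source;
            for (final Decompressor d : ds) {
                final PipedOutputStream out = new PipedOutputStream();
                PipedInputStream next = new PipedInputStream(out);
                final InputStream in = current;
                new Thread(new Runnable() {
                    public void run() {
                        try {
                            d.decompress(in, out);
                        } catch (IOException e) {
                            // record for getError(); the reader sees early EOF
                        } finally {
                            try { out.close(); } catch (IOException e) {}
                        }
                    }
                }).start();
                current = next;
            }
            return current;
        }
    }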

9be670dcd1f7a76acce49de844e95135cc4bd5f5
- Another old bug: why do we pass maxLen * 4? Is this the maximum read? Do we 
actually check the written length here?
- The same old code is duplicated in the archive fetcher callback, and 
probably the multi-level metadata fetcher callback too.
- Chaining to onFailure(): should return afterwards.
- Archive fetcher callback: Bucket orig = result.asBucket() can move out of 
the loop, but it doesn't matter much? Same for multi-level.
- Archive fetcher callback: should return after calling onFailure(). Same for 
multi-level.
- SplitFileFetcher.finalStatus(): we could still use returnBucket if we are 
not compressed...
- SplitFileFetcher.finish(): don't run onFailure() and then onSuccess().

089d7dc4e6ddb518062da2e29287a400f9604a22
DecompressorThreadManager
- maxLen * 4 is still there; what's it for?
- Fix exception handling

a31064ef1ff74ad5e94273089df2e1321034422a
- ClientGetter exception handling.
- ClientGetter: fix the maxLen calculation? Can we pass in the temp length for 
each temporary and the output length for the last one?
- *Better* exception handling: IOException and CompressionOutputSizeException 
are expected, and they have specific error codes to handle them.

33c8bfc3d17e3eaf6a808f2b097b5f2be5910656
- DecompressorThread: null out input and output on successful close

e27a872923c04e4966e80afdaffc17b613c2067e
- bzipcompressor, gzipcompressor: the bytes variable is unnecessary, isn't it?

44e93f1dec58d541dd2920d646311c9f94262555
- Throwing getError() inside a catch block which doesn't throw? Doesn't make 
much sense...

e9e118051d3ef3d46be83f53a65fe83ff98de4d6
- SimpleSingleFileFetcher.onSuccess is broken.
- MAJOR PERSISTENCE BUG: the finalStatus() writer *CANNOT RUN ON ANOTHER 
THREAD BECAUSE IT ACCESSES THE DATABASE*. We need to pass in a closure or 
similar: run the generator on the database thread, and run everything else, up 
to and including writing to the final bucket, on other threads. So the 
architecture is backwards at the moment. Alternatively we could have it run 
forwards as it does now, but we'd need to return a chain of shadow buckets, 
dump them to the first POS off-thread, and periodically come back for more.
- Also exception handling.

6fcbbdad4cfdb787d98d2370cd84d59ff8401c49
- Can get deactivated; use a shadow bucket and it will work fine.

dbf48b90872bd4103d120d84290744ac9c5cd95f
- Wrong: we need to free the data as early as possible.