Hi,

The problem is: "process the binary only once".

By 'process' we meant 'text extraction', but it could just as well be 'virus scan', 'index', 'create a thumbnail', 'transfer' (to the client or from the client), or 'backup': any expensive task.

I believe a good solution is to provide the object identity to the module (the text extraction engine, virus scanner, and so on), so that the module can decide for itself what to do. Instead of returning an InputStream, Jackrabbit would return a DataStoreInputStream with the additional method getDataIdentifier(). The module can then read the identifier, check whether that item has already been processed, and skip reading the data if it has. I believe that would be a flexible solution.

How the module stores its data for this object (the metadata) is module specific. I don't think the best solution is to always store it in a file or stream close to the binary. For text extraction, a separate file may make sense, but probably not for 'virus scan', because that's only a flag (you don't need the data). Thumbnails: for better performance you want to keep them together, and not save them separately (that is, in the data store).

Regards,
Thomas
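
P.S. To make the idea concrete, here is a rough sketch, not actual Jackrabbit code. DataStoreInputStream and getDataIdentifier() are the names proposed above; representing the identifier as a String, the TextExtractor class, its in-memory cache and the streamsRead counter are only assumptions for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class DataIdentifierSketch {

    /** Proposed stream type (hypothetical): a plain stream plus the identity of the binary. */
    static class DataStoreInputStream extends FilterInputStream {
        private final String dataIdentifier; // assumption: identifier modelled as a String

        DataStoreInputStream(InputStream in, String dataIdentifier) {
            super(in);
            this.dataIdentifier = dataIdentifier;
        }

        String getDataIdentifier() {
            return dataIdentifier;
        }
    }

    /** Hypothetical module: keeps its own metadata (here, extracted text) per identifier. */
    static class TextExtractor {
        private final Map<String, String> cache = new HashMap<>();
        int streamsRead = 0; // counts how often the binary was actually read

        String extract(InputStream in) throws IOException {
            String id = null;
            if (in instanceof DataStoreInputStream) {
                id = ((DataStoreInputStream) in).getDataIdentifier();
                String cached = cache.get(id);
                if (cached != null) {
                    return cached; // already processed: do not read the data again
                }
            }
            streamsRead++;
            // Stand-in for the expensive work: read everything and decode it as UTF-8.
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            String text = new String(out.toByteArray(), StandardCharsets.UTF_8);
            if (id != null) {
                cache.put(id, text);
            }
            return text;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello".getBytes(StandardCharsets.UTF_8);
        TextExtractor extractor = new TextExtractor();

        // Two streams over the same binary carry the same identifier.
        String first = extractor.extract(
                new DataStoreInputStream(new ByteArrayInputStream(data), "id-1"));
        String second = extractor.extract(
                new DataStoreInputStream(new ByteArrayInputStream(data), "id-1"));

        System.out.println(first + " / " + second);
        System.out.println("streams read: " + extractor.streamsRead);
    }
}
```

A virus scanner would follow the same pattern but would only need to keep a flag per identifier, which is why I would leave the storage of the metadata to each module.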
