Hi guys, just for my understanding ...
> Probably not in the Lucene index files itself. Text extraction could be used
> without using the Lucene index, for example to display the text content of a
> PDF file. The text extraction module could store the DataIdentifier together
> with the extracted text ('payload'). The advantage to store this 'payload'
> near the actual binary is that the data is deleted when the binary is garbage
> collected. So maybe it actually is better to store the 'payload' (extracted
> text, virus scanner flag, thumbnail) near the binary, so it is automatically
> garbage collected when the binary is garbage collected. We would need to
> define an API and the behavior for this 'payload storage'. It probably
> doesn't need to be transactional, but it needs to be consistent (a checksum).
> Some kind of binary properties file maybe, with put(String key, InputStream
> payload), and InputStream get(String key). ...

Thomas, you are talking about a text extraction module, but I can't quite follow you. As far as I understand, you would change the architecture towards modules such as text extraction, virus scanner or thumbnail builder, so that they can store their result together with the DataIdentifier in the DataStore? And if the GC deletes an entry from the DataStore based on the DataIdentifier, the 'near' information is deleted automatically as well?

As you wrote, the main problem is that we do not know whether we have already processed a binary. It would be fine if we internally create a DataIdentifier for a binary stream and hand it to modules like the text extractor, so they can look up a result that has already been processed and stored. Is that also what you think? ;-)
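
Just to check that I understood the 'payload storage' idea, here is a rough sketch of how I imagine the interface. Only put/get with a String key and an InputStream come from your mail; the name BinaryPayloadStore, the extra DataIdentifier parameter and the IOException are just my assumptions, not an existing Jackrabbit API:

import java.io.IOException;
import java.io.InputStream;

import org.apache.jackrabbit.core.data.DataIdentifier;

/**
 * Sketch of the proposed 'payload storage': extra data (extracted text,
 * virus scanner flag, thumbnail) stored next to a binary in the DataStore
 * and garbage collected together with it.
 */
public interface BinaryPayloadStore {

    /**
     * Store a payload (for example key "extracted-text" or "thumbnail")
     * for the binary identified by the given DataIdentifier.
     */
    void put(DataIdentifier identifier, String key, InputStream payload)
            throws IOException;

    /**
     * Return the previously stored payload, or null if this binary has
     * not been processed yet.
     */
    InputStream get(DataIdentifier identifier, String key)
            throws IOException;
}

A text extraction module would then first call get(identifier, "extracted-text") and only run the (expensive) extraction if it returns null, which is exactly the "have we already processed this binary" check I mean above.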