hi guys,

just for my understanding ...

> Probably not in the Lucene index files itself. Text extraction could be used 
> without using the Lucene index, for example to display the text content of a 
> PDF file. The text extraction module could store the DataIdentifier together 
> with the extracted text ('payload'). The advantage to store this 'payload' 
> near the actual binary is that the data is deleted when the binary is garbage 
> collected. So maybe it actually is better to store the 'payload' (extracted 
> text, virus scanner flag, thumbnail) near the binary, so it is automatically 
> garbage collected when the binary is garbage collected. We would need to 
> define an API and the behavior for this 'payload storage'. It probably 
> doesn't need to be transactional, but it needs to be consistent (a checksum). 
> Some kind of binary properties file maybe, with put(String key, InputStream 
> payload), and InputStream get(String key).
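
To make sure I read that correctly, such a 'binary properties' payload storage could 
look roughly like this (only put/get are taken from your mail; the interface name, the 
extra delete method and the exceptions are my guess):

    import java.io.IOException;
    import java.io.InputStream;

    /**
     * Sketch of the 'payload storage' ("binary properties file"):
     * one instance per binary / DataIdentifier, holding extracted text,
     * a virus scanner flag, a thumbnail, etc. Not transactional, but each
     * entry would carry a checksum so reads can detect corruption.
     */
    public interface BinaryProperties {

        /** Store a payload such as "extracted-text", "virus-scan", "thumbnail". */
        void put(String key, InputStream payload) throws IOException;

        /** Read a payload back, or return null if nothing was stored under this key. */
        InputStream get(String key) throws IOException;

        /** Called when the binary itself is garbage collected. */
        void deleteAll() throws IOException;
    }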

... Thomas, you are talking about a text extraction module, but I cannot quite follow 
you.
As far as I understand it, you want to change the architecture towards modules such as 
text extraction, virus scanner or thumbnail builder, so that they can store their 
result together with the DataIdentifier in the DataStore?
And if the GC deletes an entry in the DataStore based on the DataIdentifier, the 
information stored "near" it will also be deleted automatically?

As you wrote, the main problem is that we do not know whether we have already 
processed a binary.
It would be fine if we internally created a DataIdentifier for a binary stream and 
passed it to modules like the text extractor (or whatever else), so they can look up a 
result that was already processed and stored earlier.
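
For example, a text extraction module could then work along these lines (again just a 
sketch with invented names; I assume the DataIdentifier is the content hash of the 
stream, and the hashing/extraction bodies are placeholders):

    import java.io.IOException;
    import java.io.InputStream;

    // Hypothetical use by a text extraction module: compute the DataIdentifier
    // of the incoming binary and only extract if no result was stored before.
    public class CachingTextExtractor {

        private final PayloadLookup payloads;   // stand-in for the payload storage

        public CachingTextExtractor(PayloadLookup payloads) {
            this.payloads = payloads;
        }

        public InputStream extract(InputStream binary) throws IOException {
            String id = computeDataIdentifier(binary);           // e.g. content hash
            InputStream cached = payloads.get(id, "extracted-text");
            if (cached != null) {
                return cached;                                   // already processed earlier
            }
            InputStream text = doExtract(binary);                // run the real extraction
            payloads.put(id, "extracted-text", text);
            return payloads.get(id, "extracted-text");           // re-read the stored copy
        }

        // Placeholders only, so the sketch compiles.
        private String computeDataIdentifier(InputStream in) { return "hash-of-content"; }
        private InputStream doExtract(InputStream in) { return in; }

        public interface PayloadLookup {
            InputStream get(String dataIdentifier, String key) throws IOException;
            void put(String dataIdentifier, String key, InputStream payload) throws IOException;
        }
    }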

Is that also what you think? ;-)
