I guess what I'm really looking for is "derivative", since in the OCR case, 
the text is a derivative of the image, and in the PDF case the image is a 
derivative of a multi-page PDF. Does something like that exist already?


On Thursday, March 9, 2017 at 9:44:40 PM UTC-8, Steve Armstrong wrote:
>
> I'm working a bit on the scanningcabinet port and have two data model 
> questions that might be related, and might be generic enough for camlistore 
> in general to be a single "dataSource" predicate:
>
> 1. How do I store a text blob created from an image?
>
> OCR is expensive (maybe lots of CPU cycles locally, maybe I actually pay 
> to pass it through a service) so I don't want the text stored in an index 
> where each server must extract it. A full-text index could search through 
> these blobs, but I want to only create the blob once across my network of 
> camlistores. It's also lossy, so the canonical source is still the image, 
> and that's where the permanode should be.
>
> 2. How do I store an image extracted from a PDF?
>
> In this case, I'm pulling a PDF apart by generating an image for each 
> page. The images have permanodes and tags, so they are their own objects. 
> They are lossy though, so I might need to refer to the parent PDF to read 
> something properly.
>
>
> Is there a concept of "camlistore:dataSource" or something that I should 
> be using?
>
