I guess what I'm really looking for is "derivative", since in the OCR case, the text is a derivative of the image, and in the PDF case the image is a derivative of a multi-page PDF. Does something like that exist already?
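To make the idea concrete, here is a rough sketch of what I mean by a "derivative" link. It is a toy in-memory model, not Camlistore's API: the `derivedFrom` and `derivedBy` attribute names, the helper functions, and the sample byte strings are all made up for illustration, and the `sha1-` prefix only imitates the look of a blobref.

```python
import hashlib

# Toy in-memory stand-in for a content-addressed blob store.
# This is NOT Camlistore's API; it only illustrates the data model.
blobs = {}


def put_blob(data: bytes) -> str:
    """Store data under its content hash and return the reference."""
    ref = "sha1-" + hashlib.sha1(data).hexdigest()
    blobs[ref] = data
    return ref


def make_derivative(source_ref: str, data: bytes, how: str) -> dict:
    """Record that `data` was derived from the blob at `source_ref`.

    "derivedFrom" and "derivedBy" are hypothetical attribute names.
    """
    return {
        "content": put_blob(data),
        "derivedFrom": source_ref,
        "derivedBy": how,
    }


def find_derivative(source_ref: str, how: str, records: list):
    """Return an existing derivative so expensive work (e.g. OCR) is not redone."""
    for record in records:
        if record["derivedFrom"] == source_ref and record["derivedBy"] == how:
            return record
    return None


# PDF -> page image -> OCR text, each step pointing back at its source.
pdf_ref = put_blob(b"%PDF-1.4 fake multi-page document")
page1 = make_derivative(pdf_ref, b"fake image bytes for page 1", "pdf-page-render")
ocr1 = make_derivative(page1["content"], b"recognized text of page 1", "ocr")
records = [page1, ocr1]
```

With something like this, a server that wants the text for an image would first call `find_derivative(image_ref, "ocr", records)` and only run OCR if nothing comes back, and a page image can always be followed back through `derivedFrom` to its parent PDF.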
On Thursday, March 9, 2017 at 9:44:40 PM UTC-8, Steve Armstrong wrote:
>
> I'm working a bit on the scanningcabinet port and have two data model
> questions that might be related, and might be generic enough for camlistore
> in general to be a single "dataSource" predicate:
>
> 1. How do I store a text blob created from an image
>
> OCR is expensive (maybe lots of CPU cycles locally, maybe I actually pay
> to pass it through a service) so I don't want the text stored in an index
> where each server must extract it. A full-text index could search through
> these blobs, but I want to only create the blob once across my network of
> camlistores. It's also lossy, so the canonical source is still the image,
> and that's where the permanode should be.
>
> 2. How do I store an image extracted from a PDF
>
> In this case, I'm pulling a PDF apart by generating an image for each
> page. The images have permanodes and tags, so they are their own object.
> They are lossy though, so I might need to refer to the parent PDF to read
> something properly.
>
>
> Is there a concept of "camlistore:dataSource" or something that I should
> be using?
>
