I realized I didn't respond to another interesting bit from your second message: the idea of marking objects as "derived" from others. I usually reach for Dublin Core for this kind of thing. Seems like there could be an argument for using dcterms:hasFormat (http://dublincore.org/documents/dcmi-terms/#terms-hasFormat); the provenance then establishes it as the result of OCR vs. input by a human.
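To make either option concrete, here's roughly what a signed set-attribute claim could look like. This is just a sketch: the blob refs are placeholders, camliContentText is only the attribute name proposed below (it isn't in the schema today), and the CV worker's signing key is hypothetical.

```json
{
  "camliVersion": 1,
  "camliType": "claim",
  "camliSigner": "sha1-xxxxxxxx(ocr-worker-key)",
  "claimDate": "2017-03-19T00:53:00Z",
  "claimType": "set-attribute",
  "permaNode": "sha1-xxxxxxxx(image-permanode)",
  "attribute": "camliContentText",
  "value": "sha1-xxxxxxxx(ocr-text-blob)"
}
```

For the Dublin Core route, the same claim shape would work with "attribute" set to the full term URI, e.g. "http://purl.org/dc/terms/hasFormat", on the source image's permanode; since the claim is signed by the OCR worker's own key, the provenance question ("who said this text came from that image?") is answered by the signature rather than by the attribute name.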
On Sun, Mar 19, 2017 at 12:53 AM, Eric Drechsel <[email protected]> wrote:

> The index is mostly used to cache extracted "shallow" metadata from file
> headers etc., and to normalize different metadata representations (EXIF GPS
> tags + Camlistore Location attrs).
>
> It's a poor fit for OCR'd text, not only because it's computationally
> expensive, but (relatedly) because results are generally "inexact" and may
> change depending on the extractor used (local Tesseract model, Google's OCR
> service, or a human reader) or even between runs (models get better with
> training data).
>
> I think this is ultimately a problem of provenance: "who said what when?"
> Thankfully, Camlistore makes provenance first class: every attribute change
> is signed by its creator.
>
> The computer vision system could be given its own signing key and store;
> you'd then just configure the indexer to accept changes from both the user
> and this other source. If updated results become available, the CV system
> can update the record just like a human would.
>
> Anyway, right now you just need semantics to express "here is the text
> from this image". It sounds sort of like camliContentImage
> (https://camlistore.org/doc/schema/attributes), which is used to store
> thumbnails. How about camliContentText?
>
> On Sat, Mar 18, 2017 at 10:34 PM, Steve Armstrong <[email protected]> wrote:
>
>> I guess what I'm really looking for is "derivative", since in the OCR
>> case the text is a derivative of the image, and in the PDF case the image
>> is a derivative of a multi-page PDF. Does something like that exist
>> already?
>>
>> On Thursday, March 9, 2017 at 9:44:40 PM UTC-8, Steve Armstrong wrote:
>>>
>>> I'm working a bit on the scanningcabinet port and have two data-model
>>> questions that might be related, and might be generic enough to
>>> Camlistore in general to warrant a single "dataSource" predicate:
>>>
>>> 1. How do I store a text blob created from an image?
>>>
>>> OCR is expensive (maybe lots of CPU cycles locally, maybe I actually pay
>>> to pass it through a service), so I don't want the text stored in an
>>> index where each server must extract it. A full-text index could search
>>> through these blobs, but I want to create the blob only once across my
>>> network of Camlistores. It's also lossy, so the canonical source is
>>> still the image, and that's where the permanode should be.
>>>
>>> 2. How do I store an image extracted from a PDF?
>>>
>>> In this case, I'm pulling a PDF apart by generating an image for each
>>> page. The images have permanodes and tags, so they are their own
>>> objects. They are lossy, though, so I might need to refer to the parent
>>> PDF to read something properly.
>>>
>>> Is there a concept of "camlistore:dataSource" or something like it that
>>> I should be using?
>>
> --
> best, Eric
> eric.pdxhub.org <http://pdxhub.org/people/eric>

--
best, Eric
eric.pdxhub.org <http://pdxhub.org/people/eric>

--
You received this message because you are subscribed to the Google Groups
"Camlistore" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to [email protected].
For more options, visit https://groups.google.com/d/optout.
