I realized I didn't respond to another interesting bit from your second
message: the idea of marking objects as "derived" from others. I usually
reach for Dublin Core for this kind of thing. There seems to be an argument
for using
http://dublincore.org/documents/dcmi-terms/#terms-hasFormat
(the provenance then establishes whether the derived blob was produced by
OCR or entered by a human).

On Sun, Mar 19, 2017 at 12:53 AM, Eric Drechsel <[email protected]> wrote:

> The index is mostly used to cache extracted "shallow" metadata from file
> headers etc., and to normalize different metadata representations (EXIF GPS
> tags + Camlistore Location attrs).
>
> It's a poor fit for OCR'd text, not only because it's computationally
> expensive, but (relatedly) because results are generally "inexact" and may
> change depending on the extractor used (local Tesseract model, Google's OCR
> service, or a human reader) or even between runs (models get better with
> training data).
>
> I think this is ultimately a problem of provenance: "who said what when?".
> Thankfully Camlistore makes provenance first class: every attribute change
> is signed by its creator.
>
> The computer vision system could be given its own signing key and store,
> and the indexer configured to accept changes from both the user and this
> other source. If updated results become available, the CV system can update
> the record just like a human would.
>
> Anyway, right now you just need semantics to express "here is the text
> from this image". It sounds sort of like camliContentImage (
> https://camlistore.org/doc/schema/attributes), which is used to store
> thumbnails. How about camliContentText?
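>
> A sketch of what such a claim could look like (camliContentText is only a
> proposal here, and the blobrefs are placeholders):
>
> ```json
> {
>   "camliVersion": 1,
>   "camliType": "claim",
>   "claimType": "set-attribute",
>   "permaNode": "sha1-<image-permanode>",
>   "attribute": "camliContentText",
>   "value": "sha1-<extracted-text-blob>",
>   "claimDate": "2017-03-19T00:00:00-07:00"
> }
> ```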
>
> On Sat, Mar 18, 2017 at 10:34 PM, Steve Armstrong <
> [email protected]> wrote:
>
>> I guess what I'm really looking for is "derivative", since in the OCR
>> case, the text is a derivative of the image, and in the PDF case the image
>> is a derivative of a multi-page PDF. Does something like that exist already?
>>
>>
>> On Thursday, March 9, 2017 at 9:44:40 PM UTC-8, Steve Armstrong wrote:
>>>
>>> I'm working a bit on the scanningcabinet port and have two data-model
>>> questions. They might be related, and might be generic enough to
>>> Camlistore as a whole to warrant a single "dataSource" predicate:
>>>
>>> 1. How do I store a text blob created from an image?
>>>
>>> OCR is expensive (maybe lots of CPU cycles locally, maybe I actually pay
>>> to pass it through a service), so I don't want the text stored in an index
>>> where each server must re-extract it. A full-text index could search
>>> through these blobs, but I want to create the blob only once across my
>>> network of camlistores. OCR is also lossy, so the canonical source is
>>> still the image, and that's where the permanode should be.
>>>
>>> 2. How do I store an image extracted from a PDF?
>>>
>>> In this case, I'm pulling a PDF apart by generating an image for each
>>> page. The images have permanodes and tags, so they are their own objects.
>>> They are lossy, though, so I might need to refer to the parent PDF to
>>> read something properly.
>>>
>>>
>>> Is there a concept of "camlistore:dataSource" or something that I should
>>> be using?
>>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "Camlistore" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>
>
> --
> best, Eric
> eric.pdxhub.org <http://pdxhub.org/people/eric>
>
>


-- 
best, Eric
eric.pdxhub.org <http://pdxhub.org/people/eric>
