The index is mostly used to cache "shallow" metadata extracted from file
headers and the like, and to normalize different metadata representations
(e.g. EXIF GPS tags and Camlistore Location attrs).

It's a poor fit for OCR'd text, not only because OCR is computationally
expensive, but (relatedly) because the results are inexact and may change
depending on the extractor used (a local Tesseract model, Google's OCR
service, or a human reader) or even between runs (models get better with
more training data).

I think this is ultimately a problem of provenance: "who said what when?".
Thankfully Camlistore makes provenance first class: every attribute change
is signed by its creator.

The computer vision system could be given its own signing key and store,
and the indexer could then be configured to accept changes from both the
user and this other source. If updated results become available, the CV
system can update the record just like a human would.

Anyway, right now you just need semantics to express "here is the text from
this image". It sounds sort of like camliContentImage
(https://camlistore.org/doc/schema/attributes), which is used to store
thumbnails. How about camliContentText?

On Sat, Mar 18, 2017 at 10:34 PM, Steve Armstrong <[email protected]> wrote:

> I guess what I'm really looking for is "derivative", since in the OCR
> case, the text is a derivative of the image, and in the PDF case the image
> is a derivative of a multi-page PDF. Does something like that exist already?
>
>
> On Thursday, March 9, 2017 at 9:44:40 PM UTC-8, Steve Armstrong wrote:
>>
>> I'm working a bit on the scanningcabinet port and have two data model
>> questions that might be related, and might be generic enough for camlistore
>> in general to be a single "dataSource" predicate:
>>
>> 1. How do I store a text blob created from an image
>>
>> OCR is expensive (maybe lots of CPU cycles locally, maybe I actually pay
>> to pass it through a service) so I don't want the text stored in an index
>> where each server must extract it. A full-text index could search through
>> these blobs, but I want to only create the blob once across my network of
>> camlistores. It's also lossy, so the canonical source is still the image,
>> and that's where the permanode should be.
>>
>> 2. How do I store an image extracted from a PDF
>>
>> In this case, I'm pulling a PDF apart by generating an image for each
>> page. The images have permanodes and tags, so they are their own object.
>> They are lossy though, so I might need to refer to the parent PDF to read
>> something properly.
>>
>>
>> Is there a concept of "camlistore:dataSource" or something that I should
>> be using?
>>
> --
> You received this message because you are subscribed to the Google Groups
> "Camlistore" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> For more options, visit https://groups.google.com/d/optout.
>



-- 
best, Eric
eric.pdxhub.org <http://pdxhub.org/people/eric>
