Hi Dan,

First, it's worth noting there are two general "buckets" that PDFs fall 
into:

1. PDFs generated from text-based digital content (e.g. Word docs)
2. PDFs generated from scanned physical content

In the latter case (scanned content), the PDF's pages are images only, 
unless you run OCR on them. Running them through OCR software adds a 
text layer to the PDF, so that it contains the OCR'd full text alongside 
the scanned images.

DSpace is only able to extract the full text of a PDF if there's textual 
content that already exists in the PDF. DSpace does not have any 
built-in OCR capabilities.

Therefore, if you have any PDFs which are image-only, they would need to 
be run through OCR if you want to be able to index & search them in 
DSpace. (Hint: if you cannot select/copy text in the PDF then it's an 
image-only PDF and needs to be run through OCR.)
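
If you have too many PDFs to check by hand, the select/copy test above 
can be roughly automated. The following is only a sketch, not anything 
DSpace ships: it assumes the `pdftotext` command-line tool (from Poppler 
or Xpdf) is installed and on your PATH, and the function names and the 
`min_chars` threshold of 20 are arbitrary choices made for illustration.

```python
import subprocess


def extract_text(pdf_path):
    """Return the text layer of a PDF using the external `pdftotext` tool.

    Assumption: `pdftotext` (Poppler/Xpdf) is installed. Passing "-" as the
    output file makes it write the extracted text to stdout.
    """
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def split_pages(pdftotext_output):
    """Split pdftotext output into per-page strings.

    pdftotext ends each page with a form feed character, so the final
    empty piece after the last form feed is dropped.
    """
    pages = pdftotext_output.split("\f")
    if pages and pages[-1] == "":
        pages.pop()
    return pages


def image_only_pages(page_texts, min_chars=20):
    """Return 1-based numbers of pages with almost no extractable text.

    A page counts as "image-only" when it has fewer than `min_chars`
    non-whitespace characters. The threshold is a guess; tune it for
    your own collection.
    """
    flagged = []
    for number, text in enumerate(page_texts, start=1):
        visible = "".join(text.split())  # strip all whitespace
        if len(visible) < min_chars:
            flagged.append(number)
    return flagged


def pdf_needs_ocr(pdf_path, min_chars=20):
    """Return True if every page of the PDF looks image-only."""
    pages = split_pages(extract_text(pdf_path))
    return len(image_only_pages(pages, min_chars)) == len(pages)
```

Calling `pdf_needs_ocr("some-file.pdf")` on each file before ingest would 
give you a list of candidates for OCR. Treat the result as a hint only: 
blank pages and pages with just a few words will also be flagged.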

- Tim

On 10/2/2012 2:43 PM, Daniel Sifton wrote:
> Hi
>
> We’re running DSpace 1.8.2, and are considering implementing full text
> indexing for our PDF content. I see the discussion on configuring media
> filters at:
>
> https://wiki.duraspace.org/display/DSDOC18/Configuration#Configuration-ConfiguringMediaFilters
>
> Can someone tell me if I need to prep these documents first by running
> them through some kind of OCR software? The documentation tells me “the
> PDF Media Filter will extract textual content from PDF bitstream” which
> makes me think the OCR step isn’t necessary . . . or maybe I’m dreaming?
>
> Thanks,
>
> Dan
>
>
>
> ------------------------------------------------------------------------------
> Don't let slow site performance ruin your business. Deploy New Relic APM
> Deploy New Relic app performance management and know exactly
> what is happening inside your Ruby, Python, PHP, Java, and .NET app
> Try New Relic at no cost today and get our sweet Data Nerd shirt too!
> http://p.sf.net/sfu/newrelic-dev2dev
>
>
>
> _______________________________________________
> Dspace-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/dspace-general
>
