Hi Dan,

First, it's worth noting there are two general "buckets" that PDFs fall into:

1. PDFs generated from text-based digital content (e.g. Word docs, etc.)
2. PDFs generated from scanned physical content

In the latter case (scanned content), the PDF's pages are images only, unless you run OCR on them. The act of running them through OCR software enhances the PDF so that it also contains the OCR'd full text alongside the scanned images.

DSpace is only able to extract the full text of a PDF if there's textual content that already exists in the PDF. DSpace does not have any built-in OCR capabilities.

Therefore, if you have any PDFs which are image-only, they would need to be run through OCR if you want to be able to index & search them in DSpace. (Hint: if you cannot select/copy text in the PDF then it's an image-only PDF and needs to be run through OCR.)

- Tim

On 10/2/2012 2:43 PM, Daniel Sifton wrote:
> Hi
>
> We’re running Dspace 1.8.2, and are considering implementing full text
> indexing for our pdf content. I see the discussion on configuring media
> filters at:
>
> https://wiki.duraspace.org/display/DSDOC18/Configuration#Configuration-ConfiguringMediaFilters
>
> Can someone tell me if I need to prep these documents first by running
> them through some kind of OCR software? The documentation tells me “the
> PDF Media Filter will extract textual content from PDF bitstream” which
> makes me think the OCR step isn’t necessary . . . or maybe I’m dreaming?
>
> Thanks,
>
> Dan

_______________________________________________
Dspace-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspace-general
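P.S. For a large collection, the select/copy hint can be roughly automated. The following is only a minimal sketch, not anything DSpace itself provides: it assumes the third-party pypdf library is installed, and the function names (pages_without_text, needs_ocr, extract_page_texts) are made up for illustration. A page with no extractable text is treated as a likely scanned image that would need OCR before DSpace could index it.

```python
from typing import Iterable, List


def pages_without_text(page_texts: Iterable[str], min_chars: int = 1) -> List[int]:
    """Return 1-based page numbers whose extracted text has fewer than
    min_chars non-whitespace-trimmed characters (likely image-only pages)."""
    missing = []
    for number, text in enumerate(page_texts, start=1):
        if len((text or "").strip()) < min_chars:
            missing.append(number)
    return missing


def needs_ocr(page_texts: Iterable[str], min_chars: int = 1) -> bool:
    """True if at least one page appears to have no text layer."""
    return bool(pages_without_text(page_texts, min_chars))


def extract_page_texts(path: str) -> List[str]:
    """Extract the text of each page of a PDF file.

    Assumption: pypdf is installed (pip install pypdf). It is imported
    here so the helpers above work without it.
    """
    from pypdf import PdfReader

    reader = PdfReader(path)
    return [page.extract_text() or "" for page in reader.pages]
```

With a hypothetical file name, a check would look like needs_ocr(extract_page_texts("scanned-thesis.pdf")). This is a heuristic only: a page that is legitimately blank will also be flagged, and raising min_chars makes the check stricter for pages that carry only a stray page number.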
