For archival and support purposes, below is an email exchange I had with Tim 
regarding Discovery search results for PDFs that contain unclean OCR text.  I 
accidentally replied to him directly rather than to the list.

--

Tim,

Is there a way to "hide" blurbs where OCR data isn't clean and presents 
artifacts in Discovery search results?  To rephrase, is there some sort of 
identifier or flag that can be set so that only specific Items don't return 
text?

Our Digital Initiatives group just completed a huge project, but the documents 
were originally OCRed by an outside agent.  I'm seeing artifacts in some PDFs, 
as well, where alongside text there appears to be document structure 
information. The result is question marks, diamond characters, etc. 


Regards,

-Jeff 

--

Jeff,

You can use the MediaFilter's "skip mode" to skip indexing specific PDFs (if 
the indexing process isn't working right or is pulling out invalid text).

The skip mode can even take in a file which has a comma-separated list of 
handles to skip.

See the docs at 
https://wiki.duraspace.org/display/DSDOC5x/Mediafilters+for+Transforming+DSpace+Content#MediafiltersforTransformingDSpaceContent-Executing%28viaCommandLine%29


- Tim
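
To illustrate the skip-list approach Tim describes, here is a minimal Python 
sketch. It keeps the handles to skip in a text file as one comma-separated 
line and then assembles a filter-media call from it. The handles, the file 
name, and the /dspace/bin/dspace path are placeholders, and it assumes skip 
mode is invoked as "filter-media -s <list>"; please check the linked docs for 
your DSpace version before relying on it. The script only prints the command 
and does not run it.

```python
from pathlib import Path
import tempfile


def write_skip_list(handles, path):
    """Write handles to `path` as one comma-separated line without spaces."""
    cleaned = [h.strip() for h in handles if h.strip()]
    Path(path).write_text(",".join(cleaned), encoding="utf-8")
    return cleaned


def build_filter_media_command(dspace_bin, skip_file):
    """Build an argument list for filter-media in skip mode.

    Assumption: skip mode is selected with "-s" followed by the
    comma-separated handles. Verify this against the MediaFilter docs.
    """
    skip_arg = Path(skip_file).read_text(encoding="utf-8").strip()
    return [dspace_bin, "filter-media", "-s", skip_arg]


if __name__ == "__main__":
    # Placeholder handles for Items whose PDFs have unclean OCR text.
    bad_ocr_handles = ["123456789/9", "123456789/100"]
    with tempfile.TemporaryDirectory() as tmp:
        skip_file = Path(tmp) / "skip-list.txt"
        write_skip_list(bad_ocr_handles, skip_file)
        # "/dspace/bin/dspace" is a placeholder for [dspace]/bin/dspace.
        command = build_filter_media_command("/dspace/bin/dspace", skip_file)
        print(" ".join(command))
```

Keeping the list in a file makes it easier to maintain as more problem PDFs 
turn up; the printed command can then be reviewed and run by hand.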
