Hi Jan,

If the record is already being indexed by Google, then Google should already be 
aware of the PDF, and there's not much DSpace can do to force Google to full 
text index it.  That said, it's worth noting there are two main types of 
PDFs, and only one of them is easily indexed:

  *   PDFs created from digital files or OCRed images.  These PDFs have 
embedded text and are more easily full text indexed.
  *   PDFs created from scanned files (without OCR). These are image-based PDFs 
with no embedded text, and they often cannot be full text indexed unless the 
system that grabs the PDF is able to OCR it reliably and automatically.

So, if the PDFs you are talking about were created from scanned images, then 
make sure to OCR them so that they are easier to index.
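
As a rough way to tell the two types apart, here is a minimal Python sketch. 
It assumes the third-party pypdf library is installed; the function names and 
the 20-character threshold are only illustrative and are not something DSpace 
provides.

```python
from typing import Iterable, List


def extract_page_texts(path: str) -> List[str]:
    """Return the embedded text of each page (empty string if none).

    Assumption: the third-party `pypdf` package is installed.
    """
    from pypdf import PdfReader  # imported here so the helper below works without it

    reader = PdfReader(path)
    return [page.extract_text() or "" for page in reader.pages]


def looks_image_only(page_texts: Iterable[str], min_chars: int = 20) -> bool:
    """Heuristic: treat the PDF as image-based if almost no text was extracted.

    `min_chars` is an arbitrary illustrative threshold, not a standard value.
    """
    total = sum(len(text.strip()) for text in page_texts)
    return total < min_chars


if __name__ == "__main__":
    import sys

    for pdf_path in sys.argv[1:]:
        verdict = "needs OCR?" if looks_image_only(extract_page_texts(pdf_path)) else "has text"
        print(f"{pdf_path}: {verdict}")
```

This is only a heuristic: a scanned PDF with a poor OCR layer can still pass 
the check, so it is worth spot-checking a few files by hand.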

DSpace also provides some other hints/tips about Search Engine Optimization, 
which you may want to review for your repository: 
https://wiki.lyrasis.org/display/DSDOC5x/Search+Engine+Optimization

If you have other questions, let us know on this list.

Tim

________________________________
From: [email protected] <[email protected]> on 
behalf of Jan Skůpa <[email protected]>
Sent: Friday, September 24, 2021 2:53 AM
To: DSpace Community <[email protected]>
Subject: [dspace-community] fulltext indexing PDF files in Google search

Hi,
I found that most of the PDFs in our DSpace (5.3) are not fully searchable via 
Google. The records are indexed, but the phrases from the PDF are not found. Is 
it possible that there is a bug in the settings somewhere? Should this work? 
Thanks!

--
All messages to this mailing list should adhere to the Code of Conduct: 
https://www.lyrasis.org/about/Pages/Code-of-Conduct.aspx
---
You received this message because you are subscribed to the Google Groups 
"DSpace Community" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/dspace-community/c3b24342-a0ef-4946-9576-6ae2b32c55ffn%40googlegroups.com.
