Hi, I am having a problem with certain PDF files and the fragment that is returned when a search is run. This seems to happen when the PDF has little or no text (just images).
For example, the following was the result of a search for "Insulation":

... Map 8 Noise Exclusion & Insulation Zones - DP47 C78C111C105C115C101C32C69C120C99C108C117C115C105C111C110C32C97C110C100C32C73C110C115C117C108C97C116C105C111C110C32C90C111C110C101C32C45C32C82C65C70C32C76C101C101C109C105C110C103 8 3 ...

The long character string is causing layout issues on my site, and I would like to simply remove it. Is there an easy way to do this via XSL, or a way to prevent it from being indexed in the first place?

Many thanks,
Ross

FYI - I am using nutch-0.8.1 and have updated the code to use PDFBox-0.7.3 in the hope that this would fix it, but I get the same results.

--
View this message in context: http://www.nabble.com/nutch-0.8.1---PDF-Fragment-problem-tf3389595.html#a9434973
Sent from the Nutch - User mailing list archive at Nabble.com.
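
For what it's worth, the long string looks like a run of "C" followed by a decimal character code (C78 C111 C105 C115 C101 reads as "Noise"), so one possible workaround is to strip such runs from the fragment text before it is displayed. The following is only a rough sketch in plain Java, not Nutch or PDFBox code: the class name FragmentCleaner, both method names, and the threshold of 8 consecutive codes are my own assumptions, and where to hook it in (summarizer, parse plugin, or the web front end) is left open.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch or PDFBox.
final class FragmentCleaner {

    // A run of 8 or more tokens shaped like "C" + 1-3 digits, e.g. "C78C111C105...".
    // The minimum of 8 is an arbitrary assumption, chosen to avoid touching
    // short legitimate tokens such as "C3" or "C12C4".
    private static final Pattern ENCODED_RUN =
            Pattern.compile("\\b(?:C\\d{1,3}){8,}\\b");

    // Removes encoded runs and collapses the whitespace left behind.
    static String stripEncodedRuns(String text) {
        String cleaned = ENCODED_RUN.matcher(text).replaceAll("");
        return cleaned.replaceAll("\\s{2,}", " ").trim();
    }

    // Decodes a run back to text, treating each number as a character code.
    // Only useful for checking what a run actually contains.
    static String decode(String run) {
        StringBuilder sb = new StringBuilder();
        for (String code : run.split("C")) {
            if (!code.isEmpty()) {
                sb.append((char) Integer.parseInt(code));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String fragment =
                "Insulation Zones - DP47 C78C111C105C115C101C32C69C120C99C108 8 3";
        System.out.println(stripEncodedRuns(fragment));
        System.out.println(decode("C78C111C105C115C101"));
    }
}
```

This only hides the symptom at display time. The encoded text would still be in the index, so preventing it from being indexed would need the same filtering (or a fix) at parse time instead.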