tika config for PDF crawling

Chris Mattmann Wed, 03 Oct 2018 10:09:02 -0700

From: bineesh k <bineesh13...@gmail.com>
Date: Wednesday, October 3, 2018 at 12:37 AM
To: "dev-ow...@tika.apache.org" <dev-ow...@tika.apache.org>
Subject: Solr/Nutch/Tika config for PDF crawling

 

Hello Tika Team, 

 

We need help with our Solr/Nutch setup for crawling PDF pages.

 

We are using Nutch 1.15 and Solr 7.3.1 for our setup. We passed the Tika
details in the nutch-site.xml file and can crawl the PDF pages and index them
in Solr successfully.

 

The current issue is that the title and description fields are missing for the
indexed PDF pages. Is there a way to fix this? If not, can we take the first
couple of lines from the content field and add them to the title field?
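
The fallback described above could be sketched as follows. This is only an
illustration in Python of the "first couple of lines as title" idea, assuming
the indexed document is available as a plain dict with a "content" string. It
is not a Nutch or Solr API: the function name title_from_content, its
parameters max_lines and max_chars, and the sample document are hypothetical.

```python
def title_from_content(doc, max_lines=2, max_chars=100):
    """Return the doc's title, or a fallback built from its first content lines.

    Hypothetical helper for illustration; not part of Nutch, Tika, or Solr.
    """
    title = (doc.get("title") or "").strip()
    if title:
        # A real title was extracted, so keep it unchanged.
        return title
    content = doc.get("content") or ""
    # Collapse runs of whitespace inside each line, then drop empty lines.
    lines = [" ".join(line.split()) for line in content.splitlines()]
    lines = [line for line in lines if line][:max_lines]
    # Join the first few lines and cap the length of the result.
    return " ".join(lines)[:max_chars].rstrip()


# Sample document (made up) with no title field.
doc = {
    "url": "http://example.com/a.pdf",
    "content": "Annual Report 2018\n\n  Prepared by the finance team\nPage 1 ...",
}
doc["title"] = title_from_content(doc)
print(doc["title"])  # Annual Report 2018 Prepared by the finance team
```

In a real setup this logic would have to run somewhere in the indexing path
(for example in a custom indexing step), which is outside the scope of this
sketch.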

 

The fields below are indexed in Solr for PDF pages:

 

        "date"
        "type":["application/pdf",
          "application",
          "pdf"],
        "url":
        "content":
        "tstamp":
        "digest":
        "host":
        "boost":
        "contentLength":
        "id":"
        "lastModified":
        "lang":
        "host_str":
        "url_str":
        "lang_str":["en"],
        "digest_str":
        "_version_":1613120835557523457,
        "content_str":,
        "type_str":["application",
          "application/pdf",
          "pdf"]}]

  

 

Thanks in advance for your help on this.

 

 

Regards,

Bineesh k
