Hi,

I don't know how to solve your PDF problem within Nutch, but a simple
workaround might be to append "#search=searchterm(s)" to the URL of the PDF
(e.g. when searching for "test": http://xyz.com/foundPDF#search=test). This
opens the search box of Acrobat Reader. It doesn't work very well, but it is
better than nothing and a very quick patch...
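
For what it's worth, here is a minimal sketch of how such a link could be
built when rendering the search results. It is not Nutch code: the function
name pdf_search_url and its parameters are made up for illustration, and it
assumes the viewer honours the Acrobat-style "search" and "page" open
parameters in the URL fragment (other PDF viewers may ignore them). The
"page" part only helps if you already know the page number from somewhere.

```python
from urllib.parse import quote, urldefrag


def pdf_search_url(pdf_url, terms, page=None):
    """Return pdf_url with Acrobat-style open parameters in the fragment.

    Hypothetical helper: 'terms' is a list of query words, 'page' an
    optional 1-based page number.
    """
    # Drop any fragment that is already on the URL.
    base, _ = urldefrag(pdf_url)

    params = []
    if page is not None:
        params.append(f"page={int(page)}")
    if terms:
        # Percent-encode the words so spaces survive inside the fragment.
        params.append("search=" + quote(" ".join(terms)))

    # Without any parameters, return the plain URL unchanged.
    return base + "#" + "&".join(params) if params else base


print(pdf_search_url("http://xyz.com/foundPDF", ["test"]))
```

The search front end would call this with the words of the user's query for
every hit whose content type is PDF and use the result as the link target.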

Markus


Gilbert Groenendijk wrote:
> 
> Hi,
> 
> Currently I have two questions about the file format parsers. I would like
> to know how the PDF parser handles PDF files. Is it possible to split a PDF
> page by page, so that if you find a match on a specific page you can go to
> the matched page with something like #page=12? The other question is about
> content 'filtering'. What happens if I index a PowerPoint with the header
> 'CompanyName Presentation'? Basically the word 'Presentation' is irrelevant
> but the company name isn't. It is on every page, which gives me 'garbage'
> in the index. Does anyone have any thoughts about this? Thanks in advance.
> 
> -- 
> Gilbert
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Nutch-and-fileparsers.-tf3185913.html#a8843108
Sent from the Nutch - User mailing list archive at Nabble.com.


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
