Nutch and fileparsers.

Gilbert Groenendijk Wed, 07 Feb 2007 01:53:20 -0800

Hi,

I currently have two questions about the file-format parsers. First, I would
like to know how the PDF parser handles PDF files. Is it possible to split a
PDF page by page, so that when a match is found on a specific page the result
can link straight to that page, for example with #page=12?

The second question is about content 'filtering'. What happens if I index a
PowerPoint file with the header 'CompanyName Presentation'? The word
'Presentation' is basically irrelevant, but the company name isn't. The
header is on every page, which gives me 'garbage' in the index. Does anyone
have any thoughts about this? Thanks in advance.
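
For illustration only, the behaviour asked about could be sketched roughly as
below. This is not Nutch code and says nothing about what the Nutch parsers
actually do: the function names, the 0.8 threshold and the example URL are all
hypothetical, and the sketch assumes the text has already been extracted as
one string per page or slide. It drops whole lines that repeat on most pages
(such as a running header) and emits one record per page with a #page=N
anchor.

```python
from collections import Counter


def strip_repeated_lines(pages, min_fraction=0.8):
    """Drop lines that occur on at least min_fraction of the pages.

    Hypothetical helper: `pages` is a list of strings, one per page.
    """
    if len(pages) < 2:
        # With a single page nothing can be called "repeated".
        return list(pages)

    counts = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page.
        counts.update({ln.strip() for ln in page.splitlines() if ln.strip()})

    threshold = min_fraction * len(pages)
    repeated = {line for line, n in counts.items() if n >= threshold}

    cleaned = []
    for page in pages:
        kept = [
            ln.strip()
            for ln in page.splitlines()
            if ln.strip() and ln.strip() not in repeated
        ]
        cleaned.append("\n".join(kept))
    return cleaned


def page_records(base_url, pages):
    """Return (url_with_page_anchor, cleaned_text) pairs, pages numbered from 1."""
    cleaned = strip_repeated_lines(pages)
    return [(f"{base_url}#page={i}", text) for i, text in enumerate(cleaned, start=1)]


if __name__ == "__main__":
    slides = [
        "CompanyName Presentation\nIntro to crawling",
        "CompanyName Presentation\nParsing PDF files",
        "CompanyName Presentation\nIndex tuning",
    ]
    for url, text in page_records("http://example.com/doc.pdf", slides):
        print(url, "->", text)
```

Note that this removes the whole repeated line, company name included. Keeping
the company name searchable would need something extra, for example indexing
the header once per document rather than once per page.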


--
Gilbert
