Gilbert,

Regarding splitting documents up, might I suggest you take a look at a couple of the threads (and all the responses) on the dev mailing lists?

http://www.mail-archive.com/[email protected]/msg05412.html
http://www.mail-archive.com/[email protected]/msg05374.html

Although these refer to the RSS parser, you could do something similar with PDF or any other parser that produces documents to be split and indexed as separate documents. It would seem, however, to require a fair number of changes to the Nutch code.

Best regards,
Alan
_________________________
Alan Tanaman
iDNA Solutions
http://blog.idna-solutions.com

-----Original Message-----
From: Gilbert Groenendijk [mailto:[EMAIL PROTECTED]
Sent: 07 February 2007 09:53
To: [email protected]
Subject: Nutch and fileparsers.

Hi,

Currently I have two questions about the file format parsers.

First, I would like to know how the PDF parser handles PDF files. Is it possible to split a PDF page by page, so that if you find a match on a specific page you can go straight to the matched page, for example with #page=12?

The other question is about content 'filtering'. What happens if I index a PowerPoint file with the header 'CompanyName Presentation'? The word 'Presentation' is irrelevant, but the company name isn't. The header is on every page, which gives me 'garbage' in the index. Does anyone have any thoughts about this?

Thanks in advance.

--
Gilbert

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
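
To make the two ideas in this thread concrete (one index entry per page, reachable via #page=N, and dropping a header that repeats on every page), here is a minimal standalone sketch. It is not Nutch code and does not use the Nutch parser API. It assumes the text of each page has already been extracted by some parser. The function name `split_into_page_docs`, the `min_share` parameter, the record fields, and the example URL are all hypothetical, chosen only for illustration.

```python
from collections import Counter


def split_into_page_docs(base_url, page_texts, min_share=1.0):
    """Turn per-page text into one index record per page.

    Lines that occur on at least `min_share` of the pages (for example a
    slide header repeated everywhere) are dropped from every page.
    All names here are illustrative; none of this is Nutch API.
    """
    # Count on how many pages each distinct non-empty line occurs.
    line_pages = Counter()
    for text in page_texts:
        distinct = {line.strip() for line in text.splitlines() if line.strip()}
        for line in distinct:
            line_pages[line] += 1

    # Only treat lines as repeated when there is more than one page.
    threshold = min_share * len(page_texts)
    repeated = {
        line for line, count in line_pages.items()
        if len(page_texts) > 1 and count >= threshold
    }

    docs = []
    for number, text in enumerate(page_texts, start=1):
        kept = [
            line.strip() for line in text.splitlines()
            if line.strip() and line.strip() not in repeated
        ]
        docs.append({
            "url": f"{base_url}#page={number}",  # fragment points at the page
            "page": number,
            "text": "\n".join(kept),
        })
    return docs


if __name__ == "__main__":
    pages = [
        "CompanyName Presentation\nQuarterly results",
        "CompanyName Presentation\nRoadmap for next year",
    ]
    for doc in split_into_page_docs("http://example.com/slides.pdf", pages):
        print(doc["url"], "->", doc["text"])
```

Note that this drops the whole repeated line, so 'CompanyName' disappears from the page text along with 'Presentation'. If the company name should stay searchable, one option would be to store it once in a document-level field instead of on every page.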
