Re: [Nutch-general] Nutch and fileparsers.

Alan Tanaman Thu, 08 Feb 2007 00:54:12 -0800

Gilbert,

Regarding splitting documents up, might I suggest you take a look at a
couple of the threads (and all the responses) on the dev mailing lists?
http://www.mail-archive.com/[email protected]/msg05412.html
http://www.mail-archive.com/[email protected]/msg05374.html


Although these refer to the RSS parser, you could do something similar with
PDF or any other parser whose output should be split and indexed as
separate documents.  It would seem, however, to require a fair number of
changes to the Nutch code.
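
As a rough illustration of the idea, here is a standalone Python sketch. It
is not Nutch code and does not use any Nutch API. It assumes the text of each
page has already been extracted by some PDF or PowerPoint library, and the
function names and the example URL are made up for the illustration. It
emits one sub-document per page with a #page=N fragment, and drops lines
that repeat on every page (such as a slide header):

```python
from collections import Counter


def split_into_page_documents(base_url, page_texts):
    """Return one (url, text) pair per page, using a 1-based #page=N fragment."""
    return [
        (f"{base_url}#page={number}", text)
        for number, text in enumerate(page_texts, start=1)
    ]


def strip_repeated_lines(page_texts, min_fraction=1.0):
    """Drop lines that occur on at least min_fraction of the pages.

    With fewer than two pages nothing can be called "repeated",
    so the input is returned unchanged.
    """
    if len(page_texts) < 2:
        return list(page_texts)

    counts = Counter()
    for text in page_texts:
        # Count each distinct non-empty line once per page.
        counts.update({line.strip() for line in text.splitlines() if line.strip()})

    threshold = min_fraction * len(page_texts)
    repeated = {line for line, count in counts.items() if count >= threshold}

    return [
        "\n".join(
            line for line in text.splitlines() if line.strip() not in repeated
        )
        for text in page_texts
    ]


# Hypothetical input: two slides sharing the same header line.
pages = [
    "CompanyName Presentation\nQuarterly results",
    "CompanyName Presentation\nRoadmap",
]
cleaned = strip_repeated_lines(pages)
documents = split_into_page_documents("http://example.com/deck.pdf", cleaned)
```

Note that this removes the whole repeated line, company name included. If
the company name should stay searchable, it could be indexed once at the
document level instead of on every page.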

Best regards,
Alan
_________________________
Alan Tanaman
iDNA Solutions
http://blog.idna-solutions.com

-----Original Message-----
From: Gilbert Groenendijk [mailto:[EMAIL PROTECTED] 
Sent: 07 February 2007 09:53
To: [email protected]
Subject: Nutch and fileparsers.

Hi,

I currently have two questions about the file format parsers. First, I
would like to know how the PDF parser handles PDF files. Is it possible to
split a PDF page by page, so that if you find a match on a specific page,
you can go to the matched page with something like #page=12? The second
question is about content 'filtering'. What happens if I index a PowerPoint
with the header 'CompanyName Presentation'? Basically the word Presentation
is irrelevant but the CompanyName isn't. It is on every page, which gives me
'garbage' in the index. Does anyone have any thoughts about this? Thanks in
advance.

-- 
Gilbert


