Hi Mike,

Current Nutch CVS has these content parsers:

  parse-html
  parse-mp3
  parse-msword
  parse-pdf
  parse-rtf
  parse-text

It also has parse-ext, which makes it possible to parse content using an
external program (less efficient, of course).

We are always in need of good content parsers. You can help by
(1) improving the existing ones through using them, or
(2) writing new ones.

There are hundreds of MIME types, and Nutch still lacks parsers for many
important ones. As more web content is published in multimedia formats, it
becomes increasingly important for Nutch to be able to parse multimedia
types.

Interested?

John

On Tue, Dec 07, 2004 at 11:39:02AM -0500, Mike Richmond wrote:
> To Whom It May Concern:
>
> I am a Java developer looking to get involved with a project. I came across
> your site and noticed that there is a lot of attention paid to PDF parsing.
> I'm curious why PDF file parsing has not yet been added to Nutch. There
> seem to be a number of open source (GPL'd) PDF parsers:
>
> PDFBox (http://pdfbox.org/)
> XPDF (http://www.foolabs.com/xpdf/)
> Pdftohtml (http://pdftohtml.sourceforge.net/)
> Etc.
>
> Is there a reason that these are not used, or are you just waiting for
> someone to implement it?
>
> Regards,
>
> Mike Richmond

_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
