Hi, Mike,

The current Nutch CVS has these content parsers:

parse-html
parse-mp3
parse-msword
parse-pdf
parse-rtf
parse-text

It also has parse-ext, which makes it possible to parse content using an
external program (less efficient, of course).
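
To give a rough idea of what parsing through an external program involves,
here is a minimal sketch. It is not the actual parse-ext code: the class
name, the method and the use of "cat" as a stand-in converter are all
assumptions made for illustration. It hands the raw content to a command
and collects whatever the command prints. A process is started for every
document, which is where the extra cost comes from.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical illustration only; not part of Nutch.
public class ExternalParserSketch {

    /** Runs an external converter and returns whatever it prints to stdout. */
    static String extractText(List<String> command, byte[] content)
            throws IOException, InterruptedException {
        // Stage the raw content in a temp file so the child can read it as stdin.
        Path input = Files.createTempFile("content-", ".bin");
        try {
            Files.write(input, content);
            Process process = new ProcessBuilder(command)
                    .redirectInput(input.toFile())
                    .redirectError(ProcessBuilder.Redirect.DISCARD)
                    .start();
            // Drain stdout fully before waiting, so the child cannot block on a full pipe.
            byte[] output = process.getInputStream().readAllBytes();
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new IOException("external parser exited with code " + exitCode);
            }
            // Assumes the converter emits UTF-8 text.
            return new String(output, StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(input);
        }
    }

    public static void main(String[] args) throws Exception {
        // "cat" stands in for a real converter; it simply echoes its input.
        String text = extractText(List.of("cat"), "hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(text);
    }
}
```

In practice the command would be a real converter for the given type, and
a timeout on the child process would be worth adding.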

We are always in need of good content parsers. You can help by
(1) improving the existing ones through using them
(2) writing new ones.
There are hundreds of MIME types, and Nutch still lacks parsers
for many important ones. As more web content is in multimedia formats,
it becomes increasingly important for Nutch to be able to parse multimedia
types. Interested?

John

On Tue, Dec 07, 2004 at 11:39:02AM -0500, Mike Richmond wrote:
> To Whom It May Concern:
> 
> I am a Java developer looking to get involved with a project.  I came across
> your site and noticed that there is a lot of attention paid to PDF parsing.
> I'm curious why PDF file parsing has not yet been added to Nutch.  There
> seem to be a number of open source (GPL'd) PDF parsers:
> 
> PDFBox (http://pdfbox.org/)
> 
> XPDF (http://www.foolabs.com/xpdf/)
> 
> Pdftohtml (http://pdftohtml.sourceforge.net/)
> 
> Etc.
> 
> Is there a reason that these are not used, or are you just waiting for
> someone to implement it?
> 
> Regards,
> 
> Mike Richmond
> 


_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
