Hi Nutch User List,

I was wondering whether anyone has used Nutch to fetch, parse and index only documents of types other than HTML (e.g. PDF, MS Word, MS Excel)?
I've been looking into ways of implementing this. My initial idea was to disable the HTML MIME type in Nutch so that this type of content is ignored. However, it quickly dawned on me that if I don't fetch the HTML pages, I won't be able to get the links to the other documents on the websites specified in my urls-nutch.txt file.

The only other option I thought of was to index HTML along with all the other document types, but exclude the HTML MIME type from any search results. I guess this would give me the flexibility of including HTML at a later stage, but it leaves me with an index that is much larger than it needs to be.

Is there a way of excluding HTML that I am missing? Does anyone have experience of doing something like this, or an opinion they would like to share?

Thanks in advance,
Jim
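
P.S. To make clearer what I am after, here is a rough sketch of the behaviour I have in mind: HTML pages are still fetched and parsed for their outlinks, but only non-HTML documents are passed on to the index. This is plain Python for illustration only, not Nutch code; the function names and the (url, content type, outlinks) tuples are made up, and I am assuming the content type reported for each fetched document is reliable.

```python
# Illustrative sketch only: none of these names come from Nutch.
# Assumption: every fetched document reports a usable Content-Type.
HTML_TYPES = {"text/html", "application/xhtml+xml"}


def normalize_mime(content_type):
    """Drop parameters such as '; charset=utf-8' and lowercase the type."""
    return content_type.split(";", 1)[0].strip().lower()


def is_html(content_type):
    return normalize_mime(content_type) in HTML_TYPES


def crawl_step(fetched):
    """Split one round of fetched documents into URLs to index and URLs to fetch next.

    fetched: iterable of (url, content_type, outlinks) tuples (hypothetical shape).
    """
    to_index, to_fetch = [], []
    for url, content_type, outlinks in fetched:
        if is_html(content_type):
            # HTML is only used as a source of links, never indexed.
            to_fetch.extend(outlinks)
        else:
            # PDF, Word, Excel and so on go to the index.
            to_index.append(url)
    return to_index, to_fetch


pages = [
    ("http://example.com/", "text/html; charset=UTF-8", ["http://example.com/a.pdf"]),
    ("http://example.com/a.pdf", "application/pdf", []),
]
to_index, to_fetch = crawl_step(pages)
```

In other words, the filtering would happen at indexing time rather than at fetch time, so the crawl can still discover links while the index stays small.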
