[EMAIL PROTECTED] wrote:
Attached is the patch. A contentType is introduced in FetcherOutput,
together with modifications in OutputThread.java, IndexSegment.java
and Fetcher.java (Fetcher.java is no longer used?). Also tweaked are
cached.jsp and search.jsp. As of now, Nutch will fetch/index text/plain
besides text/html. ContentType will be displayed in search results too.

Overall this looks great.


I have a few concerns:

1. I think that in the fetcher, when the content-type is null, we should assume that it is text/html. In other words, text/html is the default content type. Perhaps this should be configurable, but that should be the default behaviour. It lets us get lots of content we'd otherwise miss.
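To make point 1 concrete, here is a minimal sketch of the fallback. The class and method names (ContentTypes, effectiveType) are hypothetical and are not part of the patch or of Nutch; treating a blank header the same as a missing one is also an assumption.

```java
// Hypothetical helper, for illustration only; not an existing Nutch class.
final class ContentTypes {
    // Proposed default when the server sends no Content-Type header.
    static final String DEFAULT_TYPE = "text/html";

    // Returns the header value, or defaultType when the header is
    // null or blank (the blank case is an assumption of this sketch).
    static String effectiveType(String header, String defaultType) {
        if (header == null || header.trim().isEmpty()) {
            return defaultType;
        }
        return header.trim();
    }
}
```

Passing the default in as a parameter keeps it configurable, while a caller that has no configuration can simply pass ContentTypes.DEFAULT_TYPE.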

2. I'm confused by your changes in IndexSegment. It seems to me that text conversion should be done before this point. So, if a page has a content-type of text/plain, then its text should be stored in fetcherText at fetch time, and no content-type logic should be needed at index time. Does that make sense?
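A rough sketch of what point 2 suggests: pick the text conversion by content type once, at fetch time, so the indexer only ever sees plain text. Everything here is hypothetical (the FetchTimeText class, the UTF-8 assumption, and the crude regex tag stripper, which merely stands in for whatever HTML parsing the fetcher really does).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; not the actual fetcher code.
final class FetchTimeText {
    // Returns the text to store alongside the fetched page,
    // or null when the content type is not supported.
    static String extractText(String contentType, byte[] raw) {
        // Assumption: content is UTF-8; real code would honor the charset.
        String body = new String(raw, StandardCharsets.UTF_8);
        if ("text/plain".equals(contentType)) {
            return body; // plain text needs no conversion
        }
        if ("text/html".equals(contentType)) {
            return stripTags(body);
        }
        return null;
    }

    // Placeholder for real HTML parsing: drops tags, collapses whitespace.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }
}
```

With the conversion done here, the index step can read the stored text without looking at the content type at all.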

3. In cached.jsp, what we really want to do is set the content-type header to be what we got when we fetched the page, then use the raw content as its body. I don't think we can implement this correctly for, e.g., PDF, from a JSP page. We need to write a servlet for that. Perhaps the general way to handle this is to have cached.jsp create a frameset, and have the body frame filled in by a servlet that sets the content type and includes the raw content. In the short term, your hack-upon-a-hack is probably okay here. But, once we start to add more content types, this will need to be addressed.
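The core of such a servlet might look roughly like the sketch below. To keep it compilable without the servlet API, CachedResponse is a hypothetical stand-in for the two HttpServletResponse methods involved (setContentType and getOutputStream); CachedContentWriter is likewise an invented name.

```java
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical stand-in for the parts of HttpServletResponse used here.
interface CachedResponse {
    void setContentType(String type);
    OutputStream getOutputStream() throws IOException;
}

// Hypothetical illustration of the servlet body for the cached-page frame.
final class CachedContentWriter {
    // Echoes the content type recorded at fetch time, then writes the
    // raw bytes unmodified, so binary formats such as PDF survive.
    static void write(String fetchedType, byte[] raw, CachedResponse response)
            throws IOException {
        response.setContentType(fetchedType);
        OutputStream out = response.getOutputStream();
        out.write(raw);
        out.flush();
    }
}
```

Writing bytes to the output stream, rather than characters to a writer as a JSP page does, is what lets non-text content through intact.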

Doug


_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
