Hi, Doug,

On Tue, Mar 02, 2004 at 10:53:34AM -0800, Doug Cutting wrote:
> [EMAIL PROTECTED] wrote:
> >Attached is the patch. A contentType is introduced in FetcherOutput,
> >together with modifications in OutputThread.java, IndexSegment.java
> >and Fetcher.java (Fetcher.java is no longer used?). Also tweaked are
> >cached.jsp and search.jsp. As now, nutch will fetch/index text/plain
> >besides text/html. ContentType will be displayed in search results too.
>
> Overall this looks great.
>
> I have a few concerns:
>
> 1. I think that in the fetcher, when the content-type is null, we should
> assume that it is text/html. In other words, text/html is the default
> content type. Perhaps this should be configurable, but that should be
> the default behaviour. It lets us get lots of content we'd otherwise miss.
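
To make sure we mean the same rule, here is a rough sketch of the defaulting behaviour described in point 1. It is not taken from the patch: the class name ContentTypeDefault, the constant DEFAULT_CONTENT_TYPE and the method effectiveContentType are all hypothetical names used only for illustration, and it assumes the content type is available as a plain String that may be null.

```java
public class ContentTypeDefault {
    // Hypothetical constant; the thread only says text/html should be the default.
    static final String DEFAULT_CONTENT_TYPE = "text/html";

    // Returns the content type reported by the server, or the default
    // when the server did not report one (null).
    static String effectiveContentType(String reported) {
        if (reported == null) {
            return DEFAULT_CONTENT_TYPE;
        }
        return reported;
    }

    public static void main(String[] args) {
        // A response without a content-type header falls back to the default.
        System.out.println(effectiveContentType(null));
        // A reported content type is passed through unchanged.
        System.out.println(effectiveContentType("text/plain"));
    }
}
```

If the default should be configurable, the constant could presumably be read from configuration instead; that part is left out of the sketch.
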
I will make text/html the default in
./src/java/net/nutch/fetcher/OutputThread.java

I always use ./src/java/net/nutch/fetcher/OutputThread.java and have never
used ./src/java/net/nutch/fetcher/Fetcher.java. Fetcher.java is only patched
so that it still compiles. I vaguely remember it being mentioned on the list
that Fetcher.java predates OutputThread.java and is no longer recommended.
Am I wrong?

> 2. I'm confused by your changes in IndexSegment. It seems to me that
> text conversion should be done before this point. So, if a page has
> content-type of text/plain, then its text should be stored in
> fetcherText at fetch time and no content-type logic should be needed at
> index time. Does that make sense?

Ideally content-type logic should not appear at index time, I agree.
But then, for text/plain, you will have to keep two identical copies,
one in fetcherContent and one in fetcherText. That doubles disk space usage.
Let me know and I will make the change.

> 3. In cached.jsp, what we really want to do is set the content-type
> header to be what we got when we fetched the page, then use the raw
> content as its body. I don't think we can implement this correctly
> for, e.g., PDF, from a JSP page. We need to write a servlet for that.
> Perhaps the general way to handle this is to have cached.jsp create a
> frameset, and have the body frame filled in by a servlet that sets the
> content type and includes the raw content. In the short term, your
> hack-upon-a-hack is probably okay here. But, once we start to add more
> content types, this will need to be addressed.

Or provide two links in the search output: one for the htmlized version,
the other for the raw content with the proper content type.

John
_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
