Hi, Doug,

On Tue, Mar 02, 2004 at 10:53:34AM -0800, Doug Cutting wrote:
> [EMAIL PROTECTED] wrote:
> >Attached is the patch. A contentType is introduced in FetcherOutput,
> >together with modifications in OutputThread.java, IndexSegment.java
> >and Fetcher.java (Fetcher.java is no longer used?). Also tweaked are
> >cached.jsp and search.jsp. As now, nutch will fetch/index text/plain
> >besides text/html. ContentType will be displayed in search results too.
> 
> Overall this looks great.
> 
> I have a few concerns:
> 
> 1. I think that in the fetcher, when the content-type is null, we should 
> assume that it is text/html.  In other words, text/html is the default 
> content type.  Perhaps this should be configurable, but that should be 
> the default behaviour.  It lets us get lots of content we'd otherwise miss.

I will make text/html the default in
./src/java/net/nutch/fetcher/OutputThread.java

I always use
./src/java/net/nutch/fetcher/OutputThread.java
and never used
./src/java/net/nutch/fetcher/Fetcher.java

Fetcher.java is patched only so that it still compiles.
I vaguely remember it being mentioned on the list that Fetcher.java
predates OutputThread.java and is no longer recommended. Am I wrong?
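
Roughly, the defaulting would look like the sketch below. This is not the
actual patch: the class and method names are made up for illustration, and
stripping parameters such as "; charset=..." is my own assumption, not
something the fetcher is known to do today.

```java
import java.util.Locale;

// Hypothetical helper, not part of Nutch: picks the content type to record
// for a fetched page when the server's header is missing or empty.
final class ContentTypeDefaults {
    // Default suggested in this thread: treat unknown content as HTML.
    static final String DEFAULT_CONTENT_TYPE = "text/html";

    static String effectiveContentType(String header) {
        if (header == null || header.trim().isEmpty()) {
            return DEFAULT_CONTENT_TYPE;
        }
        // Assumption: drop parameters, e.g. "text/plain; charset=utf-8".
        int semi = header.indexOf(';');
        String base = (semi >= 0 ? header.substring(0, semi) : header).trim();
        if (base.isEmpty()) {
            return DEFAULT_CONTENT_TYPE;
        }
        return base.toLowerCase(Locale.ROOT);
    }
}
```

If the default should be configurable, DEFAULT_CONTENT_TYPE could be read
from the configuration instead of being a constant.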

> 
> 2. I'm confused by your changes in IndexSegment.  It seems to me that 
> text conversion should be done before this point.  So, if a page has 
> content-type of text/plain, then it's text should be stored in 
> fetcherText at fetch time and no content-type logic should be needed at 
> index time.  Does that make sense?

Ideally content-type logic should not appear at index time, I agree.
But then, for text/plain, we would have to keep two identical copies, one in
fetcherContent and one in fetcherText. That doubles the disk space used by
those pages. Let me know and I will make the change.
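
To make the trade-off concrete, here is a minimal sketch of doing the text
conversion at fetch time. Everything in it is hypothetical: the class name,
the ISO-8859-1 decoding, and the crude tag stripping, which only stands in
for whatever HTML parsing the fetcher really does.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not Nutch code: derive the text to store alongside
// the raw content at fetch time, so no content-type logic is needed later.
final class FetchTimeText {
    static String extractText(String contentType, byte[] raw) {
        // Assumption: decode as ISO-8859-1; a real fetcher would honour the charset.
        String decoded = new String(raw, StandardCharsets.ISO_8859_1);
        if ("text/plain".equals(contentType)) {
            // The text is the content itself, so it ends up stored twice.
            return decoded;
        }
        if ("text/html".equals(contentType)) {
            // Placeholder for real HTML parsing: drop tags, collapse whitespace.
            return decoded.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
        }
        // Other types are not converted in this sketch.
        return "";
    }
}
```

For text/plain the returned string carries the same characters as the raw
bytes, which is the duplicate copy I mentioned.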

> 3. In cached.jsp, what we really want to do is set the content-type 
> header to be what we got when we fetched the page, then use the raw 
> content as it's body.  I don't think we can implement this correctly 
> for, e.g., PDF, from a JSP page.  We need to write a servlet for that. 
> Perhaps the general way to handle this is to have cached.jsp create a 
> frameset, and have the body frame filled in by a servlet that sets the 
> content type and includes the raw content.  In the short term, your 
> hack-upon-a-hack is probably okay here.  But, once we start to add more 
> content types, this will need to be addressed.

Or provide two links in the search output: one for the htmlized version,
the other for the raw content with its proper content type.
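
For the raw-content link, something along these lines might do. It is only a
sketch: to keep it standalone it uses the JDK's built-in
com.sun.net.httpserver instead of a real servlet, the class name is
invented, and falling back to text/html when no type was stored is my
assumption, borrowed from point 1.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpHandler;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical sketch, not Nutch code: serve one cached page's raw bytes
// with the content type that was recorded when the page was fetched.
final class RawCachedContentHandler implements HttpHandler {
    private final String storedContentType; // as recorded at fetch time; may be null
    private final byte[] rawContent;

    RawCachedContentHandler(String storedContentType, byte[] rawContent) {
        this.storedContentType = storedContentType;
        this.rawContent = rawContent.clone();
    }

    // Assumption: fall back to text/html when nothing was recorded.
    static String headerValue(String stored) {
        return (stored == null || stored.trim().isEmpty()) ? "text/html" : stored.trim();
    }

    @Override
    public void handle(HttpExchange exchange) throws IOException {
        // Set the header first, then send the untouched bytes as the body.
        exchange.getResponseHeaders().set("Content-Type", headerValue(storedContentType));
        exchange.sendResponseHeaders(200, rawContent.length);
        try (OutputStream out = exchange.getResponseBody()) {
            out.write(rawContent);
        }
    }
}
```

A servlet version would do the same two steps with
response.setContentType(...) and response.getOutputStream().write(...).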

John


_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers