Where exactly do we set NUTCH_JAVA_HOME?
Let us say that I installed Java in /usr/local/j2sdk1.4.2_03.
Should I change the nutch file or something in the Java installation?
And exactly how and where?
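
My guess, and it is only an assumption on my part, is that NUTCH_JAVA_HOME
is an environment variable read by the nutch launcher script, so I would set
it in my shell rather than edit any file. Something like this sketch is what
I have in mind (the path is the install location mentioned above; the check
for bin/java is just my own sanity test, not anything from the Nutch docs):

```shell
# Assumption: NUTCH_JAVA_HOME is picked up from the environment by the
# nutch launcher script. The path is the Java install location I mentioned.
NUTCH_JAVA_HOME=/usr/local/j2sdk1.4.2_03
export NUTCH_JAVA_HOME

# Sanity check: report whether a java binary actually exists at that location.
if [ -x "$NUTCH_JAVA_HOME/bin/java" ]; then
    echo "java found at $NUTCH_JAVA_HOME/bin/java"
else
    echo "no java binary under $NUTCH_JAVA_HOME"
fi
```

If that is right, I suppose the same two lines could go into my shell
profile so that the setting survives a new login.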

Thank you.


----- Original Message ----- 
From: "Doug Cutting" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Tuesday, March 02, 2004 5:26 PM
Subject: Re: [Nutch-dev] recording Content-Type


> [EMAIL PROTECTED] wrote:
> > I always use
> > ./src/java/net/nutch/fetcher/OutputThread.java
> > and never used
> > ./src/java/net/nutch/fetcher/Fetcher.java
> >
> > Fetcher.java is only patched to make it pass compile.
> > I vaguely remember it was mentioned before on the list that Fetcher.java
> > predates OutputThread.java and is no longer recommended. Am I wrong?
>
> They have different bugs.  Fetcher.java doesn't observe robots.txt, but
> it is simple and fast.  RequestScheduler.java & friends (including
> OutputThread.java) implement robots.txt plus lots of other politeness
> options, but also frequently hang.  No one has yet fixed this, and the
> fellow who wrote that code is no longer working on Nutch.  Where we go
> depends on where contributors take us: we could add robots.txt support
> to Fetcher.java, or someone could fix the hangs in RequestScheduler.  Or
> someone could contribute an all new fetcher.
>
> Until this is resolved we should probably maintain both.
>
> > Ideally content-type logic should not appear at index time, I agree.
> > Then, for text/plain, you will have to keep two identical copies in both
> > fetcherContent and fetcherText. It doubles disk space usage.
> > Let me know and I will make the change.
>
> I don't think there's enough plain text content out there that this is a
> big problem.  It's also gzipped in both places, for what that's worth.
> And plain text pages don't have any markup overhead, so even if half the
> pages are plain text, they'll still only consume a small percentage of
> the space: much of the space is markup.  So I'm not worried about this.
>
> > Or provide two links in search output: one for htmlized version,
>
> You mean, for example, convert PDF to HTML?  That's a feature that
> should be added someday, but I don't think of it as an alternative,
> rather an addition.
>
> > the other for raw content with proper content type.
>
> That's the case I think we're talking about.  The question is, should it
> have a header that identifies it as a cache, and potentially highlights
> terms, etc.  Google and Yahoo! both do this.  Or should it just serve up
> the raw content, like the Internet Archive's wayback machine.  We should
> walk before we run, so I think, for now, just having the "cached" button
> bring up the raw content with appropriate content type is probably best,
> no header, no highlighting, no translation, etc.  These other features
> can all be added later, but first we should implement the basic
> functionality.
>
> Doug
>
>
> -------------------------------------------------------
> SF.Net is sponsored by: Speed Start Your Linux Apps Now.
> Build and deploy apps & Web services for Linux with
> a free DVD software kit from IBM. Click Now!
> http://ads.osdn.com/?ad_id=1356&alloc_id=3438&op=click
> _______________________________________________
> Nutch-developers mailing list
> [EMAIL PROTECTED]
> https://lists.sourceforge.net/lists/listinfo/nutch-developers
>

