Where exactly do we set NUTCH_JAVA_HOME? Let us say that I installed Java in /usr/local/j2sdk1.4.2_03. Should I change the nutch file or the Java installation, and exactly how and where?
Thank you

----- Original Message -----
From: "Doug Cutting" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Tuesday, March 02, 2004 5:26 PM
Subject: Re: [Nutch-dev] recording Content-Type

> [EMAIL PROTECTED] wrote:
> > I always use
> > ./src/java/net/nutch/fetcher/OutputThread.java
> > and never used
> > ./src/java/net/nutch/fetcher/Fetcher.java
> >
> > Fetcher.java is only patched to make it pass compile.
> > I vaguely remember, it's mentioned before on the list, that Fetcher.java
> > predates OutputThread.java and is no longer recommended. Am I wrong?
>
> They have different bugs. Fetcher.java doesn't observe robots.txt, but
> it is simple and fast. RequestScheduler.java & friends (including
> OutputThread.java) implement robots.txt plus lots of other politeness
> options, but also frequently hang. No one has yet fixed this, and the
> fellow who wrote that code is no longer working on Nutch. Where we go
> depends on where contributors take us: we could add robots.txt support
> to Fetcher.java, or someone could fix the hangs in RequestScheduler. Or
> someone could contribute an all new fetcher.
>
> Until this is resolved we should probably maintain both.
>
> > Ideally content-type logic should not appear at index time, I agree.
> > Then, for text/plain, you will have to keep two identical copies in both
> > fetcherContent and fetcherText. It doubles disk space usage.
> > Let me know and I will make the change.
>
> I don't think there's enough plain text content out there that this is a
> big problem. It's also gzipped in both places, for what that's worth.
> And plain text pages don't have any markup overhead, so even if half the
> pages are plain text, they'll still only consume a small percentage of
> the space: much of the space is markup. So I'm not worried about this.
>
> > Or provide two links in search output: one for htmlized version,
>
> You mean, for example, convert PDF to HTML? That's a feature that
> should be added someday, but I don't think of it as an alternative,
> rather an addition.
>
> > the other for raw content with proper content type.
>
> That's the case I think we're talking about. The question is, should it
> have a header that identifies it as a cache, and potentially highlights
> terms, etc. Google and Yahoo! both do this. Or should it just serve up
> the raw content, like the Internet Archive's wayback machine. We should
> walk before we run, so I think, for now, just having the "cached" button
> bring up the raw content with appropriate content type is probably best,
> no header, no highlighting, no translation, etc. These other features
> can all be added later, but first we should implement the basic
> functionality.
>
> Doug
>
>
> -------------------------------------------------------
> SF.Net is sponsored by: Speed Start Your Linux Apps Now.
> Build and deploy apps & Web services for Linux with
> a free DVD software kit from IBM. Click Now!
> http://ads.osdn.com/?ad_id=1356&alloc_id=3438&op=click
> _______________________________________________
> Nutch-developers mailing list
> [EMAIL PROTECTED]
> https://lists.sourceforge.net/lists/listinfo/nutch-developers
