Re: [Nutch-dev] recording Content-Type

Doug Cutting Tue, 02 Mar 2004 14:29:49 -0800

[EMAIL PROTECTED] wrote:

I always use
./src/java/net/nutch/fetcher/OutputThread.java
and never used
./src/java/net/nutch/fetcher/Fetcher.java

Fetcher.java is only patched to make it compile.
I vaguely remember it being mentioned on the list before that Fetcher.java
predates OutputThread.java and is no longer recommended. Am I wrong?

They have different bugs. Fetcher.java doesn't observe robots.txt, but it is simple and fast. RequestScheduler.java & friends (including OutputThread.java) implement robots.txt plus lots of other politeness options, but also frequently hang. No one has yet fixed this, and the fellow who wrote that code is no longer working on Nutch. Where we go depends on where contributors take us: we could add robots.txt support to Fetcher.java, or someone could fix the hangs in RequestScheduler. Or someone could contribute an all-new fetcher.

Until this is resolved we should probably maintain both.

Ideally content-type logic should not appear at index time, I agree.
Then, for text/plain, you will have to keep two identical copies in both
fetcherContent and fetcherText. It doubles disk space usage.
Let me know and I will make the change.

I don't think there's enough plain text content out there that this is a big problem. It's also gzipped in both places, for what that's worth. And plain text pages don't have any markup overhead, so even if half the pages are plain text, they'll still only consume a small percentage of the space: much of the space is markup. So I'm not worried about this.

Or provide two links in search output: one for htmlized version,

You mean, for example, convert PDF to HTML? That's a feature that should be added someday, but I don't think of it as an alternative, rather an addition.

the other for raw content with proper content type.

That's the case I think we're talking about. The question is, should it have a header that identifies it as a cache, and potentially highlights terms, etc.? Google and Yahoo! both do this. Or should it just serve up the raw content, like the Internet Archive's Wayback Machine? We should walk before we run, so I think, for now, just having the "cached" button bring up the raw content with appropriate content type is probably best: no header, no highlighting, no translation, etc. These other features can all be added later, but first we should implement the basic functionality.
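
To make the "raw content with appropriate content type" idea concrete, here is a rough sketch, not Nutch code. The class CachedContentSketch, the CachedPage holder, and the writeRaw method are all hypothetical names made up for illustration, and the fallback to application/octet-stream when no type was recorded is an assumption, not something decided in this thread. It only shows the shape of the thing: send the stored bytes back untouched, labeled with the Content-Type recorded at fetch time.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

public class CachedContentSketch {

    /** Hypothetical holder for what was stored at fetch time. */
    public static final class CachedPage {
        final byte[] raw;          // bytes exactly as fetched
        final String contentType;  // recorded Content-Type, may be null

        public CachedPage(byte[] raw, String contentType) {
            this.raw = raw;
            this.contentType = contentType;
        }
    }

    /**
     * Writes a bare HTTP response: status line, Content-Type,
     * Content-Length, then the raw bytes. No cache header, no
     * highlighting, no translation.
     */
    public static void writeRaw(CachedPage page, OutputStream out)
            throws IOException {
        // Assumption: fall back to a generic binary type if none was recorded.
        String type = (page.contentType == null || page.contentType.isEmpty())
                ? "application/octet-stream"
                : page.contentType;

        String head = "HTTP/1.0 200 OK\r\n"
                + "Content-Type: " + type + "\r\n"
                + "Content-Length: " + page.raw.length + "\r\n"
                + "\r\n";

        out.write(head.getBytes(StandardCharsets.US_ASCII));
        out.write(page.raw); // body is passed through unchanged
        out.flush();
    }
}
```

In a real webapp this would presumably live in a servlet or JSP and set the type on the response object rather than writing headers by hand; the point is only that the body is not rewritten.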

Doug


_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
