Vacuum Joe wrote:
That's correct. Technically speaking, this is
possible to do (ParseData.getMetadata()); we just haven't decided yet how to treat multiple values under the same key.


I have a good idea of how to handle that situation. If there are multiple and conflicting values for
important meta-data such as the content-type, the page
is horribly broken, and Nutch shouldn't waste effort
trying to figure out what's going on.  For example, if
[..]

I understand your position, and respectfully disagree. I could give you a lot of examples of horribly broken servers (among others, some versions of MS IIS) and horribly broken pages that don't follow any standards, which nonetheless contain valuable content, and Nutch should be able to crawl such sites too.
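
To make the "multiple values under the same key" question concrete, here is a minimal sketch of a lenient, multi-valued metadata holder. It is only an illustration: the class MultiValueMetadata and its methods are hypothetical names, not part of the Nutch API, and "first value wins" is just one possible policy, assumed here for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper (not a Nutch class): keeps every value seen for a key
// instead of rejecting a page whose metadata is duplicated or inconsistent.
class MultiValueMetadata {
    private final Map<String, List<String>> values = new LinkedHashMap<>();

    // Header names are case-insensitive, so the key is normalized before storing.
    void add(String key, String value) {
        values.computeIfAbsent(key.toLowerCase(Locale.ROOT), k -> new ArrayList<>())
              .add(value.trim());
    }

    // All values recorded for the key, in the order they were added.
    List<String> getAll(String key) {
        List<String> found = values.get(key.toLowerCase(Locale.ROOT));
        return found == null
                ? Collections.<String>emptyList()
                : Collections.unmodifiableList(found);
    }

    // Assumed policy for this sketch: the first value wins.
    String getFirst(String key) {
        List<String> all = getAll(key);
        return all.isEmpty() ? null : all.get(0);
    }

    // True when the key carries at least two different values.
    boolean hasConflict(String key) {
        return new HashSet<>(getAll(key)).size() > 1;
    }
}
```

With something like this, a page that sends both "text/html" and "text/plain" as its content type can still be processed using getFirst(), while hasConflict() lets the caller log or down-weight the page instead of dropping it.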

That is the valuable piece of information.  Is there
some reasonable way to use that as part of a Query, so
I could filter only on certain content types, or give
a higher priority to certain types in the results?

Yes, please see the index-more and query-more plugins. This is already implemented.



Please see
http://issues.apache.org/jira/browse/NUTCH-62 for a discussion on this subject.


I'm looking at that.  I have a few other ideas that go
a bit beyond that to make it more useful for
content-oriented searches.

Basically, the content type is a very important piece
of meta-data.  There are only a small number of valid
content types.  A look at a typical mime types file
shows only about 400 types, and of those only about 20
or 30 are in common use for search users.  My idea is
to add two bytes to the page header in the segment
which gives a key to the mime content type, and then
be able to build indexes or otherwise process that
key.

See above - this is already implemented. The link I gave you discusses other metadata, because the content type is stored separately.
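
For readers following the two-byte key idea quoted above, here is a rough sketch of what such a mapping could look like. It is purely illustrative and does not describe how Nutch actually stores the content type; the class MimeTypeCodes, its methods, and the choice of 0 as the "unknown" code are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical registry (not a Nutch class): maps MIME type strings to
// two-byte codes, assigning codes in order of first appearance.
class MimeTypeCodes {
    static final short UNKNOWN = 0; // assumed convention: code 0 means "no known type"

    private final Map<String, Short> codeByType = new HashMap<>();
    private final List<String> typeByCode = new ArrayList<>();

    MimeTypeCodes() {
        typeByCode.add(null); // reserve index 0 for UNKNOWN
    }

    // Returns the existing code for the type, or assigns the next free one.
    short codeFor(String mimeType) {
        String key = normalize(mimeType);
        Short existing = codeByType.get(key);
        if (existing != null) {
            return existing;
        }
        if (typeByCode.size() > Short.MAX_VALUE) {
            throw new IllegalStateException("two-byte code space exhausted");
        }
        short code = (short) typeByCode.size();
        typeByCode.add(key);
        codeByType.put(key, code);
        return code;
    }

    // Reverse lookup; returns null for UNKNOWN or an unassigned code.
    String typeFor(short code) {
        return (code > 0 && code < typeByCode.size()) ? typeByCode.get(code) : null;
    }

    // Drops parameters such as "; charset=UTF-8" and lower-cases the type.
    static String normalize(String mimeType) {
        int semi = mimeType.indexOf(';');
        String base = semi >= 0 ? mimeType.substring(0, semi) : mimeType;
        return base.trim().toLowerCase(Locale.ROOT);
    }
}
```

Since only a few hundred types exist in a typical mime types file, a two-byte code leaves ample room; the table itself would have to be persisted alongside the data for the codes to stay meaningful.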

I realize this is a big change from the current
direction for handling content types, but to me this
seems like the sanest way to make it useful.  What do
you think?

No change, this is the way we already do it.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
