Stefan Neufeind wrote:
Sami Siren wrote:
Stefan Neufeind wrote:
Sami Siren wrote:
redirecting to nutch-user...
What I currently have is that at most 2 matches are shown per website,
but the same limit also applies to the summary website, so only 2 of its
matches are shown. Either I'd need to be able to show only 2 matches per
website but _all_ matches from the summary website (which would be okay
in this case), or give websites 1 to 4 individual "IDs per website" and
also assign each URL from the summary website the ID of the website it
belongs to.
You can add whatever (meta-)data you like to the index with an indexing
filter. You could, for example, assign tag "A" to site A, tag "B" to
site B, etc., and then assign unique tags to pages from the summary site.
In the searching phase you then use that new field as the dedup field
(instead of site).
This should give you a maximum of (for example) 2 hits per website and
unlimited hits from the summary website.
Does that fulfill your requirements?
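
To make the effect of such a dedup field concrete, here is a minimal sketch in plain Java. It does not use any Nutch API: the Hit class, the tag values and limitPerTag are hypothetical names made up for this illustration, and the real per-field limiting is done by the searcher, not by code like this. It only shows why unique tags on summary-site pages leave those pages unlimited.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DedupSketch {
    /** A search hit reduced to the two values that matter here (hypothetical type). */
    static final class Hit {
        final String url;
        final String tag; // value of the custom dedup field

        Hit(String url, String tag) {
            this.url = url;
            this.tag = tag;
        }
    }

    /** Keeps at most maxPerTag hits for each tag value, preserving rank order. */
    static List<Hit> limitPerTag(List<Hit> ranked, int maxPerTag) {
        Map<String, Integer> seen = new HashMap<>();
        List<Hit> kept = new ArrayList<>();
        for (Hit h : ranked) {
            // Count how many hits with this tag have been seen so far.
            int count = seen.merge(h.tag, 1, Integer::sum);
            if (count <= maxPerTag) {
                kept.add(h);
            }
        }
        return kept;
    }
}
```

Pages of site A all carry the tag "A" and are cut off after the limit, while each summary-site page carries its own tag and therefore never reaches the limit.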
That would perfectly fit, yes. But how do I "tag" the pages/URLs? With
what "filter"?
Write a plugin that provides an implementation of
http://lucene.apache.org/nutch/nutch-nightly/docs/api/org/apache/nutch/indexer/IndexingFilter.html
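
The tagging rule such a plugin would apply can be sketched on its own, without the plugin wiring. The class below is plain Java and deliberately does not implement IndexingFilter, because the exact method signature depends on the Nutch version. SiteTagger, addSite, tagFor and the example hosts are assumptions for illustration only. The returned string is what the filter would add to the document as the new field.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

class SiteTagger {
    private final Map<String, String> tagByHost = new HashMap<>();
    private final String summaryHost;

    /** summaryHost is the host whose pages must never be collapsed. */
    SiteTagger(String summaryHost) {
        this.summaryHost = summaryHost;
    }

    /** Registers the tag to use for all pages of one website. */
    void addSite(String host, String tag) {
        tagByHost.put(host, tag);
    }

    /** Returns the value to store in the dedup field for this URL. */
    String tagFor(String url) {
        // URI.create throws IllegalArgumentException on malformed input;
        // a real filter would have to decide how to handle that case.
        String host = URI.create(url).getHost();
        if (host == null || host.equals(summaryHost)) {
            return url; // unique per page, so never deduplicated
        }
        String tag = tagByHost.get(host);
        return tag != null ? tag : host; // unknown sites fall back to their host
    }
}
```

For example, after new SiteTagger("summary.example") and addSite("a.example", "A"), every page under http://a.example/ gets the tag "A", while each page of summary.example gets its own URL as tag.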
That was (part of) my question - how to do that "cleanly", and whether
somebody could give a hint. I'm not sure what would be an elegant way
of having a "match URL against ... and set tags ABC" pattern file, how to
use a hash map or something similar for that, and how to do it in Java.
(Sorry, I'm not as familiar with Java as with other languages, nor with
the Nutch internals.)
If it's a relatively short list of urls (let's say less than 50,000
entries) then you can use org.apache.nutch.util.PrefixStringMatcher,
which builds a compact trie structure. I would then strongly advise you
to keep just the urls (or whatever it is that you need to match) in that
structure, and all other data in an external DB or a special-purpose
Lucene index. You can implement this as an indexing plugin - if the
pattern matches, then you get additional metadata from some external
source, and you add additional fields to the index that contain this data.
I have implemented several plugins that use this trick, and it works
very well.
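
A rough sketch of that arrangement in plain Java follows. The small trie is only a stand-in for org.apache.nutch.util.PrefixStringMatcher, whose API is not assumed here, and the Map passed around as "store" stands in for the external DB or special-purpose index. The class name, the method names and the "prefix, whitespace, tag" pattern-file format are all assumptions made for this example.

```java
import java.util.HashMap;
import java.util.Map;

class PrefixTagIndex {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        String key; // non-null when a registered prefix ends at this node
    }

    private final Node root = new Node();

    /** Registers one URL prefix in the trie. */
    void add(String prefix) {
        Node n = root;
        for (int i = 0; i < prefix.length(); i++) {
            n = n.next.computeIfAbsent(prefix.charAt(i), c -> new Node());
        }
        n.key = prefix;
    }

    /** Returns the longest registered prefix of url, or null if none matches. */
    String longestPrefix(String url) {
        Node n = root;
        String best = n.key;
        for (int i = 0; i < url.length(); i++) {
            n = n.next.get(url.charAt(i));
            if (n == null) {
                break;
            }
            if (n.key != null) {
                best = n.key;
            }
        }
        return best;
    }

    /**
     * Loads pattern lines of the assumed form "prefix tag".
     * Only the prefix goes into the trie; the tag goes into the store.
     * Empty lines and lines starting with '#' are skipped.
     */
    static PrefixTagIndex load(Iterable<String> lines, Map<String, String> store) {
        PrefixTagIndex index = new PrefixTagIndex();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            String[] parts = line.split("\\s+", 2);
            if (parts.length < 2) {
                continue; // no tag given, ignore the line
            }
            index.add(parts[0]);
            store.put(parts[0], parts[1]);
        }
        return index;
    }

    /** Looks up the metadata for url, or returns null if no pattern matches. */
    static String lookup(String url, PrefixTagIndex index, Map<String, String> store) {
        String prefix = index.longestPrefix(url);
        return prefix == null ? null : store.get(prefix);
    }
}
```

An indexing plugin would call something like lookup once per document and, when the result is not null, add it to the document as an extra field.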
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com