I am adding some more info to my post, based on what I have been looking into.
I have found the LinkDbReader, and it seems to be able to dump text
out to a file. Unfortunately, it only dumps to a file, so I would need
to parse that file (or I might have missed something). If this is the
correct class, that will have to do. Here is a snippet of the
LinkDbReader output for a crawl of one of my test machines, which has
the Apache documentation installed. The output of the reader is:
<snippet>
http://httpd.apache.org/ Inlinks:
fromUrl: http://nutchdev-1/manual/ anchor: HTTP Server
http://httpd.apache.org/docs-project/ Inlinks:
fromUrl: http://nutchdev-1/manual/ anchor: Documentation
fromUrl: http://nutchdev-1/manual/ anchor:
http://www.apache.org/ Inlinks:
fromUrl: http://nutchdev-1/manual/ anchor: Apache
http://www.apache.org/foundation/preFAQ.html Inlinks:
fromUrl: http://nutchdev-1/ anchor: Apache web server
http://www.apache.org/licenses/LICENSE-2.0 Inlinks:
fromUrl: http://nutchdev-1/manual/ anchor: Apache License, Version 2.0
</snippet>
So, am I right to assume that each entry lists the outlink first, and
that the Inlinks under it are the pages where that link was found?
I'll just have to figure out the format so I can parse it. I'll
probably write a wrapper that exports to XML or something similar to
make transforming this easier.
Anyway, am I on the right track?
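For reference, here is a minimal sketch of the kind of parser I have in mind. It assumes the dump looks exactly like the snippet above: a header line ending in "Inlinks:" that names the linked-to URL, followed by "fromUrl: ... anchor: ..." lines for the pages that link to it. LinkDbDumpParser and Inlink are hypothetical names of my own, not Nutch classes, and I have not checked the format against the Nutch source.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not part of Nutch) that parses the text dump shown above. */
final class LinkDbDumpParser {

    /** One "fromUrl: ... anchor: ..." line from the dump. */
    static final class Inlink {
        final String fromUrl;
        final String anchor; // empty when the dump shows no anchor text

        Inlink(String fromUrl, String anchor) {
            this.fromUrl = fromUrl;
            this.anchor = anchor;
        }
    }

    // Markers assumed from the sample output, not taken from the Nutch source.
    private static final String HEADER_SUFFIX = "Inlinks:";
    private static final String FROM_PREFIX = "fromUrl:";
    private static final String ANCHOR_MARKER = " anchor:";

    /** Maps each linked-to URL to its inlinks, in the order they appear in the dump. */
    static Map<String, List<Inlink>> parse(Iterable<String> lines) {
        Map<String, List<Inlink>> result = new LinkedHashMap<>();
        List<Inlink> current = null;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (line.startsWith(FROM_PREFIX)) {
                if (current == null) {
                    continue; // inlink line before any header line: skip it
                }
                String rest = line.substring(FROM_PREFIX.length());
                int split = rest.indexOf(ANCHOR_MARKER);
                String from = (split < 0 ? rest : rest.substring(0, split)).trim();
                String anchor = split < 0
                        ? ""
                        : rest.substring(split + ANCHOR_MARKER.length()).trim();
                current.add(new Inlink(from, anchor));
            } else if (line.endsWith(HEADER_SUFFIX)) {
                String target =
                        line.substring(0, line.length() - HEADER_SUFFIX.length()).trim();
                current = result.computeIfAbsent(target, k -> new ArrayList<>());
            }
            // Any other line is ignored in this sketch.
        }
        return result;
    }
}
```

Feeding it the lines of the dump file gives a map keyed by the linked-to URL, which should be easy enough to turn into XML from there.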
Briggs.
On 4/18/07, Briggs <[EMAIL PROTECTED]> wrote:
Is it possible to determine which domain(s) an outlink was found on?
The only way I know is to limit the crawl to a single domain (so I
would know where the outlink came from). Also, I am having difficulty
figuring out how, in 0.9 (probably the same in 0.8), to easily get the
outlinks for my segments. In Nutch 0.7.* we used to do something like:
<snippet>
segmentReader = createSegmentReader(segment);
// Holders that segmentReader.next() fills in for each entry in the segment.
final FetcherOutput fetcherOutput = new FetcherOutput();
final Content content = new Content();
final ParseData indexParseData = new ParseData();
final ParseText parseText = new ParseText();
while (segmentReader.next(fetcherOutput, content, parseText, indexParseData)) {
    // Collect the outlinks of the entry that was just read.
    extractOutlinksFromParseData(indexParseData, outlinks);
}
</snippet>
<snippet>
private void extractOutlinksFromParseData(final ParseData indexParseData,
                                          final Set<String> outlinks) {
    for (final Outlink outlink : indexParseData.getOutlinks()) {
        if (null != outlink && outlink.getToUrl() != null) {
            outlinks.add(outlink.getToUrl());
        }
    }
}
</snippet>
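On the first question, the bookkeeping side is simple once the URL of the page holding each outlink is known: take the host of that page and remember it per outlink. Below is a minimal sketch using only java.net.URI. OutlinkSources, hostOf and record are hypothetical names, and how to get the source page URL out of a 0.8/0.9 segment is exactly the part I have not figured out, so it is not shown.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper (not part of Nutch) that tracks which hosts an outlink was found on. */
final class OutlinkSources {

    /** Returns the lower-cased host of pageUrl, or null if it cannot be determined. */
    static String hostOf(String pageUrl) {
        try {
            String host = new URI(pageUrl).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            return null; // not a parseable URL
        }
    }

    /** Records that the outlink toUrl was found on the page fromUrl. */
    static void record(Map<String, Set<String>> hostsByOutlink, String fromUrl, String toUrl) {
        String host = hostOf(fromUrl);
        if (host == null) {
            return; // skip pages whose host is unknown
        }
        hostsByOutlink.computeIfAbsent(toUrl, k -> new TreeSet<>()).add(host);
    }
}
```

Calling record(...) once per (page, outlink) pair leaves, for each outlink, the set of hosts it was seen on.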
I am finally taking the plunge and attempting to get this thing (my
application) up to date with the latest and greatest!
Thanks for your time! And once I really get through this code, I
promise to start posting answers.
Briggs.
--
"Conscious decisions by conscious minds are what make reality real"