Thank you for the hint. How can this be done with the SegmentReader (Nutch
0.9 API)? Thanks in advance.

Cheers,
MyD 



vishal vachhani wrote:
> 
> A simple solution would be to dump the segments using the following command
> and then write a script that extracts the outlinks present in the documents
> of the segment.
> 
> $NUTCH_home/bin/nutch readseg -dump -dir <segDirsPath> -nocontent -nofetch
> -nogenerate -noparse -noparsetext
> 
> This will give you a dump file. Run a script over it and you will get the
> outlinks.
> 
> On Thu, Mar 12, 2009 at 9:45 AM, MyD <myd.ro...@googlemail.com> wrote:
> 
>>
>> Hi @ all,
>>
>> I started to write my own plugin. I implemented HtmlParseFilter to grab
>> outlinks to other pages, but the outlinks I get seem to be only links to
>> CSS or JS files. Or am I wrong? What is the best way to extract all
>> outlinks that point to a URL outside the domain MY.DOMAIN.NAME? You will
>> find my code below...
>>
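>> import org.apache.commons.logging.Log;
>> import org.apache.commons.logging.LogFactory;
>> import org.apache.hadoop.conf.Configuration;
>> import org.apache.nutch.parse.HTMLMetaTags;
>> import org.apache.nutch.parse.HtmlParseFilter;
>> import org.apache.nutch.parse.Outlink;
>> import org.apache.nutch.parse.Parse;
>> import org.apache.nutch.parse.ParseData;
>> import org.apache.nutch.protocol.Content;
>> import org.w3c.dom.DocumentFragment;
>>
>> public class ComputerScienceConferenceHtmlParser implements HtmlParseFilter {
>>
>>     private static final Log LOG =
>>         LogFactory.getLog(ComputerScienceConferenceHtmlParser.class.getName());
>>
>>     private Configuration conf;
>>
>>     public Parse filter(Content content, Parse parse, HTMLMetaTags metaTags,
>>                         DocumentFragment doc) {
>>
>>         ParseData parseData = parse.getData();
>>         Outlink[] outlinks = parseData.getOutlinks();
>>
>>         String text = parse.getText();
>>
>>         LOG.info("ComputerScienceConferenceHtmlParser: " + text);
>>
>>         LOG.warn("BEFORE");
>>         // These two values are the same for every outlink, so log them once.
>>         LOG.warn("Content Base URL: " + content.getBaseUrl());
>>         // getBaseHref() may be null; string concatenation avoids a
>>         // NullPointerException that calling toString() on it would raise.
>>         LOG.warn("metaTags: " + metaTags.getBaseHref());
>>         for (int i = 0; i < outlinks.length; i++) {
>>             LOG.warn("Outlink Anchor: " + outlinks[i].getAnchor());
>>             LOG.warn("Outlinks ToURL: " + outlinks[i].getToUrl());
>>             LOG.warn("Outlinks toString(): " + outlinks[i].toString());
>>         }
>>         LOG.warn("AFTER");
>>
>>         // Return the parse unchanged; this filter only logs.
>>         return parse;
>>     }
>>
>>     public void setConf(Configuration conf) {
>>         this.conf = conf;
>>     }
>>
>>     public Configuration getConf() {
>>         return this.conf;
>>     }
>> }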
>> --
>> View this message in context:
>> http://www.nabble.com/Pulling-out-URLs-tp22469643p22469643.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>
>>
> 
> 
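
The "run a script" step suggested above could look roughly like the
following. This is only a sketch, not something from the Nutch distribution.
It assumes that the dump file lists each outlink on its own line in the form
"outlink: toUrl: <url> anchor: <text>", so please check that against your own
dump first. The class name DumpOutlinks and its helper methods are made up
for this example, and MY.DOMAIN.NAME is the placeholder from the original
question.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls outlinks out of a readseg text dump and keeps
// only those that point outside a given domain.
final class DumpOutlinks {

    // ASSUMPTION: outlink lines in the dump look like
    //   "  outlink: toUrl: http://host/path anchor: some text"
    private static final Pattern OUTLINK =
        Pattern.compile("^\\s*outlink:\\s+toUrl:\\s+(\\S+)\\s+anchor:.*$");

    /** Returns the target URL of an outlink line, or null for any other line. */
    static String parseOutlink(String line) {
        Matcher m = OUTLINK.matcher(line);
        return m.matches() ? m.group(1) : null;
    }

    /** True if the URL's host is neither the domain itself nor a subdomain of it. */
    static boolean isExternal(String url, String domain) {
        try {
            String host = new URI(url).getHost();
            if (host == null) {
                return false; // no host part (e.g. mailto: or relative link): skip
            }
            host = host.toLowerCase(Locale.ROOT);
            String d = domain.toLowerCase(Locale.ROOT);
            return !(host.equals(d) || host.endsWith("." + d));
        } catch (URISyntaxException e) {
            return false; // URLs that java.net.URI cannot parse are skipped
        }
    }

    /** Collects all external outlink URLs found in the given dump lines. */
    static List<String> externalOutlinks(List<String> dumpLines, String domain) {
        List<String> result = new ArrayList<>();
        for (String line : dumpLines) {
            String url = parseOutlink(line);
            if (url != null && isExternal(url, domain)) {
                result.add(url);
            }
        }
        return result;
    }

    // Usage: java DumpOutlinks.java <dumpFile> <domain>
    // Reads the whole file into memory as UTF-8, which is fine for small dumps.
    public static void main(String[] args) throws Exception {
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        for (String url : externalOutlinks(lines, args[1])) {
            System.out.println(url);
        }
    }
}
```

Run it with the path to the dump file and the domain to exclude, for example
"java DumpOutlinks.java <dumpFile> MY.DOMAIN.NAME". Matching on the host
suffix treats subdomains such as www.MY.DOMAIN.NAME as internal; if you need
a stricter or looser rule, isExternal is the only place to change.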

-- 
View this message in context: 
http://www.nabble.com/Pulling-out-URLs-tp22469643p22474780.html
Sent from the Nutch - User mailing list archive at Nabble.com.
