Re: Pulling out URLs

Gosavi.Shyam Thu, 12 Mar 2009 05:46:18 -0700

Hi,
try this command:
bin/nutch readseg -dump <segment_dir> <output>

(e.g. bin/nutch readseg -dump ./crawldir/segments/* output.log)
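
Once the dump exists, the "script" suggested further down the thread can be quite small. Here is a minimal sketch in plain Java. It assumes the text dump lists each link on a line of the form `outlink: toUrl: <url> anchor: <text>`; that layout is an assumption, so check it against your own dump. The class name `DumpOutlinkExtractor` is made up for illustration and is not part of Nutch.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DumpOutlinkExtractor {

    // ASSUMPTION: outlink lines in the dump look like
    //   "  outlink: toUrl: http://example.org/ anchor: Example"
    // Adjust the pattern if your dump is laid out differently.
    private static final Pattern OUTLINK =
            Pattern.compile("^\\s*outlink: toUrl: (\\S+) anchor:.*$");

    /** Returns the target URL of every line that matches the assumed outlink layout. */
    static List<String> extractOutlinks(List<String> dumpLines) {
        List<String> urls = new ArrayList<>();
        for (String line : dumpLines) {
            Matcher m = OUTLINK.matcher(line);
            if (m.matches()) {
                urls.add(m.group(1));
            }
        }
        return urls;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: DumpOutlinkExtractor <dump-file>");
            return;
        }
        // Reads the whole dump into memory; fine for a sketch, not for huge dumps.
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        for (String url : extractOutlinks(lines)) {
            System.out.println(url);
        }
    }
}
```

Compile it and pass the path of the dump file as the only argument; it prints one outlink URL per line.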

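To keep only the links that leave MY.DOMAIN.NAME, a host comparison on each URL string is probably enough. This is a sketch that assumes the strings come from `outlinks[i].getToUrl()` in the filter quoted below, or from a dump. `ExternalLinkFilter` and its methods are hypothetical names, not part of Nutch.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class ExternalLinkFilter {

    /** True if the URL has a host that is neither `domain` nor a subdomain of it. */
    static boolean isExternal(String url, String domain) {
        String host;
        try {
            host = new URI(url).getHost();
        } catch (URISyntaxException e) {
            return false; // unparsable URL: treat as not external and skip it
        }
        if (host == null) {
            return false; // relative link, mailto:, javascript:, and similar
        }
        host = host.toLowerCase(Locale.ROOT);
        String d = domain.toLowerCase(Locale.ROOT);
        return !(host.equals(d) || host.endsWith("." + d));
    }

    /** Keeps only the URLs that point outside `domain`, preserving their order. */
    static List<String> keepExternal(List<String> urls, String domain) {
        List<String> external = new ArrayList<>();
        for (String url : urls) {
            if (isExternal(url, domain)) {
                external.add(url);
            }
        }
        return external;
    }
}
```

Links to CSS or JS files on the same host are dropped by this check because their host matches the domain, not because of their file extension.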

Regards
sanjshra


MyD wrote:
> 
> Thank you for the hint. How can this be done with the Segment Reader
> (Nutch 0.9 api)? Thanks in advance.
> 
> Cheers,
> MyD 
> 
> 
> 
> vishal vachhani wrote:
>> 
>> A simple solution would be to dump the segments using the following
>> command and then write a script that extracts the outlinks present in
>> the documents of the segment.
>> 
>> $NUTCH_HOME/bin/nutch readseg -dump -dir <segDirsPath> -nocontent -nofetch -nogenerate -noparse -noparsetext
>> 
>> This will give you a dump file. Run a script over it and you will get the outlinks.
>> 
>> On Thu, Mar 12, 2009 at 9:45 AM, MyD <myd.ro...@googlemail.com> wrote:
>> 
>>>
>>> Hi @ all,
>>>
>>> I started to write my own plugin. I implemented HtmlParseFilter to grab
>>> outlinks to other pages, but it looks like the outlinks are just links
>>> to CSS or JS files, or am I wrong? What is the best way to extract all
>>> outlinks to URLs that are not in the domain MY.DOMAIN.NAME? You will
>>> find my code below...
>>>
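>>> import org.apache.commons.logging.Log;
>>> import org.apache.commons.logging.LogFactory;
>>> import org.apache.hadoop.conf.Configuration;
>>> import org.apache.nutch.parse.HTMLMetaTags;
>>> import org.apache.nutch.parse.HtmlParseFilter;
>>> import org.apache.nutch.parse.Outlink;
>>> import org.apache.nutch.parse.Parse;
>>> import org.apache.nutch.parse.ParseData;
>>> import org.apache.nutch.protocol.Content;
>>> import org.w3c.dom.DocumentFragment;
>>>
>>> public class ComputerScienceConferenceHtmlParser implements HtmlParseFilter {
>>>
>>>     private static final Log LOG =
>>>             LogFactory.getLog(ComputerScienceConferenceHtmlParser.class);
>>>
>>>     private Configuration conf;
>>>
>>>     // Logs every outlink found for this page and returns the parse unchanged.
>>>     public Parse filter(Content content, Parse parse,
>>>             HTMLMetaTags metaTags, DocumentFragment doc) {
>>>
>>>         ParseData parseData = parse.getData();
>>>         Outlink[] outlinks = parseData.getOutlinks();
>>>
>>>         String text = parse.getText();
>>>
>>>         LOG.info("ComputerScienceConferenceHtmlParser: " + text);
>>>
>>>         LOG.warn("BEFORE");
>>>         for (int i = 0; i < outlinks.length; i++) {
>>>             LOG.warn("Content Base URL: " + content.getBaseUrl());
>>>             LOG.warn("Outlink Anchor: " + outlinks[i].getAnchor());
>>>             LOG.warn("Outlinks ToURL: " + outlinks[i].getToUrl());
>>>             LOG.warn("Outlinks toString(): " + outlinks[i].toString());
>>>             // getBaseHref() is null when the page has no <base> tag, so let
>>>             // string concatenation handle it instead of calling toString().
>>>             LOG.warn("metaTags: " + metaTags.getBaseHref());
>>>         }
>>>         LOG.warn("AFTER");
>>>
>>>         return parse;
>>>     }
>>>
>>>     public void setConf(Configuration conf) {
>>>         this.conf = conf;
>>>     }
>>>
>>>     public Configuration getConf() {
>>>         return this.conf;
>>>     }
>>> }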
>>> --
>>> View this message in context:
>>> http://www.nabble.com/Pulling-out-URLs-tp22469643p22469643.html
>>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>>
>>>
>> 
>> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Pulling-out-URLs-tp22469643p22475608.html
Sent from the Nutch - User mailing list archive at Nabble.com.
