Hi Roseline,

> Does it work at all with Chrome?

Yes.

> It seems you need to have some form of GUI to run it?

You need graphics libraries but not necessarily a graphical system. Normally,
you run the browser in headless mode, without a graphical device (monitor)
attached.

> Is there some documentation or tutorial on this?

The README is probably the best documentation:

  src/plugin/protocol-selenium/README.md
  https://github.com/apache/nutch/tree/master/src/plugin/protocol-selenium

After installing Chromium and the Selenium chromedriver, you can test whether
it works by running:

  bin/nutch parsechecker \
    -Dplugin.includes='protocol-selenium|parse-tika' \
    -Dselenium.grid.binary=/path/to/selenium/chromedriver \
    -Dselenium.driver=chrome \
    -Dselenium.enable.headless=true \
    -followRedirects -dumpText URL

Caveat: because browsers are updated frequently, you may need to use a recent
driver version and possibly also upgrade the Selenium dependencies in Nutch.
Let us know if you need help here.
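If the parsechecker test succeeds, you can enable the plugin for the whole
crawl in nutch-site.xml. A minimal sketch only (the selenium.* property names
are the ones used in the command above; keep the rest of your plugin.includes
and adjust the driver path to your setup):

  <!-- replace protocol-http by protocol-selenium -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-selenium|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
  <!-- run a headless Chrome via the chromedriver binary -->
  <property>
    <name>selenium.driver</name>
    <value>chrome</value>
  </property>
  <property>
    <name>selenium.grid.binary</name>
    <value>/path/to/selenium/chromedriver</value>
  </property>
  <property>
    <name>selenium.enable.headless</name>
    <value>true</value>
  </property>

To check whether a page really needs JavaScript, you can run the parsechecker
command above twice, once with -Dplugin.includes='protocol-http|parse-tika'
and once with protocol-selenium, and compare the extracted text.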
> My use case is Text mining and Machine Learning classification. I'm indexing
> into Solr and then transferring the indexed data to MongoDB for further
> processing.

Well, that's not an untypical use case for Nutch. And it's a long pipeline:
fetching, HTML parsing, extracting content fields, indexing. Nutch is able to
perform all steps. But I'd agree that browser-based crawling isn't that easy
to set up with Nutch.
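For the Solr-to-MongoDB transfer you may not even need custom code. A rough
sketch using Solr's select API and mongoimport (the Solr core name "nutch"
and the MongoDB database/collection names are placeholders for your setup):

  curl 'http://localhost:8983/solr/nutch/select?q=*:*&wt=json&rows=10000' \
    | jq '.response.docs' \
    | mongoimport --db crawl --collection pages --jsonArray

For a large index you would page through the results with Solr's cursorMark
parameter instead of requesting one big rows value.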
Best,
Sebastian

On 1/12/22 17:53, Roseline Antai wrote:
> Hi Sebastian,
>
> Thank you. I did enjoy the holiday. Hope you did too.
>
> I have had a look at the protocol-selenium plugin, but it was a bit
> difficult to understand. It appears it only works with Firefox. Does it
> work at all with Chrome? I was also not sure of what values to set for the
> properties. It seems you need to have some form of GUI to run it?
>
> Is there some documentation or tutorial on this? My guess is that some of
> the pages might not be crawling because of JavaScript. I might be wrong,
> but would want to test that.
>
> I think it would be quite good for my use case because I am trying to
> implement broad crawling.
>
> My use case is Text mining and Machine Learning classification. I'm
> indexing into Solr and then transferring the indexed data to MongoDB for
> further processing.
>
> Kind regards,
> Roseline
>
> -----Original Message-----
> From: Sebastian Nagel <wastl.na...@googlemail.com.INVALID>
> Sent: 12 January 2022 16:12
> To: user@nutch.apache.org
> Subject: Re: Nutch not crawling all URLs
>
> Hi Roseline,
>
>> the mail below went to my junk folder and I didn't see it.
>
> No problem. I hope you nevertheless enjoyed the holidays. And sorry for any
> delays, but I want to emphasize that Nutch is a community project and it
> may take a few days until somebody finds the time to respond.
>
>> Could you confirm if you received all the urls I sent?
>
> I've tried a few URLs you sent but not all of them. And figuring out the
> reason why a site isn't crawled may take some time.
>
>> Another question I have about Nutch is if it has problems with
>> crawling javascript pages?
>
> By default Nutch does not execute Javascript.
>
> There is a protocol plugin (protocol-selenium) to fetch pages with a web
> browser between Nutch and the crawled sites. This way Javascript pages can
> be crawled for the price of some overhead in setting up the crawler and
> network traffic to fetch the page dependencies (CSS, Javascript, images).
>
>> I would ideally love to make the crawler work for my URLs rather than
>> start checking for other crawlers and waste all the work so far.
>
> Well, Nutch is for sure a good crawler. But as always: there are many other
> crawlers which might be better adapted to a specific use case.
>
> What's your use case? Indexing into Solr or Elasticsearch?
> Text mining? Archiving content?
>
> Best,
> Sebastian
>
> On 1/12/22 12:13, Roseline Antai wrote:
>> Hi Sebastian,
>>
>> For some reason, the mail below went to my junk folder and I didn't see it.
>>
>> The notco page - https://notco.com/ - was not indexed, no. When I enabled
>> redirects, I was able to get a few pages, but they don't seem valid.
>>
>> Could you confirm if you received all the URLs I sent?
>>
>> Another question I have about Nutch is if it has problems with crawling
>> javascript pages?
>>
>> I would ideally love to make the crawler work for my URLs rather than
>> start checking for other crawlers and waste all the work so far.
>>
>> Just adding again, this is what my nutch-site.xml looks like:
>>
>> <?xml version="1.0"?>
>> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>>
>> <!-- Put site-specific property overrides in this file. -->
>>
>> <configuration>
>>   <property>
>>     <name>http.agent.name</name>
>>     <value>Nutch Crawler</value>
>>   </property>
>>   <property>
>>     <name>http.agent.email</name>
>>     <value>datalake.ng at gmail d</value>
>>   </property>
>>   <property>
>>     <name>db.ignore.internal.links</name>
>>     <value>false</value>
>>   </property>
>>   <property>
>>     <name>db.ignore.external.links</name>
>>     <value>true</value>
>>   </property>
>>   <property>
>>     <name>plugin.includes</name>
>>     <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier</value>
>>   </property>
>>   <property>
>>     <name>parser.skip.truncated</name>
>>     <value>false</value>
>>     <description>Boolean value for whether we should skip parsing for
>>     truncated documents. By default this property is activated due to
>>     extremely high levels of CPU which parsing can sometimes take.
>>     </description>
>>   </property>
>>   <property>
>>     <name>db.max.outlinks.per.page</name>
>>     <value>-1</value>
>>     <description>The maximum number of outlinks that we'll process for a
>>     page. If this value is nonnegative (>=0), at most
>>     db.max.outlinks.per.page outlinks will be processed for a page;
>>     otherwise, all outlinks will be processed.
>>     </description>
>>   </property>
>>   <property>
>>     <name>http.content.limit</name>
>>     <value>-1</value>
>>     <description>The length limit for downloaded content using the http://
>>     protocol, in bytes. If this value is nonnegative (>=0), content longer
>>     than it will be truncated; otherwise, no truncation at all. Do not
>>     confuse this setting with the file.content.limit setting.
>>     </description>
>>   </property>
>>   <property>
>>     <name>db.ignore.external.links.mode</name>
>>     <value>byHost</value>
>>   </property>
>>   <property>
>>     <name>db.injector.overwrite</name>
>>     <value>true</value>
>>   </property>
>>   <property>
>>     <name>http.timeout</name>
>>     <value>50000</value>
>>     <description>The default network timeout, in milliseconds.</description>
>>   </property>
>> </configuration>
>>
>> Regards,
>> Roseline
>>
>> -----Original Message-----
>> From: Sebastian Nagel <wastl.na...@googlemail.com.INVALID>
>> Sent: 13 December 2021 17:35
>> To: user@nutch.apache.org
>> Subject: Re: Nutch not crawling all URLs
>>
>> Hi Roseline,
>>
>>> 5,36405,0,http://www.notco.com/
>>
>> What is the status for https://notco.com/ which is the final redirect
>> target? Is the target page indexed?
>>
>> ~Sebastian
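P.S.: Regarding the earlier question about the redirect status of notco.com:
you can look a URL up directly in the CrawlDb (assuming your crawl data lives
in crawl/crawldb):

  bin/nutch readdb crawl/crawldb -url http://www.notco.com/
  bin/nutch readdb crawl/crawldb -url https://notco.com/

This prints the stored record for the URL, including its fetch status (for
example db_redir_perm for a permanent redirect, db_fetched for a successfully
fetched page), which tells you whether the redirect target was actually
fetched.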