Hi Roseline,

> Does it work at all with Chrome?

Yes.

> It seems you need to have some form of GUI to run it?

You need graphics libraries but not necessarily a graphical system. Normally,
you run the browser in headless mode, without a graphical device (monitor)
attached.

> Is there some documentation or tutorial on this?

The README is probably the best documentation:

  src/plugin/protocol-selenium/README.md
  https://github.com/apache/nutch/tree/master/src/plugin/protocol-selenium

After installing Chromium and the Selenium chromedriver, you can test whether
it works by running:

  bin/nutch parsechecker \
    -Dplugin.includes='protocol-selenium|parse-tika' \
    -Dselenium.grid.binary=/path/to/selenium/chromedriver \
    -Dselenium.driver=chrome \
    -Dselenium.enable.headless=true \
    -followRedirects -dumpText URL

Caveat: because browsers are updated frequently, you may need to use a recent
driver version and possibly also upgrade the Selenium dependencies in Nutch.
Let us know if you need help here.
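If the parsechecker test succeeds, you can enable the plugin for the whole
crawl in nutch-site.xml. A minimal sketch only (the selenium.* property names
are the ones used in the command above; keep the rest of your plugin.includes
and adjust the driver path to your setup):

  <!-- replace protocol-http by protocol-selenium -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-selenium|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
  <!-- run a headless Chrome via the chromedriver binary -->
  <property>
    <name>selenium.driver</name>
    <value>chrome</value>
  </property>
  <property>
    <name>selenium.grid.binary</name>
    <value>/path/to/selenium/chromedriver</value>
  </property>
  <property>
    <name>selenium.enable.headless</name>
    <value>true</value>
  </property>

To check whether a page really needs JavaScript, you can run the parsechecker
command above twice, once with -Dplugin.includes='protocol-http|parse-tika'
and once with protocol-selenium, and compare the extracted text.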
> My use case is Text mining and Machine Learning classification. I'm indexing
> into Solr and then transferring the indexed data to MongoDB for further
> processing.

Well, that's not an untypical use case for Nutch. And it's a long pipeline:
fetching, HTML parsing, extracting content fields, indexing. Nutch is able to
perform all steps. But I'd agree that browser-based crawling isn't that easy
to set up with Nutch.
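For the Solr-to-MongoDB transfer you may not even need custom code. A rough
sketch using Solr's select API and mongoimport (the Solr core name "nutch"
and the MongoDB database/collection names are placeholders for your setup):

  curl 'http://localhost:8983/solr/nutch/select?q=*:*&wt=json&rows=10000' \
    | jq '.response.docs' \
    | mongoimport --db crawl --collection pages --jsonArray

For a large index you would page through the results with Solr's cursorMark
parameter instead of requesting one big rows value.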
Best,
Sebastian

On 1/12/22 17:53, Roseline Antai wrote:
> Hi Sebastian,
>
> Thank you. I did enjoy the holiday. Hope you did too.
>
> I have had a look at the protocol-selenium plugin, but it was a bit
> difficult to understand. It appears it only works with Firefox. Does it
> work at all with Chrome? I was also not sure of what values to set for the
> properties. It seems you need to have some form of GUI to run it?
>
> Is there some documentation or tutorial on this? My guess is that some of
> the pages might not be crawling because of JavaScript. I might be wrong,
> but would want to test that.
>
> I think it would be quite good for my use case because I am trying to
> implement broad crawling.
>
> My use case is Text mining and Machine Learning classification. I'm
> indexing into Solr and then transferring the indexed data to MongoDB for
> further processing.
>
> Kind regards,
> Roseline
>
> -----Original Message-----
> From: Sebastian Nagel <wastl.na...@googlemail.com.INVALID>
> Sent: 12 January 2022 16:12
> To: user@nutch.apache.org
> Subject: Re: Nutch not crawling all URLs
>
> Hi Roseline,
>
>> the mail below went to my junk folder and I didn't see it.
>
> No problem. I hope you nevertheless enjoyed the holidays. And sorry for any
> delays, but I want to emphasize that Nutch is a community project and it
> may take a few days until somebody finds the time to respond.
>
>> Could you confirm if you received all the urls I sent?
>
> I've tried a few URLs you sent but not all of them. And figuring out the
> reason why a site isn't crawled may take some time.
>
>> Another question I have about Nutch is if it has problems with
>> crawling javascript pages?
>
> By default Nutch does not execute Javascript.
>
> There is a protocol plugin (protocol-selenium) to fetch pages with a web
> browser between Nutch and the crawled sites. This way Javascript pages can
> be crawled for the price of some overhead in setting up the crawler and
> network traffic to fetch the page dependencies (CSS, Javascript, images).
>
>> I would ideally love to make the crawler work for my URLs rather than
>> start checking for other crawlers and waste all the work so far.
>
> Well, Nutch is for sure a good crawler. But as always: there are many other
> crawlers which might be better adapted to a specific use case.
>
> What's your use case? Indexing into Solr or Elasticsearch?
> Text mining? Archiving content?
>
> Best,
> Sebastian
>
> On 1/12/22 12:13, Roseline Antai wrote:
>> Hi Sebastian,
>>
>> For some reason, the mail below went to my junk folder and I didn't see it.
>>
>> The notco page - https://notco.com/ - was not indexed, no. When I enabled
>> redirects, I was able to get a few pages, but they don't seem valid.
>>
>> Could you confirm if you received all the URLs I sent?
>>
>> Another question I have about Nutch is if it has problems with crawling
>> javascript pages?
>>
>> I would ideally love to make the crawler work for my URLs rather than
>> start checking for other crawlers and waste all the work so far.
>>
>> Just adding again, this is what my nutch-site.xml looks like:
>>
>> <?xml version="1.0"?>
>> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>>
>> <!-- Put site-specific property overrides in this file. -->
>>
>> <configuration>
>>   <property>
>>     <name>http.agent.name</name>
>>     <value>Nutch Crawler</value>
>>   </property>
>>   <property>
>>     <name>http.agent.email</name>
>>     <value>datalake.ng at gmail d</value>
>>   </property>
>>   <property>
>>     <name>db.ignore.internal.links</name>
>>     <value>false</value>
>>   </property>
>>   <property>
>>     <name>db.ignore.external.links</name>
>>     <value>true</value>
>>   </property>
>>   <property>
>>     <name>plugin.includes</name>
>>     <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier</value>
>>   </property>
>>   <property>
>>     <name>parser.skip.truncated</name>
>>     <value>false</value>
>>     <description>Boolean value for whether we should skip parsing for
>>     truncated documents. By default this property is activated due to
>>     extremely high levels of CPU which parsing can sometimes take.
>>     </description>
>>   </property>
>>   <property>
>>     <name>db.max.outlinks.per.page</name>
>>     <value>-1</value>
>>     <description>The maximum number of outlinks that we'll process for a
>>     page. If this value is nonnegative (>=0), at most
>>     db.max.outlinks.per.page outlinks will be processed for a page;
>>     otherwise, all outlinks will be processed.
>>     </description>
>>   </property>
>>   <property>
>>     <name>http.content.limit</name>
>>     <value>-1</value>
>>     <description>The length limit for downloaded content using the http://
>>     protocol, in bytes. If this value is nonnegative (>=0), content longer
>>     than it will be truncated; otherwise, no truncation at all. Do not
>>     confuse this setting with the file.content.limit setting.
>>     </description>
>>   </property>
>>   <property>
>>     <name>db.ignore.external.links.mode</name>
>>     <value>byHost</value>
>>   </property>
>>   <property>
>>     <name>db.injector.overwrite</name>
>>     <value>true</value>
>>   </property>
>>   <property>
>>     <name>http.timeout</name>
>>     <value>50000</value>
>>     <description>The default network timeout, in milliseconds.</description>
>>   </property>
>> </configuration>
>>
>> Regards,
>> Roseline
>>
>> -----Original Message-----
>> From: Sebastian Nagel <wastl.na...@googlemail.com.INVALID>
>> Sent: 13 December 2021 17:35
>> To: user@nutch.apache.org
>> Subject: Re: Nutch not crawling all URLs
>>
>> Hi Roseline,
>>
>>> 5,36405,0,http://www.notco.com/
>>
>> What is the status for https://notco.com/ which is the final redirect
>> target? Is the target page indexed?
>>
>> ~Sebastian
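P.S.: Regarding the earlier question about the redirect status of notco.com:
you can look a URL up directly in the CrawlDb (assuming your crawl data lives
in crawl/crawldb):

  bin/nutch readdb crawl/crawldb -url http://www.notco.com/
  bin/nutch readdb crawl/crawldb -url https://notco.com/

This prints the stored record for the URL, including its fetch status (for
example db_redir_perm for a permanent redirect, db_fetched for a successfully
fetched page), which tells you whether the redirect target was actually
fetched.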