Hi Kamil, hi Markus,
upgrading the Selenium plugin is much appreciated!
> Besides that, the plugin also needs some overhaul.
Definitely.
> It currently first downloads the URL with HttpClient, and then, depending on
> MIME-type, it may or may not forward the URL to Selenium so it can be
> downloaded again.
This makes some sense if you do not know anything about the URL.
- a HEAD request could do almost the same
- often one knows in advance whether there are only HTML pages or also PDFs,
zip files, and other content not suitable for Selenium, so the HEAD request
could be made optional.
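A rough sketch of what an optional HEAD check could look like. This is not
the plugin's current code: the class name, the method names and the
headCheckEnabled flag are made up for illustration, and it uses the JDK's
java.net.http client (Java 11+) rather than whatever HTTP client the plugin
ends up with.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Locale;

// Hypothetical helper, not part of Nutch: decides whether a URL is worth
// handing to Selenium, based on the Content-Type of a HEAD response.
class HeadCheck {

    // Pure decision on a Content-Type header value, e.g. "text/html; charset=UTF-8".
    // Assumption: only HTML and XHTML are considered suitable for a browser.
    static boolean isSeleniumCandidate(String contentType) {
        if (contentType == null) {
            return true; // header missing: nothing known, let the browser try
        }
        // Drop parameters such as "; charset=..." and normalise case.
        String mime = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return mime.equals("text/html") || mime.equals("application/xhtml+xml");
    }

    // Sends a HEAD request unless the check is disabled (hypothetical flag).
    static boolean shouldUseSelenium(HttpClient client, String url, boolean headCheckEnabled)
            throws IOException, InterruptedException {
        if (!headCheckEnabled) {
            return true; // crawl is known to contain only HTML: skip the extra request
        }
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .method("HEAD", HttpRequest.BodyPublishers.noBody())
                .build();
        HttpResponse<Void> response =
                client.send(request, HttpResponse.BodyHandlers.discarding());
        String contentType = response.headers().firstValue("Content-Type").orElse(null);
        return isSeleniumCandidate(contentType);
    }
}
```

With the flag off, every URL goes straight to Selenium; with it on, PDFs, zip
files and the like are filtered out at the cost of one extra round trip per
URL. Servers that answer HEAD incorrectly would need a fallback, which the
sketch leaves out.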
> merging the lib-selenium plugin with the protocol-selenium plugin
I guess lib-selenium exists to share common components between protocol-selenium
and protocol-interactiveselenium. Maybe merge all three? Or skip
interactiveselenium for now.
~Sebastian
On 1/17/23 19:56, Markus Jelsma wrote:
Hello Kamil,
Yes, the plugin needs some upgrading indeed. We use a modern version of it
elsewhere and it works really well, at least better than HtmlUnit.
Besides that, the plugin also needs some overhaul. It currently first downloads
the URL with HttpClient, and then, depending on MIME-type, it may or may not
forward the URL to Selenium so it can be downloaded again.
There is a lot of code in the plugin that should be removed. I would also opt
for merging the lib-selenium plugin with the protocol-selenium plugin. There is
no obvious need for having it separated.
These can be, of course, separate tasks.
Regards,
Markus
On Tue, 17 Jan 2023 at 17:49, Kamil Mroczek <[email protected]> wrote:
Hello,
I would like to ask whether I should submit a patch that updates Selenium to
the latest version. Although this is a major version upgrade of the library,
very few code changes were needed.
For a preview of the changes I made you can look here
<https://github.com/Elio-Earth/nutch/commit/9960f14bce0f0d6cebc406556a298a7c8c2e6b9f>.
PhantomJS was no longer used in the code (it was commented out), and Selenium has
removed PhantomJS support in the latest version. The commit also removes Opera,
since it was commented out as well, but I can leave that in if needed. The build and
tests pass. I have been using the Chrome driver successfully with it and would just
need to run a quick test with Firefox to make sure it works too.
I have only been using Nutch for about a month but have spent quite a bit of
time looking over different parts of the code to understand how to configure
it and change it.
Kamil