[
https://issues.apache.org/jira/browse/NUTCH-2062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635365#comment-14635365
]
ASF GitHub Bot commented on NUTCH-2062:
---------------------------------------
GitHub user MJJoyce opened a pull request:
https://github.com/apache/nutch/pull/46
NUTCH-2062 - Interactive Selenium Plugin
- Extend lib-selenium to allow for external interaction with the WebDriver.
- Add Interactive Selenium plugin so users can create a Selenium Handler
that does custom interaction with the page being fetched. Handlers are required
to implement a simple interface and then can be included in crawls by adjusting
the configuration.
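
As a rough illustration of the handler idea described above, here is a minimal sketch of a handler that pages through a result table. It does not use the plugin's real interface: PageDriver is a hypothetical stand-in for Selenium's WebDriver so the example compiles on its own, and PageHandler, shouldProcessUrl, process, PaginatingHandler, FakeDriver and the "/results" URL pattern are all assumed names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class HandlerSketch {

    /** Hypothetical stand-in for Selenium's WebDriver so the sketch compiles on its own. */
    interface PageDriver {
        String bodyText();    // text of the page currently shown
        boolean clickNext();  // move to the next page; false if there is none
    }

    /** Hypothetical handler contract; the plugin's real interface may differ. */
    interface PageHandler {
        boolean shouldProcessUrl(String url);
        String process(PageDriver driver);
    }

    /** Walks a paginated view and joins the text of every page. */
    static class PaginatingHandler implements PageHandler {
        @Override
        public boolean shouldProcessUrl(String url) {
            // Assumed URL pattern, for illustration only.
            return url.contains("/results");
        }

        @Override
        public String process(PageDriver driver) {
            List<String> pages = new ArrayList<>();
            do {
                pages.add(driver.bodyText());
            } while (driver.clickNext());
            return String.join("\n", pages);
        }
    }

    /** In-memory driver used only to exercise the handler. */
    static class FakeDriver implements PageDriver {
        private final List<String> pages;
        private int index = 0;

        FakeDriver(List<String> pages) {
            this.pages = pages;
        }

        @Override
        public String bodyText() {
            return pages.get(index);
        }

        @Override
        public boolean clickNext() {
            if (index + 1 >= pages.size()) {
                return false;
            }
            index++;
            return true;
        }
    }

    public static void main(String[] args) {
        PageHandler handler = new PaginatingHandler();
        String url = "http://example.com/results";
        if (handler.shouldProcessUrl(url)) {
            PageDriver driver = new FakeDriver(Arrays.asList("page 1", "page 2"));
            System.out.println(handler.process(driver));
        }
    }
}
```

In a real crawl the handler would receive the live WebDriver from lib-selenium rather than a fake, and which handlers run would be decided by the crawl configuration.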
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/MJJoyce/nutch NUTCH-2062
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/nutch/pull/46.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #46
----
commit e1be2cf55b06d7e17e83ef74a53587807024adf4
Author: Michael Joyce <[email protected]>
Date: 2015-07-20T16:00:44Z
NUTCH-2062 - lib-selenium interaction extension
- Add ability for lib-selenium to pass off driver handling to caller.
getDriverForPage loads a WebDriver for a given page and returns it to
the caller. getHTMLContent takes a WebDriver and returns the body
content to the caller. These changes will allow a plugin to control
the interaction with the WebDriver to get at the data required for a
particular page.
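
The hand-off described in this commit message might look roughly like the sketch below from the caller's side: obtain the driver, interact with it, extract the content, and release the driver. Only the names getDriverForPage and getHTMLContent come from the commit message. Their signatures here are assumptions, and Driver, fetch and FakeDriver are hypothetical stand-ins so the example runs without Selenium.

```java
import java.util.function.Consumer;

public class DriverHandoffSketch {

    /** Hypothetical stand-in for Selenium's WebDriver. */
    interface Driver {
        void load(String url);
        String bodyHtml();
        void quit();
    }

    /** Loads the page and hands the live driver back to the caller (assumed signature). */
    static Driver getDriverForPage(Driver driver, String url) {
        driver.load(url);
        return driver;
    }

    /** Reads the body content from a driver the caller already holds (assumed signature). */
    static String getHTMLContent(Driver driver) {
        return driver.bodyHtml();
    }

    /** Caller-side flow: load, interact, extract, and always release the driver. */
    static String fetch(Driver fresh, String url, Consumer<Driver> interaction) {
        Driver driver = getDriverForPage(fresh, url);
        try {
            interaction.accept(driver);
            return getHTMLContent(driver);
        } finally {
            driver.quit();
        }
    }

    /** In-memory driver that records what happened to it. */
    static class FakeDriver implements Driver {
        String html = "";
        boolean closed = false;

        @Override
        public void load(String url) {
            html = "<body>" + url + "</body>";
        }

        @Override
        public String bodyHtml() {
            return html;
        }

        @Override
        public void quit() {
            closed = true;
        }
    }
}
```

Because the caller now owns the driver between the two calls, it is also the caller's job to release it, which is why the sketch wraps the interaction in try/finally.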
commit c12eb9ae88d91fd6f9e6dcebd6dc0dd04d12a9ae
Author: Michael Joyce <[email protected]>
Date: 2015-07-20T17:17:49Z
NUTCH-2062 - Add default lib-selenium timeout to config
commit 2df485b1c1a6c5b4df22882f709de4f4c1b6732a
Author: Michael Joyce <[email protected]>
Date: 2015-07-20T17:18:46Z
NUTCH-2062 - Add configurable wait to lib-selenium
- The time Selenium waits for a page to load is now configurable
through the libselenium.page.load.delay parameter in nutch-default.
If the parameter is not set, lib-selenium falls back to a default of
3 seconds.
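
The fallback behaviour described here can be sketched as follows. This is not the lib-selenium code: java.util.Properties stands in for Nutch's configuration object so the example is self-contained, and the class and method names (PageLoadDelay, resolve) are assumptions. Only the parameter name and the 3 second default come from the commit message.

```java
import java.util.Properties;

public class PageLoadDelay {

    // Parameter name taken from the commit message.
    static final String KEY = "libselenium.page.load.delay";

    // Default stated in the commit message, in seconds.
    static final int DEFAULT_SECONDS = 3;

    /** Returns the configured delay in seconds, or the default when the key is absent or blank. */
    static int resolve(Properties conf) {
        String raw = conf.getProperty(KEY);
        if (raw == null || raw.trim().isEmpty()) {
            return DEFAULT_SECONDS;
        }
        return Integer.parseInt(raw.trim());
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println(resolve(conf)); // key not set, so the default applies

        conf.setProperty(KEY, "10");
        System.out.println(resolve(conf)); // explicit setting wins
    }
}
```

A malformed value raises NumberFormatException in this sketch rather than silently falling back, which surfaces configuration typos early.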
commit 8737084752ff8e92c4c4eef668e6ce0ca612f7fb
Author: Michael Joyce <[email protected]>
Date: 2015-07-21T16:16:42Z
Add interactive Selenium plugin
----
> Add Plugin for interacting with Selenium WebDriver
> --------------------------------------------------
>
> Key: NUTCH-2062
> URL: https://issues.apache.org/jira/browse/NUTCH-2062
> Project: Nutch
> Issue Type: Improvement
> Components: plugin
> Affects Versions: 1.10
> Reporter: Michael Joyce
> Fix For: 1.11
>
>
> The protocol-selenium plugin is great for pulling webpages that dynamically
> load content. However, I've run into use cases where I need to actively
> interact with a page in Selenium before it becomes useful. For instance, I
> may need to paginate through a table to get all results that I'm interested
> in. This plugin will handle that use case.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)