[ 
https://issues.apache.org/jira/browse/NUTCH-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903524#comment-14903524
 ] 

Sebastian Nagel commented on NUTCH-2110:
----------------------------------------

Ok, understood. One point to consider: should all paginated documents be kept 
under the same URL? As a batch crawler, Nutch uses the URL in many places to 
uniquely identify content, metadata, status information, indexed documents, 
etc. Of course, the outlinks generated for page 1 could be modified by adding 
a suffix that makes each URL unique. The suffix would then be removed only 
inside protocol-selenium, just before fetching the right page.
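
The suffix idea above could look roughly like the following. This is only a 
sketch in Python to illustrate the round trip, not Nutch or protocol-selenium 
code: the marker name `_selenium_page` and both helper functions are 
hypothetical, and a real plugin would also have to make sure URL normalizers 
and filters leave the marker intact.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical marker; not an existing Nutch or protocol-selenium parameter.
PAGE_PARAM = "_selenium_page"


def add_page_suffix(url, page):
    """Return a unique URL for the given page of a paginated document."""
    parts = urlsplit(url)
    # Drop any earlier marker so each URL carries exactly one page number.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != PAGE_PARAM]
    query.append((PAGE_PARAM, str(page)))
    return urlunsplit(parts._replace(query=urlencode(query)))


def strip_page_suffix(url):
    """Return (url_to_fetch, page); page is None if no marker is present."""
    parts = urlsplit(url)
    page = None
    query = []
    for k, v in parse_qsl(parts.query, keep_blank_values=True):
        if k == PAGE_PARAM and v.isdigit():
            page = int(v)
        else:
            query.append((k, v))
    return urlunsplit(parts._replace(query=urlencode(query))), page
```

Here the parse step would call something like add_page_suffix() on each 
outlink, so that every page gets its own key in the crawl data, and only the 
protocol plugin would call strip_page_suffix() to recover the URL to open and 
the page to navigate to. Note that the query string is re-encoded on the way, 
so unusual encodings in the original URL may not survive byte for byte.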

> Create the capability to provide seeds in the form of "url+xpath(including 
> option to enter search terms).selenium"
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-2110
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2110
>             Project: Nutch
>          Issue Type: Sub-task
>          Components: fetcher
>    Affects Versions: 1.10
>            Reporter: Asitang Mishra
>              Labels: memex
>
> Create the capability to provide seeds in the form of "url+xpath(including 
> option to enter search terms).selenium" to be used by selenium 
> protocols/plugins as URLs/flow to reach a specific AJAX-based page or save 
> the state of a selenium operation for the next fetching round.
> At least, this should make Nutch capable of distinguishing whether a URL 
> should be opened using the basic http, httpclient or selenium protocols, and 
> provide the selenium protocol with basic authentication capabilities based 
> on the above ideas.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
