[
https://issues.apache.org/jira/browse/NUTCH-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901059#comment-14901059
]
Asitang Mishra commented on NUTCH-2110:
---------------------------------------
Hi Sebastian,
Yes, using the CrawlDatum is the perfect idea.
This thought came to my mind when we had a use case where the whole site was
AJAX-based. The pagination was AJAX too (the URL wouldn't change when a
pagination link was clicked), so we needed to fetch the whole site in one go.
We thought there must be a way to identify an AJAX-based resource/page, because
the URL alone was insufficient. That is when I thought that the URL plus a
series of Selenium interactions could be used as a unique identifier in such
scenarios.
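As a rough illustration of the "url + interactions" idea, the sketch below
builds one identifier from a URL and an ordered list of interactions. It is
plain Python, not Nutch or Selenium code; `Interaction`, `state_key`, the
example URL and the XPaths are all hypothetical names made up for this
illustration.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Interaction:
    """One recorded browser step (hypothetical structure)."""
    action: str      # e.g. "click" or "type"
    xpath: str       # element the step is applied to
    text: str = ""   # search term for "type" steps, empty otherwise


def state_key(url: str, steps: Tuple[Interaction, ...]) -> str:
    """Combine a URL and an ordered interaction sequence into one key.

    Assumption: the order of the steps matters, so the sequence is
    serialized as a list before hashing.
    """
    payload = json.dumps(
        {"url": url, "steps": [[s.action, s.xpath, s.text] for s in steps]},
        separators=(",", ":"),
    )
    digest = hashlib.sha1(payload.encode("utf-8")).hexdigest()
    return f"{url}#{digest[:12]}"


if __name__ == "__main__":
    url = "http://example.com/list"  # placeholder URL
    page1 = state_key(url, ())
    page2 = state_key(url, (Interaction("click", "//a[@id='next']"),))
    # Same URL, different interaction history, so two distinct keys.
    print(page1)
    print(page2)
```

Two states that share a URL but differ in their interaction history get
different keys, while replaying the same steps always yields the same key.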
This is mostly theoretical right now, because some things still need to be
discussed, such as how the outlinks can be identified for the next fetch (I
have some ideas, though).
And to answer your last questions, imagine this scenario: we have a starting
page called page1 with a bunch of AJAX clicks on it. We click all of them, the
page changes, and we save all of that info as the data of that page. Then we
need to go to the next page, which is still not a different URL but a page
interaction, so we 'somehow' have to save this interaction for the next round.
How do we do that? In the next round we come back to page1 (because there is no
other way to reach page2 except through page1, since page2 does not have a
unique URL). This time we don't go through all the interactions on page1 and we
save no data for it; we only click the pagination link for page2, go to page2,
click around again, and save data for it.
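The two rounds described above can be sketched roughly as follows. This is a
toy model only: `SITE`, `FakeBrowser`, `fetch_round` and `crawl` are
hypothetical names, the "browser" is an in-memory stand-in rather than
Selenium, and nothing here reflects how Nutch stores or schedules fetches.

```python
from typing import Dict, List, Optional, Tuple

Path = Tuple[str, ...]  # ordered XPaths to click before harvesting

# Hypothetical in-memory site: every state lives behind the same URL.
SITE: Dict[str, dict] = {
    "page1": {"items": {"//a[@id='more-1']": "details A"},
              "next": ("//a[@id='next']", "page2")},
    "page2": {"items": {"//a[@id='more-2']": "details B"},
              "next": None},
}


class FakeBrowser:
    """Stand-in for a browser session; loading the URL lands on page1."""

    def __init__(self, site: Dict[str, dict]) -> None:
        self.site = site
        self.state = "page1"

    def item_xpaths(self) -> List[str]:
        return list(self.site[self.state]["items"])

    def next_xpath(self) -> Optional[str]:
        nxt = self.site[self.state]["next"]
        return nxt[0] if nxt else None

    def click(self, xpath: str) -> str:
        page = self.site[self.state]
        if page["next"] and xpath == page["next"][0]:
            self.state = page["next"][1]  # pagination: state changes, URL does not
            return ""
        return page["items"][xpath]       # content click: returns revealed data


def fetch_round(site: Dict[str, dict], url: str, replay: Path):
    """One fetch: replay the saved clicks without saving data, then harvest."""
    browser = FakeBrowser(site)
    for xpath in replay:                  # navigate only
        browser.click(xpath)
    data = [browser.click(x) for x in browser.item_xpaths()]
    nxt = browser.next_xpath()
    # The "outlink" is the same URL plus one more interaction to replay.
    outlinks = [(url, replay + (nxt,))] if nxt else []
    return data, outlinks


def crawl(site: Dict[str, dict], url: str) -> Dict[Path, List[str]]:
    frontier: List[Tuple[str, Path]] = [(url, ())]
    saved: Dict[Path, List[str]] = {}
    while frontier:                       # each iteration is one round
        target_url, replay = frontier.pop(0)
        data, outlinks = fetch_round(site, target_url, replay)
        saved[replay] = data
        frontier.extend(outlinks)
    return saved
```

Here the saved state for the next round is simply the list of clicks needed
to get back to page2, and data is stored under that click path rather than
under the URL alone.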
> Create the capability to provide seeds in the form of "url+xpath(including
> option to enter search terms).selenium"
> ------------------------------------------------------------------------------------------------------------------
>
> Key: NUTCH-2110
> URL: https://issues.apache.org/jira/browse/NUTCH-2110
> Project: Nutch
> Issue Type: Sub-task
> Components: fetcher
> Affects Versions: 1.10
> Reporter: Asitang Mishra
> Labels: memex
>
> Create the capability to provide seeds in the form of "url+xpath(including
> option to enter search terms).selenium" to be used by selenium
> protocols/plugins as urls/flow to reach to a specific ajax based page or save
> the state of a selenium operation for the next fetching round.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)