Say there is only one entry point for listing all content on the website:
http://a.com/search?city=YourCity. (Think of it as the site's internal
search engine, of course.)
If I set the "city" value to "NewYork", it will list all content related
to NewYork, spread across many pages. The pagination URLs we expect are
http://a.com/search?city=NewYork&pg=1 for the first page
http://a.com/search?city=NewYork&pg=2 for the second page.
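As a minimal sketch (not Nutch code; a.com and the parameters are just the
hypothetical ones above), this is why the design is crawler-friendly: every
pagination URL carries the full query state, so each page can be fetched
independently, with no cookie or session involved.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class StatelessPaging {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            for (int pg = 1; pg <= 3; pg++) {
                // The city is repeated in every URL, so each request is
                // self-contained and can be made in any order.
                URI uri = URI.create("http://a.com/search?city=NewYork&pg=" + pg);
                HttpRequest req = HttpRequest.newBuilder(uri).GET().build();
                HttpResponse<String> resp =
                        client.send(req, HttpResponse.BodyHandlers.ofString());
                System.out.println(uri + " -> " + resp.statusCode());
            }
        }
    }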

The Nutch crawler is not "trapped" in this case.
True.

But if we change the website design, it will be "trapped".
After the request http://a.com/search?city=NewYork is sent, the website
first stores the query parameters in a cookie/session, then redirects to
another page, say http://a.com/next.html, and serves the response from
there. The pagination URLs change as well: the query parameters are
dropped, since all of them can be retrieved from the cookie/session. The
final pagination URLs will be
http://a.com/search?pg=1 for the first page
http://a.com/search?pg=2 for the second page.
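As a rough illustration only (this is assumed server-side code, not anything
from the real site), the redesigned flow could look like the servlet below:
the first request stores the city in the session and redirects, and the
pagination requests read the city back from the session instead of from the
URL.

    import java.io.IOException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;
    import javax.servlet.http.HttpSession;

    public class SearchServlet extends HttpServlet {
        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws IOException {
            String city = req.getParameter("city");
            if (city != null) {
                // First request: remember the city in the session, then redirect.
                HttpSession session = req.getSession(true);
                session.setAttribute("city", city);
                resp.sendRedirect("/next.html");
                return;
            }
            // Pagination requests (?pg=1, ?pg=2, ...) carry no city parameter;
            // the city is read back from the session instead.
            HttpSession session = req.getSession(false);
            String storedCity = (session == null)
                    ? null : (String) session.getAttribute("city");
            String pg = req.getParameter("pg");
            resp.getWriter().println("Results for " + storedCity + ", page " + pg);
        }
    }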

If the Nutch crawler extracts URLs from the HTML document, it definitely
cannot get the desired content when parsing, right?
Sometimes, what information is available to you is determined by the
decisions of whoever designed the page, right? If the page tries to be
smart and 'determine' what the user wants to see, well, if you don't own
that webpage there isn't much you can do.
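
For what it's worth, here is a sketch (assuming the hypothetical a.com site
above and the default behavior of Java's HttpClient) of what a stateless
fetcher, such as Nutch fetching an extracted link in isolation, would see
when it requests one of those session-dependent pagination URLs: no session
cookie is sent, so the server has no way to know which city was searched.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class StatelessFetch {
        public static void main(String[] args) throws Exception {
            // No CookieHandler is installed, so no session cookie is sent --
            // the same situation as a crawler fetching the extracted URL.
            HttpClient client = HttpClient.newHttpClient();
            URI uri = URI.create("http://a.com/search?pg=1");
            HttpRequest req = HttpRequest.newBuilder(uri).GET().build();
            HttpResponse<String> resp =
                    client.send(req, HttpResponse.BodyHandlers.ofString());
            // Likely outcome: a default/empty result page or another redirect,
            // but not the NewYork results the link was meant to show.
            System.out.println(resp.statusCode());
            System.out.println(resp.body());
        }
    }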

Best regards,
EM
