Hi,

I'm trying to use HttpPostAuthentication
<https://wiki.apache.org/nutch/HttpPostAuthentication> with Nutch, but it
does not seem to find the login form on this page:
https://urs.earthdata.nasa.gov

*Console output:*

$ bin/nutch parsechecker https://urs.earthdata.nasa.gov
fetching: https://urs.earthdata.nasa.gov
http.proxy.host = null
http.proxy.port = 8080
http.timeout = 12000
http.content.limit = -1
http.agent = AlmohsinNutch........
http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
No form element found with 'id' = login, trying 'name'.
No form element found with 'name' = login
Failed to get protocol output
java.lang.RuntimeException: java.lang.IllegalArgumentException: No form exists: login
at org.apache.nutch.protocol.httpclient.Http.resolveCredentials(Http.java:470)
at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:171)
at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:206)
at org.apache.nutch.parse.ParserChecker.run(ParserChecker.java:136)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:244)
Caused by: java.lang.IllegalArgumentException: No form exists: login
at org.apache.nutch.protocol.httpclient.HttpFormAuthentication.getLoginFormParams(HttpFormAuthentication.java:183)
at org.apache.nutch.protocol.httpclient.HttpFormAuthentication.login(HttpFormAuthentication.java:95)
at org.apache.nutch.protocol.httpclient.Http.resolveCredentials(Http.java:468)
... 5 more
Fetch failed with protocol status: exception(16), lastModified=0: java.lang.RuntimeException: java.lang.IllegalArgumentException: No form exists: login

-------------------------------------------------------

*hadoop log:*

2015-02-17 11:08:05,903 INFO  parse.ParserChecker - fetching: https://urs.earthdata.nasa.gov
2015-02-17 11:08:06,121 INFO  httpclient.Http - http.proxy.host = null
2015-02-17 11:08:06,121 INFO  httpclient.Http - http.proxy.port = 8080
2015-02-17 11:08:06,121 INFO  httpclient.Http - http.timeout = 12000
2015-02-17 11:08:06,121 INFO  httpclient.Http - http.content.limit = -1
2015-02-17 11:08:06,121 INFO  httpclient.Http - http.agent = AlmohsinNutch.......
2015-02-17 11:08:06,121 INFO  httpclient.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
2015-02-17 11:08:06,121 INFO  httpclient.Http - http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
2015-02-17 11:08:06,966 DEBUG httpclient.HttpFormAuthentication - No form element found with 'id' = login, trying 'name'.
2015-02-17 11:08:06,973 DEBUG httpclient.HttpFormAuthentication - No form element found with 'name' = login
2015-02-17 11:08:06,974 ERROR httpclient.Http - Failed to get protocol output
java.lang.RuntimeException: java.lang.IllegalArgumentException: No form exists: login
at org.apache.nutch.protocol.httpclient.Http.resolveCredentials(Http.java:470)
at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:171)
at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:206)
at org.apache.nutch.parse.ParserChecker.run(ParserChecker.java:136)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:244)
Caused by: java.lang.IllegalArgumentException: No form exists: login
at org.apache.nutch.protocol.httpclient.HttpFormAuthentication.getLoginFormParams(HttpFormAuthentication.java:183)
at org.apache.nutch.protocol.httpclient.HttpFormAuthentication.login(HttpFormAuthentication.java:95)
at org.apache.nutch.protocol.httpclient.Http.resolveCredentials(Http.java:468)
... 5 more

-------------------------------------------------------

*httpclient-auth.xml*

<auth-configuration>
   <credentials authMethod="formAuth"
                loginUrl="https://urs.earthdata.nasa.gov"
                loginFormId="login"
                loginRedirect="true">
     <loginPostData>
       <field name="username" value="almohsin"/>
       <field name="password" value="xxxxxxxxxx"/>
       <field name="response_type" value="code"/>
       <field name="stay_in" value="1"/>
       <field name="commit" value="Log+in"/>
       <!-- <field name="authenticity_token"
                   value="bGB2Tl3zltcCwAG9m7cYR01XsR94SNOUUFVDnw2AMFU%3D"/>
            <field name="client_id" value="wha_aaB97bUw0vn4E982hw"/> -->
       <field name="state" value="%2Fnode"/>
       <field name="redirect_uri"
              value="https%3A%2F%2Fearthdata.nasa.gov%2Feosdis%2Furs4%2Fcallback"/>
     </loginPostData>
     <additionalPostHeaders>
       <field name="User-Agent"
              value="Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:35.0) Firefox/35.0" />
     </additionalPostHeaders>
   </credentials>
</auth-configuration>

-------------------------------------------------------


I was also wondering whether *loginUrl* should be set to the URL of the page
containing the auth form (https://urs.earthdata.nasa.gov) or to the form
action URL where the data is actually posted
(https://urs.earthdata.nasa.gov/login). The documentation says "loginUrl -
the URL containing the actual <form>", but is that really the case?
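
One way to check both points (which form identifiers the page exposes, and
where each form posts) is to list the id, name, and action of every <form> in
the fetched HTML. Below is a minimal Python sketch, independent of Nutch.
FormLister, list_forms, find_form, and the SAMPLE markup are hypothetical
names and data made up for illustration, and the id-then-name lookup order is
only inferred from the debug messages in the log above.

```python
from html.parser import HTMLParser


class FormLister(HTMLParser):
    """Collect the id, name, and action attributes of every <form> tag."""

    def __init__(self):
        super().__init__()
        self.forms = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "form":
            a = dict(attrs)
            self.forms.append(
                {"id": a.get("id"), "name": a.get("name"), "action": a.get("action")}
            )


def list_forms(html):
    """Return one dict per <form> found in the given HTML string."""
    parser = FormLister()
    parser.feed(html)
    return parser.forms


def find_form(forms, key):
    """Look for a form whose id equals key, then one whose name equals key.

    Assumption: this mirrors the order suggested by the log messages,
    not the actual Nutch implementation.
    """
    for attr in ("id", "name"):
        for form in forms:
            if form[attr] == key:
                return form
    return None


# Hypothetical markup for illustration only; NOT the real page source.
SAMPLE = (
    '<html><body>'
    '<form id="login_form" action="/login" method="post"></form>'
    '</body></html>'
)

forms = list_forms(SAMPLE)
print(forms)                      # every form with its id, name, and action
print(find_form(forms, "login"))  # None here would correspond to "No form exists: login"
```

To run this against the live page, the HTML could be read with
urllib.request.urlopen(url).read().decode() and passed to list_forms. If no
form with id or name "login" shows up, loginFormId would presumably need to
match whatever identifier the page actually uses.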

I am using the latest Nutch 1.10 trunk version, which includes the NUTCH-827v3
patch <https://issues.apache.org/jira/browse/NUTCH-827>, on the latest OS X
Yosemite (10.10.2).

Please let me know if I'm missing something!


Best regards,
Mohammad Al-Mohsin
