Hi, I'm trying to use HttpPostAuthentication <https://wiki.apache.org/nutch/HttpPostAuthentication> with Nutch, but it does not seem to find the login form on this page: https://urs.earthdata.nasa.gov

*Console output:*

$ bin/nutch parsechecker https://urs.earthdata.nasa.gov
fetching: https://urs.earthdata.nasa.gov
http.proxy.host = null
http.proxy.port = 8080
http.timeout = 12000
http.content.limit = -1
http.agent = AlmohsinNutch........
http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
No form element found with 'id' = login, trying 'name'.
No form element found with 'name' = login
Failed to get protocol output
java.lang.RuntimeException: java.lang.IllegalArgumentException: No form exists: login
    at org.apache.nutch.protocol.httpclient.Http.resolveCredentials(Http.java:470)
    at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:171)
    at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:206)
    at org.apache.nutch.parse.ParserChecker.run(ParserChecker.java:136)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:244)
Caused by: java.lang.IllegalArgumentException: No form exists: login
    at org.apache.nutch.protocol.httpclient.HttpFormAuthentication.getLoginFormParams(HttpFormAuthentication.java:183)
    at org.apache.nutch.protocol.httpclient.HttpFormAuthentication.login(HttpFormAuthentication.java:95)
    at org.apache.nutch.protocol.httpclient.Http.resolveCredentials(Http.java:468)
    ... 5 more
Fetch failed with protocol status: exception(16), lastModified=0: java.lang.RuntimeException: java.lang.IllegalArgumentException: No form exists: login

-------------------------------------------------------

*hadoop log:*

2015-02-17 11:08:05,903 INFO parse.ParserChecker - fetching: https://urs.earthdata.nasa.gov
2015-02-17 11:08:06,121 INFO httpclient.Http - http.proxy.host = null
2015-02-17 11:08:06,121 INFO httpclient.Http - http.proxy.port = 8080
2015-02-17 11:08:06,121 INFO httpclient.Http - http.timeout = 12000
2015-02-17 11:08:06,121 INFO httpclient.Http - http.content.limit = -1
2015-02-17 11:08:06,121 INFO httpclient.Http - http.agent = AlmohsinNutch.......
2015-02-17 11:08:06,121 INFO httpclient.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
2015-02-17 11:08:06,121 INFO httpclient.Http - http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
2015-02-17 11:08:06,966 DEBUG httpclient.HttpFormAuthentication - No form element found with 'id' = login, trying 'name'.
2015-02-17 11:08:06,973 DEBUG httpclient.HttpFormAuthentication - No form element found with 'name' = login
2015-02-17 11:08:06,974 ERROR httpclient.Http - Failed to get protocol output
java.lang.RuntimeException: java.lang.IllegalArgumentException: No form exists: login
    at org.apache.nutch.protocol.httpclient.Http.resolveCredentials(Http.java:470)
    at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:171)
    at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:206)
    at org.apache.nutch.parse.ParserChecker.run(ParserChecker.java:136)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:244)
Caused by: java.lang.IllegalArgumentException: No form exists: login
    at org.apache.nutch.protocol.httpclient.HttpFormAuthentication.getLoginFormParams(HttpFormAuthentication.java:183)
    at org.apache.nutch.protocol.httpclient.HttpFormAuthentication.login(HttpFormAuthentication.java:95)
    at org.apache.nutch.protocol.httpclient.Http.resolveCredentials(Http.java:468)
    ... 5 more

-------------------------------------------------------

*httpclient-auth.xml*

<auth-configuration>
  <credentials authMethod="formAuth"
               loginUrl="https://urs.earthdata.nasa.gov"
               loginFormId="login"
               loginRedirect="true">
    <loginPostData>
      <field name="username" value="almohsin"/>
      <field name="password" value="xxxxxxxxxx"/>
      <field name="response_type" value="code"/>
      <field name="stay_in" value="1"/>
      <field name="commit" value="Log+in"/>
      <!--
      <field name="authenticity_token" value="bGB2Tl3zltcCwAG9m7cYR01XsR94SNOUUFVDnw2AMFU%3D"/>
      <field name="client_id" value="wha_aaB97bUw0vn4E982hw"/>
      -->
      <field name="state" value="%2Fnode"/>
      <field name="redirect_uri" value="https%3A%2F%2Fearthdata.nasa.gov%2Feosdis%2Furs4%2Fcallback"/>
    </loginPostData>
    <additionalPostHeaders>
      <field name="User-Agent" value="Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:35.0) Firefox/35.0" />
    </additionalPostHeaders>
  </credentials>
</auth-configuration>

-------------------------------------------------------

I was also wondering whether *loginUrl* should be set to the URL of the page containing the auth form (https://urs.earthdata.nasa.gov) or to the form action URL where the data is actually posted (https://urs.earthdata.nasa.gov/login). The documentation says (loginUrl - the URL containing the actual <form>), but is that really the case?

I am using the latest Nutch 1.10 trunk version, which includes the NUTCH-827v3 patch <https://issues.apache.org/jira/browse/NUTCH-827>, on the latest OS X Yosemite (10.10.2).

Please let me know if I'm missing something!

Best regards,
Mohammad Al-Mohsin
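
P.S. For reference, the lookup reported in the log (a form matched by 'id' first, then by 'name') can be approximated offline to see which value loginFormId would need. This is only a rough sketch using Python's standard library, not Nutch's actual implementation; the names FormCollector and find_form and the sample HTML are made up for illustration and are not the real Earthdata markup.

```python
from html.parser import HTMLParser


class FormCollector(HTMLParser):
    """Collects the attributes of every <form> element in a document."""

    def __init__(self):
        super().__init__()
        self.forms = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "form":
            self.forms.append(dict(attrs))


def find_form(html, form_id):
    """Return the attributes of the first form whose 'id' equals form_id,
    falling back to 'name'; return None if neither matches."""
    collector = FormCollector()
    collector.feed(html)
    for key in ("id", "name"):
        for form in collector.forms:
            if form.get(key) == form_id:
                return form
    return None


# Hypothetical sample page (assumption), not the actual login page.
sample = (
    '<html><body>'
    '<form action="/login" method="post" name="login_form"></form>'
    '</body></html>'
)
print(find_form(sample, "login"))       # no form has id or name "login"
print(find_form(sample, "login_form"))  # matched through the 'name' attribute
```

Feeding it the saved HTML of the login page and printing collector.forms would show which 'id' or 'name' values are actually present, if any.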

