I am working with Solr for the first time and have the setup done.
I have now created a core from the command line and want to crawl a
third-party website.
When I try it with individual links, I am able to crawl each page and index it
into the core. This was done using:
java -Ddata=web -Durl=https://solr:8985/solr/solrhelp/update -jar post.jar http://www.example.com

What I intend to do now is give it a single URL and let it crawl the entire
site using the recursive option (-Drecursive).
The website I am pointing to has around 125 pages. I have tried both of the
commands below:
java -Ddata=web -Durl=https://solr:8985/solr/solrhelp/update -Drecursive=yes -jar post.jar http://www.example.com

and

java -Ddata=web -Durl=https://solr:8985/solr/solrhelp/update -Drecursive=2 -jar post.jar http://www.example.com

and I am getting the error below:

POSTed web resource http://www.example.com (depth: 0)
[Fatal Error] :1:1: Content is not allowed in prolog.
Exception in thread "main" java.lang.RuntimeException: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
        at org.apache.solr.util.SimplePostTool$PageFetcher.getLinksFromWebPage(SimplePostTool.java:1252)
        at org.apache.solr.util.SimplePostTool.webCrawl(SimplePostTool.java:616)
        at org.apache.solr.util.SimplePostTool.postWebPages(SimplePostTool.java:563)
        at org.apache.solr.util.SimplePostTool.doWebMode(SimplePostTool.java:365)
        at org.apache.solr.util.SimplePostTool.execute(SimplePostTool.java:187)
        at org.apache.solr.util.SimplePostTool.main(SimplePostTool.java:172)
Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
        at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(Unknown Source)
        at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)
        at org.apache.solr.util.SimplePostTool.makeDom(SimplePostTool.java:1061)
        at org.apache.solr.util.SimplePostTool$PageFetcher.getLinksFromWebPage(SimplePostTool.java:1232)
        ... 5 more
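
Going by the trace, the exception is thrown by the JDK XML parser (javax.xml.parsers.DocumentBuilder) that SimplePostTool.makeDom calls while extracting links from the page. To check my understanding of the message, I put together the small sketch below. It is only an illustration under my own assumptions: the class and method names (PrologCheck, xmlParseError) are hypothetical and not part of Solr, and I am assuming that whatever post.jar passes to the parser behaves like the non-XML string used here.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical helper, not part of Solr: feeds a string to the JDK DOM parser
// and reports where parsing fails, if it fails.
public class PrologCheck {

    // Returns null if the text parses as XML, otherwise "line:column: message".
    static String xmlParseError(String text) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try {
            builder.parse(new InputSource(new StringReader(text)));
            return null;
        } catch (SAXParseException e) {
            return e.getLineNumber() + ":" + e.getColumnNumber() + ": " + e.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        // Well-formed markup: expected to parse without an error.
        System.out.println(xmlParseError("<html><body/></html>"));
        // Text that does not start with markup: expected to be rejected by the parser.
        System.out.println(xmlParseError("plain text, not markup"));
    }
}
```

If this matches what happens inside post.jar, then the response handed to the link extractor may not start with well-formed markup, but I have not been able to confirm that for my site.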



I would be very grateful if anyone could help me solve this issue. I have been
trying to fix it for a couple of days.


Regards,
ShivprasadS

