Hi,

Thanks for your quick reply and your explanation.

~Julian

-------- Original Message --------
> Date: Thu, 17 Feb 2011 08:03:04 -0500
> From: Karl Wright <[email protected]>
> To: [email protected]
> Subject: Re: URISnytaxException

> Hi,
> You've done nothing wrong; the stack trace is being dumped because of
> a debugging line that was inadvertently left in the code recently. It
> should not change the way the crawl occurs. Regardless, I've removed
> the offending line from trunk now.
>
> In case you are curious, what is happening is that the page link the
> crawler has located is not properly URI encoded. Space characters are
> illegal in URIs. Normally, the web connector would skip this link
> and note that in the log.
>
> Thanks,
> Karl
>
>
> On Thu, Feb 17, 2011 at 7:27 AM, <[email protected]> wrote:
> > Hi all,
> >
> > I just checked out the newest version of MCF and now I am getting this
> > error while crawling certain pages. What can I do about that?
> >
> > Error Message:
> >
> > java.net.URISyntaxException: Illegal character in path at index 73: /link/to/the/page/alan smithee.xls
> >     at java.net.URI$Parser.fail(URI.java:2809)
> >     at java.net.URI$Parser.checkChars(URI.java:2982)
> >     at java.net.URI$Parser.parseHierarchical(URI.java:3066)
> >     at java.net.URI$Parser.parse(URI.java:3024)
> >     at java.net.URI.<init>(URI.java:578)
> >     at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.makeDocumentIdentifier(WebcrawlerConnector.java:4774)
> >     at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector$ProcessActivityLinkHandler.noteDiscoveredLink(WebcrawlerConnector.java:5586)
> >     at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector$ProcessActivityHTMLHandler.noteAHREF(WebcrawlerConnector.java:5701)
> >     at org.apache.manifoldcf.crawler.connectors.webcrawler.LinkParseState.noteNonscriptTag(LinkParseState.java:44)
> >     at org.apache.manifoldcf.crawler.connectors.webcrawler.FormParseState.noteNonscriptTag(FormParseState.java:48)
> >     at org.apache.manifoldcf.crawler.connectors.webcrawler.ScriptParseState.noteTag(ScriptParseState.java:50)
> >     at org.apache.manifoldcf.crawler.connectors.webcrawler.BasicParseState.dealWithCharacter(BasicParseState.java:223)
> >     at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.handleHTML(WebcrawlerConnector.java:6492)
> >     at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.extractLinks(WebcrawlerConnector.java:5553)
> >     at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.processDocuments(WebcrawlerConnector.java:1132)
> >     at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
> >     at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:585)
> >
> >
> > How I set it up (hope that it helps):
> >
> > installed PostgreSQL 8.3.11-1
> > checked out the project into the MCF folder
> > added jcifs1.2.15.jar at /connectors/jcifs/jcifs and renamed it to jcifs.jar
> > built the project with ant at /mcf
> > copied the content of "dist" to c:/documents and settings/myUserAccount/lcf
> > added the properties.xml and the logging.ini there
> > created a synchronization folder
> > set MCF_HOME to the folder above
> >
> > executed in /processes/scripts these commands:
> >
> > org.apache.manifoldcf.core.DBCreate postgres p0sTgres
> > org.apache.manifoldcf.agents.Install
> > org.apache.manifoldcf.agents.Register org.apache.manifoldcf.crawler.system.CrawlerAgent
> > org.apache.manifoldcf.agents.RegisterOutput org.apache.manifoldcf.agents.output.solr.SolrConnector "SOLR Connector"
> > org.apache.manifoldcf.authorities.RegisterAuthority org.apache.manifoldcf.authorities.authorities.activedirectory.ActiveDirectoryAuthority "Active Directory Authority"
> > org.apache.manifoldcf.crawler.Register org.apache.manifoldcf.crawler.connectors.filesystem.FileConnector "Filesystem Connector"
> > org.apache.manifoldcf.crawler.Register org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnector "Database Connector"
> > org.apache.manifoldcf.crawler.Register org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector "Windows Share Connector"
> > org.apache.manifoldcf.crawler.Register org.apache.manifoldcf.crawler.connectors.rss.RSSConnector "RSS Connector"
> > org.apache.manifoldcf.crawler.Register org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector "Web Connector"
> >
> > and copied the content of /lcf/web/war to my /tomcat/webapps
> >
> > Thanks for your help and best regards,
> > Julian
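
For anyone who hits the same trace: the cause Karl describes above (a link containing a raw space, which is not legal in a URI) can be reproduced outside the crawler with plain java.net.URI. The sketch below is only an illustration and is not ManifoldCF code; the class UriSpaceDemo and the helpers parseOrNull and encodePath are hypothetical names made up for this example. It shows the single-argument URI constructor rejecting the path from the error message, and one possible way to percent-encode such a path with the multi-argument constructor.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Illustration only: UriSpaceDemo, parseOrNull and encodePath are
// hypothetical names and are not part of ManifoldCF.
final class UriSpaceDemo {

    // Parses a raw link strictly. Returns null when the link is not a
    // legal URI, which is roughly the "skip this link" case.
    static URI parseOrNull(String rawLink) {
        try {
            // The single-argument constructor does no quoting, so a raw
            // space triggers URISyntaxException ("Illegal character in path").
            return new URI(rawLink);
        } catch (URISyntaxException e) {
            return null;
        }
    }

    // Builds a URI from an unencoded path. The multi-argument constructor
    // quotes illegal characters, so the space becomes %20.
    // Caveat: it also quotes '%', so an already-encoded path would be
    // encoded a second time.
    static URI encodePath(String rawPath) {
        try {
            return new URI(null, null, rawPath, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot encode path: " + rawPath, e);
        }
    }

    public static void main(String[] args) {
        String raw = "/link/to/the/page/alan smithee.xls";

        // prints: null
        System.out.println(parseOrNull(raw));

        // prints: /link/to/the/page/alan%20smithee.xls
        System.out.println(encodePath(raw).toASCIIString());
    }
}
```

Whether encoding the link or skipping it is the right choice depends on the site being crawled; according to Karl's reply, the web connector normally skips such links and logs them.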
