[ https://issues.apache.org/jira/browse/CONNECTORS-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920712#action_12920712 ]
Karl Wright commented on CONNECTORS-118:
----------------------------------------

But the URL scheme you provide will not actually work unless the repository being crawled is a file system built on Apache VFS. So there is no point in talking about SharePoint or Web.

I can't see what good crawling something is, if you cannot locate the actual file when you are done. The URL that ManifoldCF sends to the index should be a *real* url, one that you can click on, that will take you to the document in question. If you just want a placeholder, fine, then just use anchors to do what you want, e.g.:

http://foo.bar.com/something/archive.gz#my/file/path/in/archive.gz

The point here is that the *right* url depends critically on the kind of repository you are crawling, because the url must actually *function* in the context of that repository.

Furthermore, people who put archives into content management systems are usually rounded up and shot, because that completely defeats the purpose of such systems. So I would believe you might find archives on the web, or in a file system, but I'd be hard pressed to believe anywhere else.

> Crawled archive files should be expanded into their constituent files
> ---------------------------------------------------------------------
>
>                 Key: CONNECTORS-118
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-118
>             Project: ManifoldCF
>          Issue Type: New Feature
>          Components: Framework crawler agent
>            Reporter: Jack Krupansky
>
> Archive files such as zip, mbox, tar, etc. should be expanded into their
> constituent files during crawling of repositories so that any output
> connector would output the flattened archive.
> This could be an option, defaulted to ON, since someone may want to implement
> a "copy" connector that maintains crawled files as-is.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
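
The anchor-based placeholder URL suggested in the comment can be sketched roughly as follows. This is only an illustration in Python, not ManifoldCF code: the helper names `archive_entry_url` and `split_archive_entry_url` are hypothetical, and the sketch assumes the archive itself already has a working URL in its repository. Because a URL fragment is not sent to the server, following such a link retrieves the archive as a whole rather than the entry, which is why it only serves as a placeholder.

```python
from typing import Tuple
from urllib.parse import quote, unquote, urldefrag


def archive_entry_url(archive_url: str, entry_path: str) -> str:
    """Hypothetical helper: append the entry's path inside the archive
    to the archive's own URL as a fragment ("anchor")."""
    # Drop any fragment already present on the archive URL.
    base, _ = urldefrag(archive_url)
    # Keep "/" readable; percent-encode other unsafe characters.
    return base + "#" + quote(entry_path.lstrip("/"), safe="/")


def split_archive_entry_url(url: str) -> Tuple[str, str]:
    """Hypothetical inverse: recover (archive URL, entry path)."""
    base, fragment = urldefrag(url)
    return base, unquote(fragment)


if __name__ == "__main__":
    url = archive_entry_url(
        "http://foo.bar.com/something/archive.gz",
        "my/file/path/in/archive.gz",
    )
    print(url)
    print(split_archive_entry_url(url))
```

The part before the `#` still has to be a URL that functions for the repository being crawled, so a scheme like this would need to be built per connector rather than in one place for all of them.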