[
https://issues.apache.org/jira/browse/CONNECTORS-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920712#action_12920712
]
Karl Wright commented on CONNECTORS-118:
----------------------------------------
But the URL scheme you propose will not actually work unless the repository
being crawled is a file system built on Apache VFS, so there is no point in
talking about SharePoint or Web. I can't see what good it does to crawl
something if you cannot locate the actual file when you are done.
The URL that ManifoldCF sends to the index should be a *real* URL, one that you
can click on and that will take you to the document in question. If you just
want a placeholder, fine, then just use anchors to do what you want, e.g.:
http://foo.bar.com/something/archive.gz#my/file/path/in/archive.gz
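To make the anchor idea concrete, here is a minimal Java sketch that appends an archive-internal path to the archive's own URL as a fragment. The class and method names (ArchiveEntryUrl, forEntry) are hypothetical and not part of ManifoldCF; the sketch assumes the archive itself already has a working URL in its repository.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper for illustration only; not a ManifoldCF API.
final class ArchiveEntryUrl {
    private ArchiveEntryUrl() {}

    /**
     * Returns the archive's URL with the entry path appended as a fragment.
     * The part before '#' is still the real, clickable URL of the archive.
     */
    static String forEntry(String archiveUrl, String entryPath) {
        try {
            URI archive = new URI(archiveUrl);
            // The multi-argument constructor quotes characters that are
            // illegal in a fragment (for example, spaces become %20).
            URI withFragment = new URI(
                    archive.getScheme(),
                    archive.getAuthority(),
                    archive.getPath(),
                    archive.getQuery(),
                    entryPath);
            return withFragment.toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Not a valid URL: " + archiveUrl, e);
        }
    }
}
```

Calling ArchiveEntryUrl.forEntry("http://foo.bar.com/something/archive.gz", "my/file/path/in/archive.gz") yields the example URL shown above. Clicking it still only fetches the archive itself; the fragment is just a label for the entry.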
The point here is that the *right* URL depends critically on the kind of
repository you are crawling, because the URL must actually *function* in the
context of that repository. Furthermore, people who put archives into content
management systems are usually rounded up and shot, because that completely
defeats the purpose of such systems. So I would believe you might find
archives on the web, or in a file system, but I'd be hard pressed to believe
you would find them anywhere else.
> Crawled archive files should be expanded into their constituent files
> ---------------------------------------------------------------------
>
> Key: CONNECTORS-118
> URL: https://issues.apache.org/jira/browse/CONNECTORS-118
> Project: ManifoldCF
> Issue Type: New Feature
> Components: Framework crawler agent
> Reporter: Jack Krupansky
>
> Archive files such as zip, mbox, tar, etc. should be expanded into their
> constituent files during crawling of repositories so that any output
> connector would output the flattened archive.
> This could be an option, defaulted to ON, since someone may want to implement
> a "copy" connector that maintains crawled files as-is.