[ 
https://issues.apache.org/jira/browse/CONNECTORS-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920712#action_12920712
 ] 

Karl Wright commented on CONNECTORS-118:
----------------------------------------

But the URL scheme you propose will not actually work unless the repository
being crawled is a file system built on Apache VFS, so there is no point in
talking about SharePoint or Web.  I can't see what good crawling something
does if you cannot locate the actual file when you are done.

The URL that ManifoldCF sends to the index should be a *real* URL, one that you
can click on and that will take you to the document in question.  If you just
want a placeholder, fine, then just use anchors to do what you want, e.g.:

http://foo.bar.com/something/archive.gz#my/file/path/in/archive.gz
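
As a rough sketch of the anchor approach (in Java, since that is what
ManifoldCF is written in): the archive's own clickable URL stays intact and the
entry's path inside the archive goes into the fragment.  The class and method
names here are made up for illustration and are not part of any ManifoldCF API;
the only library call assumed is the standard java.net.URI constructor.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not part of ManifoldCF.
final class ArchiveEntryUrl {

    // Builds "<archive URL>#<path of the entry inside the archive>".
    // The archive URL is left as it is, so the link still resolves to
    // the archive; the fragment only identifies the entry within it.
    static String entryUrl(String archiveUrl, String entryPath)
            throws URISyntaxException {
        URI base = new URI(archiveUrl);
        // The multi-argument constructor quotes characters that are not
        // legal in each component (for example, spaces in the fragment).
        URI withFragment = new URI(
                base.getScheme(),
                base.getAuthority(),
                base.getPath(),
                base.getQuery(),
                entryPath);
        return withFragment.toASCIIString();
    }

    public static void main(String[] args) throws URISyntaxException {
        System.out.println(entryUrl(
                "http://foo.bar.com/something/archive.gz",
                "my/file/path/in/archive.gz"));
    }
}
```

Whether a fragment is the right carrier still depends on the repository: for a
web or file system crawl the base URL resolves to the archive itself, which is
the most a click can do without server-side support for archive entries.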

The point here is that the *right* URL depends critically on the kind of
repository you are crawling, because the URL must actually *function* in the
context of that repository.  Furthermore, people who put archives into content
management systems are usually rounded up and shot, because that completely
defeats the purpose of such systems.  So I would believe you might find
archives on the web, or in a file system, but I'd be hard pressed to believe
you would find them anywhere else.



> Crawled archive files should be expanded into their constituent files
> ---------------------------------------------------------------------
>
>                 Key: CONNECTORS-118
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-118
>             Project: ManifoldCF
>          Issue Type: New Feature
>          Components: Framework crawler agent
>            Reporter: Jack Krupansky
>
> Archive files such as zip, mbox, tar, etc. should be expanded into their 
> constituent files during crawling of repositories so that any output 
> connector would output the flattened archive.
> This could be an option, defaulted to ON, since someone may want to implement 
> a "copy" connector that maintains crawled files as-is.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
