[ https://issues.apache.org/jira/browse/CONNECTORS-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920704#action_12920704 ]
Jack Krupansky commented on CONNECTORS-118:
-------------------------------------------
Karl correctly points out that "The key question here is how you describe the
component of an archive. There must be a URL to describe it..." My request is
based on the subcrawling feature of Aperture, which in turn bases its archive
support on Apache Commons VFS.
See:
http://sourceforge.net/apps/trac/aperture/wiki/SubCrawlers
Which says:
The uris of the data objects found inside other data objects have a fixed form,
consisting of three basic parts:

    <prefix>:<parent-object-uri>!/<path>

* <prefix> - the uri prefix, characteristic for a particular SubCrawler,
  returned by the SubCrawlerFactory.getUriPrefix() method
* <parent-object-uri> - the uri of the parent data object, it is obtained from
  the parentMetadata parameter to the subCrawl() method, by calling
  RDFContainer.getDescribedUri()
* <path> - an internal path of the 'child' data object inside the 'parent' data
  object

This scheme has been inspired by the apache commons VFS project, homepaged
under http://commons.apache.org/vfs
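
As a rough illustration of the form quoted above (this is not Aperture code),
the following Java sketch composes child URIs of that shape and lists one per
file entry of a zip stream, which is roughly what a flattening crawler would
emit. The class name SubCrawlUris, both method names, and the hard-coded "zip"
prefix are assumptions made for the example; a real SubCrawler would take its
prefix from SubCrawlerFactory.getUriPrefix().

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper for illustration; not part of Aperture or ManifoldCF.
final class SubCrawlUris {
    private SubCrawlUris() {}

    /** Builds <prefix>:<parent-object-uri>!/<path>, dropping one leading '/' from path. */
    static String childUri(String prefix, String parentUri, String path) {
        String relative = path.startsWith("/") ? path.substring(1) : path;
        return prefix + ":" + parentUri + "!/" + relative;
    }

    /** Returns one child URI per non-directory entry found in the zip stream. */
    static List<String> zipChildUris(String parentUri, InputStream zipData) {
        List<String> uris = new ArrayList<>();
        try (ZipInputStream zin = new ZipInputStream(zipData)) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                if (!entry.isDirectory()) {
                    // "zip" is an assumed prefix for this sketch.
                    uris.add(childUri("zip", parentUri, entry.getName()));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return uris;
    }
}
```

The parent URI is passed through unchanged, so the child URI always points
back to the archive it came from.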
See:
http://commons.apache.org/vfs/filesystems.html
Which says:
Provides read-only access to the contents of Zip, Jar and Tar files.

URI Format

    zip:// arch-file-uri [! absolute-path ]
    jar:// arch-file-uri [! absolute-path ]
    tar:// arch-file-uri [! absolute-path ]
    tgz:// arch-file-uri [! absolute-path ]
    tbz2:// arch-file-uri [! absolute-path ]

Where arch-file-uri refers to a file of any supported type, including other zip
files. Note: if you would like to use the ! as normal character it must be
escaped using %21.
tgz and tbz2 are convenience for tar:gz and tar:bz2.

Examples

    jar:../lib/classes.jar!/META-INF/manifest.mf
    zip:http://somehost/downloads/somefile.zip
    jar:zip:outer.zip!/nested.jar!/somedir
    jar:zip:outer.zip!/nested.jar!/some%21dir
    tar:gz:http://anyhost/dir/mytar.tar.gz!/mytar.tar!/path/in/tar/README.txt
    tgz:file://anyhost/dir/mytar.tgz!/somepath/somefile
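
To make the layering concrete, here is a minimal Java sketch that splits such
a URI at each literal ! and restores %21 afterwards. It uses only the JDK, not
Commons VFS itself; the class and method names are hypothetical, and every
percent-escape other than %21 is deliberately left untouched.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not the Commons VFS parser.
final class ArchiveUriParts {
    private ArchiveUriParts() {}

    /**
     * Splits a layered archive URI on the literal '!' separator.
     * Element 0 holds the scheme prefix(es) plus the outermost archive URI;
     * each later element is a path inside the archive named by the previous one.
     * "%21" is turned back into '!' only after splitting, so an escaped
     * '!' never acts as a separator.
     */
    static List<String> split(String uri) {
        List<String> parts = new ArrayList<>();
        for (String piece : uri.split("!", -1)) {
            parts.add(piece.replace("%21", "!"));
        }
        return parts;
    }

    /** Escapes '!' in an internal path so it can be embedded in a layered URI. */
    static String escapePath(String path) {
        return path.replace("!", "%21");
    }
}
```

For jar:zip:outer.zip!/nested.jar!/some%21dir this yields the three parts
jar:zip:outer.zip, /nested.jar and /some!dir.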
----
Provides read-only access to the contents of gzip and bzip2 files.

URI Format

    gz:// compressed-file-uri
    bz2:// compressed-file-uri

Where compressed-file-uri refers to a file of any supported type. There is no
need to add a ! part to the uri if you read the content of the file you always
will get the uncompressed version.

Examples

    gz:/my/gz/file.gz
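
Why no ! part is needed can be seen with the JDK's own gzip classes: a gzip
file carries one compressed stream rather than a directory of entries, so
reading it simply returns the uncompressed content. This is a small sketch on
in-memory data with hypothetical names (GzipContent, compress, uncompress); it
does not use Commons VFS.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper for illustration only.
final class GzipContent {
    private GzipContent() {}

    /** Compresses raw bytes into gzip format. */
    static byte[] compress(byte[] raw) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buffer)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The stream is closed here, so the gzip trailer has been written.
        return buffer.toByteArray();
    }

    /** Reads a gzip byte array and returns its uncompressed content. */
    static byte[] uncompress(byte[] gzipped) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(gzipped))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = gz.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```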
> Crawled archive files should be expanded into their constituent files
> ---------------------------------------------------------------------
>
> Key: CONNECTORS-118
> URL: https://issues.apache.org/jira/browse/CONNECTORS-118
> Project: ManifoldCF
> Issue Type: New Feature
> Components: Framework crawler agent
> Reporter: Jack Krupansky
>
> Archive files such as zip, mbox, tar, etc. should be expanded into their
> constituent files during crawling of repositories so that any output
> connector would output the flattened archive.
> This could be an option, defaulted to ON, since someone may want to implement
> a "copy" connector that maintains crawled files as-is.