[ 
https://issues.apache.org/jira/browse/CONNECTORS-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920704#action_12920704
 ] 

Jack Krupansky commented on CONNECTORS-118:
-------------------------------------------

Karl correctly points out that "The key question here is how you describe the 
component of an archive.  There must be a URL to describe it..." My request is 
based on the subcrawling feature of Aperture, which in turn bases its archive 
support on Apache Commons VFS.

See:
http://sourceforge.net/apps/trac/aperture/wiki/SubCrawlers

Which says:

The uris of the data objects found inside other data objects have a fixed form, 
consisting of three basic parts:

        <prefix>:<parent-object-uri>!/<path>

* <prefix> - the uri prefix, characteristic for a particular SubCrawler, 
returned by the SubCrawlerFactory.getUriPrefix() method
* <parent-object-uri> - the uri of the parent data object, it is obtained from 
the parentMetadata parameter to the subCrawl() method, by calling 
RDFContainer.getDescribedUri()
* <path> - an internal path of the 'child' data object inside the 'parent' data 
object

This scheme has been inspired by the apache commons VFS project, homepaged 
under http://commons.apache.org/vfs
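
As a rough illustration of that three-part form, here is a minimal Python 
sketch. The function names make_child_uri and split_child_uri are hypothetical 
and are not part of Aperture or Commons VFS. The sketch assumes the prefix 
contains no ':' and that the last '!/' marks the start of the internal path; 
it does no escaping.

```python
def make_child_uri(prefix, parent_uri, path):
    """Compose <prefix>:<parent-object-uri>!/<path> (hypothetical helper)."""
    return f"{prefix}:{parent_uri}!/{path.lstrip('/')}"


def split_child_uri(uri):
    """Split a child URI back into (prefix, parent_uri, path).

    Assumption: the prefix has no ':' in it, and the last '!/' in the
    URI separates the parent object from the internal path.
    """
    prefix, rest = uri.split(":", 1)
    parent_uri, sep, path = rest.rpartition("!/")
    if not sep:
        raise ValueError("no '!/' separator in " + uri)
    return prefix, parent_uri, path
```

Because the parent part may itself be a child URI, splitting a nested URI 
repeatedly walks back up through the enclosing archives one level at a time.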

See:
http://commons.apache.org/vfs/filesystems.html

Which says:

Provides read-only access to the contents of Zip, Jar and Tar files.

URI Format

zip:// arch-file-uri [! absolute-path ]
jar:// arch-file-uri [! absolute-path ]
tar:// arch-file-uri [! absolute-path ]
tgz:// arch-file-uri [! absolute-path ]
tbz2:// arch-file-uri [! absolute-path ]

Where arch-file-uri refers to a file of any supported type, including other zip 
files. Note: if you would like to use the ! as normal character it must be 
escaped using %21.
tgz and tbz2 are convenience for tar:gz and tar:bz2.

Examples

jar:../lib/classes.jar!/META-INF/manifest.mf
zip:http://somehost/downloads/somefile.zip
jar:zip:outer.zip!/nested.jar!/somedir
jar:zip:outer.zip!/nested.jar!/some%21dir
tar:gz:http://anyhost/dir/mytar.tar.gz!/mytar.tar!/path/in/tar/README.txt
tgz:file://anyhost/dir/mytar.tgz!/somepath/somefile
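
To make the layering in those examples concrete, here is a small Python sketch 
that takes such a URI apart. It is only an illustration, not how Commons VFS 
parses URIs: parse_layered_uri and ARCHIVE_SCHEMES are hypothetical names, the 
scheme list is simply the prefixes quoted above, and unquote() decodes every 
percent-escape, not just %21.

```python
from urllib.parse import unquote

# Prefixes taken from the quoted documentation; the set is an assumption
# made for this sketch, not an exhaustive list of what VFS supports.
ARCHIVE_SCHEMES = {"zip", "jar", "tar", "tgz", "tbz2", "gz", "bz2"}


def parse_layered_uri(uri):
    """Return (schemes, base_uri, inner_paths) for a layered archive URI."""
    schemes = []
    rest = uri
    # Peel off leading archive prefixes such as "jar:zip:".
    while ":" in rest:
        head, tail = rest.split(":", 1)
        if head not in ARCHIVE_SCHEMES:
            break  # reached the underlying URI, e.g. "http://..."
        schemes.append(head)
        rest = tail
    # Each "!/" steps one level deeper; a literal "!" arrives as %21.
    base, *paths = rest.split("!/")
    return schemes, base, [unquote(p) for p in paths]
```

For instance, parse_layered_uri("jar:zip:outer.zip!/nested.jar!/some%21dir") 
gives the schemes jar and zip, the base outer.zip, and the inner paths 
nested.jar and some!dir.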

----

Provides read-only access to the contents of gzip and bzip2 files.

URI Format

gz:// compressed-file-uri
bz2:// compressed-file-uri

Where compressed-file-uri refers to a file of any supported type. There is no 
need to add a ! part to the uri if you read the content of the file you always 
will get the uncompressed version.

Examples

gz:/my/gz/file.gz
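
The point that a gzip file needs no inner path can be seen with Python's 
standard gzip module: the compressed file wraps a single stream, so reading it 
yields the uncompressed content directly. This is a sketch of the idea only; 
read_gz is a hypothetical helper and has nothing to do with the VFS API.

```python
import gzip
import io


def read_gz(raw_bytes):
    """Return the uncompressed content of gzip-compressed bytes."""
    # There is no member name to select: the stream is the whole payload.
    with gzip.GzipFile(fileobj=io.BytesIO(raw_bytes)) as stream:
        return stream.read()
```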


> Crawled archive files should be expanded into their constituent files
> ---------------------------------------------------------------------
>
>                 Key: CONNECTORS-118
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-118
>             Project: ManifoldCF
>          Issue Type: New Feature
>          Components: Framework crawler agent
>            Reporter: Jack Krupansky
>
> Archive files such as zip, mbox, tar, etc. should be expanded into their 
> constituent files during crawling of repositories so that any output 
> connector would output the flattened archive.
> This could be an option, defaulted to ON, since someone may want to implement 
> a "copy" connector that maintains crawled files as-is.
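
A minimal Python sketch of the requested behaviour, for zip files only, might 
look like the following. It is an illustration under stated assumptions and 
not ManifoldCF code: expand_zip and its expand_archives flag are hypothetical 
names, child URIs follow the zip:<parent>!/<path> form quoted above, a '!' in 
a member name is not escaped, and anything zipfile.is_zipfile() accepts 
(including jar or docx files) is treated as an archive.

```python
import io
import zipfile


def expand_zip(parent_uri, data, expand_archives=True):
    """Yield (uri, content) pairs for one crawled document.

    With expand_archives on (the default), a zip file is replaced by its
    constituent files, recursing into nested zips. With it off, the
    document is passed through as-is.
    """
    if not expand_archives or not zipfile.is_zipfile(io.BytesIO(data)):
        yield parent_uri, data
        return
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directories carry no content of their own
            child_uri = f"zip:{parent_uri}!/{info.filename}"
            yield from expand_zip(child_uri, archive.read(info), expand_archives)
```

An output connector fed from this generator would see the flattened archive, 
while a "copy" style connector could pass expand_archives=False to receive the 
original file.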

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
