[jira] Commented: (CONNECTORS-118) Crawled archive files should be expanded into their constituent files

Karl Wright (JIRA) Wed, 13 Oct 2010 16:01:00 -0700

    [ https://issues.apache.org/jira/browse/CONNECTORS-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920800#action_12920800 ]


Karl Wright commented on CONNECTORS-118:
----------------------------------------

If Aperture is generating URLs that nobody can use, then we have no reason to 
duplicate that approach.  If Aperture generates URLs that actually work, I'd 
like to know how.

If it turns out that Aperture URLs aren't worth copying, then I think what we 
want is to use the # symbol to separate the archive part of the URL from the 
path within the archive.  There's some trickiness here because, unfortunately, 
every version of IE there ever was treats special characters in the file IRI 
differently.  MetaCarta wound up needing to rely on JavaScript to properly 
translate MCF file IRIs in the web client, depending on the browser type.  But 
I'm sure I can figure out a solution to that.
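
To make the # idea concrete, here is a minimal sketch of composing and 
splitting such a URL with java.net.URI.  The class and method names are 
hypothetical (nothing like this exists in MCF or Aperture as far as I know), 
and it deliberately ignores the browser-specific quirks mentioned above.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper for illustration only; not part of ManifoldCF or Aperture.
final class ArchiveEntryUri {
    private ArchiveEntryUri() {}

    /** Builds "<archive URL>#<path within archive>", quoting illegal characters. */
    static URI compose(String archiveUrl, String entryPath) {
        try {
            URI archive = new URI(archiveUrl);
            // The (scheme, scheme-specific part, fragment) constructor
            // percent-encodes characters that are not legal in a URI,
            // such as spaces in the entry path.
            return new URI(archive.getScheme(),
                           archive.getSchemeSpecificPart(),
                           entryPath);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    /** Returns the archive part of the URI, i.e. everything before the '#'. */
    static URI archivePart(URI entryUri) {
        try {
            return new URI(entryUri.getScheme(),
                           entryUri.getSchemeSpecificPart(),
                           null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    /** Returns the decoded path within the archive, or null if there is none. */
    static String entryPath(URI entryUri) {
        return entryUri.getFragment();
    }
}
```

With this sketch, composing http://example.com/docs/archive.zip with the entry 
path "reports/q3 summary.txt" yields 
http://example.com/docs/archive.zip#reports/q3%20summary.txt, and entryPath() 
hands back the original, decoded path.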

The anchor approach will work with all connectors, but it sounds like the web, 
file system, and CIFS connectors are of the most interest.  There are two 
independent technical challenges left.  The first is how to unpack the archive 
from Java.  Unpacking jars and older zips has native support, but I am unaware 
of any packages that do that for the variety of archive types claimed by 
Aperture.  Perhaps, Jack, you can look at the Aperture code and post exactly 
how they do all that.
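
For the zip/jar case that the JDK does support natively, the unpacking could 
be sketched roughly as follows using java.util.zip.  ZipExpander is a 
hypothetical name, and buffering every entry in memory is an assumption made 
only to keep the example short; it says nothing about how Aperture handles 
the other archive types.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper for illustration only; covers zip/jar streams, nothing else.
final class ZipExpander {
    private ZipExpander() {}

    /** Reads every file entry of a zip stream into a name -> content map. */
    static Map<String, byte[]> expand(InputStream archive) {
        Map<String, byte[]> entries = new LinkedHashMap<>();
        byte[] buffer = new byte[8192];
        try (ZipInputStream zip = new ZipInputStream(archive)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue; // directories carry no content of their own
                }
                ByteArrayOutputStream content = new ByteArrayOutputStream();
                int n;
                // read() returns -1 at the end of the current entry
                while ((n = zip.read(buffer)) != -1) {
                    content.write(buffer, 0, n);
                }
                // Entry names come from the archive and should be treated
                // as untrusted if they are ever used as file system paths.
                entries.put(entry.getName(), content.toByteArray());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return entries;
    }
}
```

Each key of the returned map is a path within the archive, which is the piece 
that would go after the # in the anchor approach.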

Second, it's not going to be terribly efficient to download and unpack an 
archive repeatedly to extract its contents one item at a time.  I will have to 
think about some way of transferring the archive to local storage so that it 
does not need to be refetched repeatedly as it is crawled.  The issue would not 
be caching it, but rather knowing when to discard a cached copy.  Maybe we can 
use the appropriate HTTP headers to figure out whether it has changed.  
Alternatively, we can keep it around for some period of time before 
discarding it automatically.
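
Both discard strategies can be sketched together.  The class below is purely 
hypothetical: it assumes the archive came from an HTTP server that honors 
conditional requests (If-None-Match / If-Modified-Since), and it only uses 
standard java.net calls.  Other repository types would need their own 
change check.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical bookkeeping for one locally cached archive; illustration only.
final class ArchiveCacheEntry {
    private final String etag;             // ETag from the original fetch, or null
    private final long lastModifiedMillis; // Last-Modified from the fetch, or 0
    private final long fetchedAtMillis;    // when the local copy was stored

    ArchiveCacheEntry(String etag, long lastModifiedMillis, long fetchedAtMillis) {
        this.etag = etag;
        this.lastModifiedMillis = lastModifiedMillis;
        this.fetchedAtMillis = fetchedAtMillis;
    }

    /** Time-based discard: true once the copy is older than maxAgeMillis. */
    boolean expired(long nowMillis, long maxAgeMillis) {
        return nowMillis - fetchedAtMillis > maxAgeMillis;
    }

    /**
     * Header-based check: sends a conditional HEAD request and reports
     * whether the server answered 304 Not Modified.  With no validators
     * recorded, a normal server answers 200 and this returns false,
     * so the archive would simply be refetched.
     */
    boolean unchangedOnServer(String archiveUrl) throws IOException {
        HttpURLConnection conn =
            (HttpURLConnection) URI.create(archiveUrl).toURL().openConnection();
        try {
            conn.setRequestMethod("HEAD");
            if (etag != null) {
                conn.setRequestProperty("If-None-Match", etag);
            }
            if (lastModifiedMillis > 0) {
                conn.setIfModifiedSince(lastModifiedMillis);
            }
            return conn.getResponseCode() == HttpURLConnection.HTTP_NOT_MODIFIED;
        } finally {
            conn.disconnect();
        }
    }
}
```

A crawler could consult expired() first, since it is free, and fall back to 
unchangedOnServer() only when the time limit has passed; which order makes 
sense in practice is an open design question.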


> Crawled archive files should be expanded into their constituent files
> ---------------------------------------------------------------------
>
>                 Key: CONNECTORS-118
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-118
>             Project: ManifoldCF
>          Issue Type: New Feature
>          Components: Framework crawler agent
>            Reporter: Jack Krupansky
>
> Archive files such as zip, mbox, tar, etc. should be expanded into their 
> constituent files during crawling of repositories so that any output 
> connector would output the flattened archive.
> This could be an option, defaulted to ON, since someone may want to implement 
> a "copy" connector that maintains crawled files as-is.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
