[jira] [Commented] (CONNECTORS-118) Crawled archive files should be expanded into their constituent files
[ https://issues.apache.org/jira/browse/CONNECTORS-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13015438#comment-13015438 ]

Karl Wright commented on CONNECTORS-118:

I just ran into a convention that (apparently) slf4j uses for archive files:

[jar:file:/opt/ovi/search/servlet/tomcat/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]

I don't know how universal this is but it deserves exploration.

> Crawled archive files should be expanded into their constituent files
>
> Key: CONNECTORS-118
> URL: https://issues.apache.org/jira/browse/CONNECTORS-118
> Project: ManifoldCF
> Issue Type: New Feature
> Components: File system connector, JCIFS connector, Web connector
> Reporter: Jack Krupansky
>
> Archive files such as zip, mbox, tar, etc. should be expanded into their constituent files during crawling of repositories so that any output connector would output the flattened archive.
> This could be an option, defaulted to ON, since someone may want to implement a "copy" connector that maintains crawled files as-is.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
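For what it is worth, the jar: prefix with a "!/" separator is the syntax that the JDK's own java.net.JarURLConnection documents, so it is probably not specific to slf4j. A minimal sketch of splitting such a URL into archive and entry follows; the class name JarUrlParts and its split method are made up for illustration and are not part of ManifoldCF.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.JarURLConnection;
import java.net.URI;

/** Hypothetical helper, for illustration only. */
final class JarUrlParts {
    /**
     * Returns { archive URL, entry name } for a "jar:" URL.
     * Only the URL text is parsed; the archive itself is never opened
     * because connect() is not called.
     */
    static String[] split(String jarUrl) {
        try {
            JarURLConnection conn =
                (JarURLConnection) URI.create(jarUrl).toURL().openConnection();
            return new String[] {
                conn.getJarFileURL().toString(), // the part before "!/"
                conn.getEntryName()              // the part after "!/"
            };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

This only covers jar (zip) archives; nothing here says the same syntax is understood for tar, mbox, or the other formats mentioned in the issue.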
[jira] [Commented] (CONNECTORS-118) Crawled archive files should be expanded into their constituent files
[ https://issues.apache.org/jira/browse/CONNECTORS-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13015019#comment-13015019 ]

Karl Wright commented on CONNECTORS-118:

This ticket is stalled. The driver behind it was being able to support a feature that Aperture has. The way it would need to be done in ManifoldCF is to have individual connectors deal with the feature. Each connector that supports it would know how to generate a specialized URL which referred to the archive contents, and the document identifiers for such connectors would also need to be changed to be able to represent archive contents as well.

The connectors under consideration would be the file system connector, the JCIFS connector, and the Web connector.

> Crawled archive files should be expanded into their constituent files
>
> Key: CONNECTORS-118
> URL: https://issues.apache.org/jira/browse/CONNECTORS-118
> Project: ManifoldCF
> Issue Type: New Feature
> Components: Framework crawler agent
> Reporter: Jack Krupansky
>
> Archive files such as zip, mbox, tar, etc. should be expanded into their constituent files during crawling of repositories so that any output connector would output the flattened archive.
> This could be an option, defaulted to ON, since someone may want to implement a "copy" connector that maintains crawled files as-is.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (CONNECTORS-118) Crawled archive files should be expanded into their constituent files
[ https://issues.apache.org/jira/browse/CONNECTORS-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920827#action_12920827 ]

Karl Wright commented on CONNECTORS-118:

Agreed, file system is quite straightforward, although CIFS may be a bit more challenging depending on whether the archive processing code accepts an InputStream as input. If so, there would be no need to make a secondary copy in either case.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
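As a data point on the InputStream question: for zip (and jar) archives, the JDK's java.util.zip.ZipInputStream does read entries directly from any InputStream, so no secondary copy is needed for that format. A rough sketch, assuming a hypothetical StreamingZipReader class that is not part of ManifoldCF:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical helper, for illustration only. */
final class StreamingZipReader {
    /**
     * Walks a zip archive straight from a stream (for example one obtained
     * from a CIFS or HTTP fetch) and returns entry name -> uncompressed size.
     * Nothing is written to local disk.
     */
    static Map<String, Integer> entrySizes(InputStream in) {
        Map<String, Integer> sizes = new LinkedHashMap<>();
        byte[] buf = new byte[8192];
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                int total = 0;
                int n;
                // read() returns -1 at the end of the current entry
                while ((n = zip.read(buf)) != -1) {
                    total += n;
                }
                sizes.put(entry.getName(), total);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sizes;
    }
}
```

A real connector would hand each entry's bytes to the output connector instead of counting them. Whether the libraries for tar, mbox, and the other formats offer a comparable streaming mode is not something this sketch establishes.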
[jira] Commented: (CONNECTORS-118) Crawled archive files should be expanded into their constituent files
[ https://issues.apache.org/jira/browse/CONNECTORS-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920805#action_12920805 ]

Jack Krupansky commented on CONNECTORS-118:

At least for file system crawls we can depend on modification date to decide whether to re-crawl an archive file, can't we? I wouldn't rate crawling of archive files over the web efficiently too high a priority.
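To make the modification-date idea concrete, here is a small sketch of such a check for a local archive file. The ArchiveRecrawlCheck class and the "mtime:length" version format are assumptions made for this example, not how any existing connector records versions.

```java
import java.io.File;

/** Hypothetical helper, for illustration only. */
final class ArchiveRecrawlCheck {
    /** Builds a version string from modification time and length. */
    static String versionOf(File archive) {
        return archive.lastModified() + ":" + archive.length();
    }

    /** True when the archive no longer matches the version recorded at the last crawl. */
    static boolean needsRecrawl(File archive, String recordedVersion) {
        return !versionOf(archive).equals(recordedVersion);
    }
}
```

If the archive's version is unchanged, none of the files inside it would need to be extracted again.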
[jira] Commented: (CONNECTORS-118) Crawled archive files should be expanded into their constituent files
[ https://issues.apache.org/jira/browse/CONNECTORS-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920801#action_12920801 ]

Jack Krupansky commented on CONNECTORS-118:

One of those VFS links points to all the Java packages used to access the list of archive formats I listed. I have personally written unit tests that generated most of those formats which Aperture then extracted.
[jira] Commented: (CONNECTORS-118) Crawled archive files should be expanded into their constituent files
[ https://issues.apache.org/jira/browse/CONNECTORS-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920800#action_12920800 ]

Karl Wright commented on CONNECTORS-118:

If Aperture is generating URLs that nobody can use, then we have no reason to duplicate that approach. If Aperture generates URLs that actually work, I'd like to know how.

If it turns out that Aperture URLs aren't worth copying, then I think what we want is to use the # symbol to separate the archive part of the URL from the path within the archive. There's some trickiness here because every version of IE there ever was treats special characters in the file IRI differently, unfortunately. MetaCarta wound up needing to rely on JavaScript to properly translate MCF file IRIs in the web client, depending on the browser type. But I'm sure I can figure out a solution to that. The anchor approach will work with all connectors, but it sounds like the web, file system, and CIFS connectors are of the most interest.

There are two independent technical challenges left. First is how to unpack the archive from Java. Unpacking jars and older zips has native support, but I am unaware of any packages that do that for the variety of archive types claimed by Aperture. Perhaps, Jack, you can look at the Aperture code and post exactly how they do all that.

Second, it's not going to be terribly efficient to download and unpack an archive repeatedly to extract its contents one item at a time. I will have to think about some way of transferring the archive to local storage so that it does not need to be repeatedly refetched as it is crawled. The issue would not be caching it, but rather knowing when to discard a cached copy. Maybe we can use the appropriate HTTP headers to figure out if it has changed. Alternatively, we can keep it around for some period of time before discarding it automatically.
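One possible shape for the "when to discard a cached copy" decision is sketched below. It assumes the crawler records the ETag and Last-Modified values seen at fetch time (available in Java through URLConnection.getHeaderField("ETag") and URLConnection.getLastModified(), which returns 0 when the header is absent) and falls back to a fixed maximum age when the server sends neither. The CachedArchiveCopy class is hypothetical.

```java
/** Hypothetical record of a locally cached archive, for illustration only. */
final class CachedArchiveCopy {
    private final String etag;        // ETag seen at fetch time, or null if none
    private final long lastModified;  // Last-Modified in millis, or 0 if unknown
    private final long fetchedAt;     // when the local copy was stored, in millis

    CachedArchiveCopy(String etag, long lastModified, long fetchedAt) {
        this.etag = etag;
        this.lastModified = lastModified;
        this.fetchedAt = fetchedAt;
    }

    /**
     * Decides whether the local copy should be thrown away.
     * Header validators win when both sides have them; otherwise the copy
     * simply expires after maxAgeMillis.
     */
    boolean shouldDiscard(String currentEtag, long currentLastModified,
                          long now, long maxAgeMillis) {
        if (etag != null && currentEtag != null) {
            return !etag.equals(currentEtag);
        }
        if (lastModified > 0 && currentLastModified > 0) {
            return lastModified != currentLastModified;
        }
        return now - fetchedAt > maxAgeMillis;
    }
}
```

The current header values could come from a HEAD request, so the archive body would only be downloaded again when the check says the copy is stale.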
[jira] Commented: (CONNECTORS-118) Crawled archive files should be expanded into their constituent files
[ https://issues.apache.org/jira/browse/CONNECTORS-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920787#action_12920787 ]

Jack Krupansky commented on CONNECTORS-118:

Aperture's approach was just a starting point for discussion for how to form an id for a file in an archive file. As long as the MCF rules are functionally equivalent to the Apache VFS rules, we should be okay. In short, my proposal does not have a requirement for what an id should look like, just a suggestion.
[jira] Commented: (CONNECTORS-118) Crawled archive files should be expanded into their constituent files
[ https://issues.apache.org/jira/browse/CONNECTORS-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920781#action_12920781 ]

Karl Wright commented on CONNECTORS-118:

bq. So, if somebody wants to de-reference one of these pseudo URLS they must:

Ah. So what you are saying is that the person must either be running a custom browser, or must do some kind of URL manipulation before the search results would be presented to the user, or - or what, exactly?

If the url is in fact meant to be real, then it should refer to a custom proxy of some kind that would perform the necessary breakdown. If there is no such service or proxy, those URLs will simply be broken. This represents a major violation of the contract for url generation within ManifoldCF connectors.

If there is no such proxy that you are aware of, then I'd much rather generate a real url, which in its raw form would not send you to anything other than the archive itself, but which has enough information to be interpreted properly, by using the anchor trick I alluded to earlier. If there *is* such a proxy, then that proxy's parameters must be added as part of the repository connection configuration.

The only case in which the solution you suggest is valid is if you are working on a file system where, when you go to your browser, you enter "bz://..." for the url, and it actually does the unpacking for you. That would *not* include CIFS, by the way.

Is this a fair statement of your proposal? Or am I missing something?
[jira] Commented: (CONNECTORS-118) Crawled archive files should be expanded into their constituent files
[ https://issues.apache.org/jira/browse/CONNECTORS-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920730#action_12920730 ]

Jack Krupansky commented on CONNECTORS-118:

Support within the file system connector is obviously the higher priority. Windows shares as well. And FTP/SFTP.
[jira] Commented: (CONNECTORS-118) Crawled archive files should be expanded into their constituent files
[ https://issues.apache.org/jira/browse/CONNECTORS-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920720#action_12920720 ]

Jack Krupansky commented on CONNECTORS-118:

Just to be clear, this subcrawling proposal does not depend on Apache VFS but, as Aperture does, simply borrows the naming convention for representing the id for each file as a pseudo-URL, not a real URL.

So, if somebody wants to de-reference one of these pseudo-URLs they must:

1) Separate the prefix, parent-object-uri, and path from the pseudo-URL.
2) Fetch the file from the parent-object-uri.
3) Use an access library based on the prefix to extract the file at the path from within the fetched archive.
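Step 1 above could look roughly like the following. This is a sketch under the assumption that the id has the form prefix:parent-object-uri!/path; the ArchivePseudoUrl class is invented for this example and is not taken from Aperture or ManifoldCF. Steps 2 and 3 are left as comments because they depend on the repository and on the archive library.

```java
/** Hypothetical holder for the three parts of an archive pseudo-URL. */
final class ArchivePseudoUrl {
    final String prefix;     // e.g. "zip"; selects the access library (step 3)
    final String parentUri;  // where the archive itself is fetched from (step 2)
    final String path;       // location of the file inside the archive

    private ArchivePseudoUrl(String prefix, String parentUri, String path) {
        this.prefix = prefix;
        this.parentUri = parentUri;
        this.path = path;
    }

    /** Step 1: split at the first ':' and the last "!/". */
    static ArchivePseudoUrl parse(String pseudoUrl) {
        int colon = pseudoUrl.indexOf(':');
        int bang = pseudoUrl.lastIndexOf("!/");
        if (colon <= 0 || bang <= colon + 1) {
            throw new IllegalArgumentException("Not an archive pseudo-URL: " + pseudoUrl);
        }
        return new ArchivePseudoUrl(
            pseudoUrl.substring(0, colon),
            pseudoUrl.substring(colon + 1, bang),
            pseudoUrl.substring(bang + 2));
    }

    // Step 2 (not shown): fetch the bytes at parentUri.
    // Step 3 (not shown): open them with the library chosen by prefix
    //                     and extract the entry at path.
}
```

Splitting at the last "!/" means a nested id keeps its inner pseudo-URL as the parent, so the same parse can be applied again to that parent.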
[jira] Commented: (CONNECTORS-118) Crawled archive files should be expanded into their constituent files
[ https://issues.apache.org/jira/browse/CONNECTORS-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920712#action_12920712 ]

Karl Wright commented on CONNECTORS-118:

But the URL scheme you provide will not actually work unless the repository being crawled is a file system built on Apache VFS. So there is no point in talking about SharePoint or Web.

I can't see what good crawling something is, if you cannot locate the actual file when you are done. The URL that ManifoldCF sends to the index should be a *real* url, one that you can click on, that will take you to the document in question. If you just want a placeholder, fine, then just use anchors to do what you want, e.g.:

http://foo.bar.com/something/archive.gz#my/file/path/in/archive.gz

The point here is that the *right* url depends critically on the kind of repository you are crawling, because the url must actually *function* in the context of that repository.

Furthermore, people who put archives into content management systems are usually rounded up and shot, because that completely defeats the purpose of such systems. So I would believe you might find archives on the web, or in a file system, but I'd be hard pressed to believe anywhere else.
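A sketch of how the anchor form might be built and taken apart with java.net.URI follows. The AnchoredArchiveUrl class and its method names are assumptions for illustration; the point is only that the fragment carries the path inside the archive while the rest of the URL still resolves to the archive itself.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical helper, for illustration only. */
final class AnchoredArchiveUrl {
    /** Appends the in-archive path as the fragment; illegal characters are quoted. */
    static URI forEntry(URI archive, String entryPath) {
        try {
            return new URI(archive.getScheme(), archive.getSchemeSpecificPart(), entryPath);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    /** The clickable part: the archive URL without the fragment. */
    static URI archiveOf(URI anchored) {
        try {
            return new URI(anchored.getScheme(), anchored.getSchemeSpecificPart(), null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    /** The decoded path of the file inside the archive. */
    static String entryOf(URI anchored) {
        return anchored.getFragment();
    }
}
```

For example, combining http://foo.bar.com/something/archive.gz with my/file/path/in/archive.gz gives the URL shown in the comment above, and a browser following it would fetch the archive and ignore the fragment.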
[jira] Commented: (CONNECTORS-118) Crawled archive files should be expanded into their constituent files
[ https://issues.apache.org/jira/browse/CONNECTORS-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920711#action_12920711 ]

Jack Krupansky commented on CONNECTORS-118:

Subcrawling is based on the file type (zip, tar, gzip, bzip2, mbox, jar, etc.), not the type of repository that contains it.

I can't speak about all repository types, but subcrawling would apply to web and SharePoint in addition to file system and share crawling. Basically, any repository type that returns files, as opposed to say the JDBC connector which is returning a row of data values rather than a file.
[jira] Commented: (CONNECTORS-118) Crawled archive files should be expanded into their constituent files
[ https://issues.apache.org/jira/browse/CONNECTORS-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920705#action_12920705 ]

Karl Wright commented on CONNECTORS-118:

So this scheme is specific to Apache VFS. What connectors are used to crawl Apache VFS file systems? Just the file system connector, no?
[jira] Commented: (CONNECTORS-118) Crawled archive files should be expanded into their constituent files
[ https://issues.apache.org/jira/browse/CONNECTORS-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920704#action_12920704 ]

Jack Krupansky commented on CONNECTORS-118:

Karl correctly points out that "The key question here is how you describe the component of an archive. There must be a URL to describe it..."

I am basing my request on the subcrawling feature of Aperture, which is basing archive support on Apache Commons VFS.

See: http://sourceforge.net/apps/trac/aperture/wiki/SubCrawlers

Which says:

The uris of the data objects found inside other data objects have a fixed form, consisting of three basic parts:

prefix:parent-object-uri!/path

* prefix - the uri prefix, characteristic for a particular SubCrawler, returned by the SubCrawlerFactory.getUriPrefix() method
* parent-object-uri - the uri of the parent data object, it is obtained from the parentMetadata parameter to the subCrawl() method, by calling RDFContainer.getDescribedUri()
* path - an internal path of the 'child' data object inside the 'parent' data object

This scheme has been inspired by the apache commons VFS project, homepaged under http://commons.apache.org/vfs

See: http://commons.apache.org/vfs/filesystems.html

Which says:

Provides read-only access to the contents of Zip, Jar and Tar files.

URI Format

zip:// arch-file-uri [! absolute-path ]
jar:// arch-file-uri [! absolute-path ]
tar:// arch-file-uri [! absolute-path ]
tgz:// arch-file-uri [! absolute-path ]
tbz2:// arch-file-uri [! absolute-path ]

Where arch-file-uri refers to a file of any supported type, including other zip files. Note: if you would like to use the ! as normal character it must be escaped using %21. tgz and tbz2 are convenience for tar:gz and tar:bz2.

Examples

jar:../lib/classes.jar!/META-INF/manifest.mf
zip:http://somehost/downloads/somefile.zip
jar:zip:outer.zip!/nested.jar!/somedir
jar:zip:outer.zip!/nested.jar!/some%21dir
tar:gz:http://anyhost/dir/mytar.tar.gz!/mytar.tar!/path/in/tar/README.txt
tgz:file://anyhost/dir/mytar.tgz!/somepath/somefile

Provides read-only access to the contents of gzip and bzip2 files.

URI Format

gz:// compressed-file-uri
bz2:// compressed-file-uri

Where compressed-file-uri refers to a file of any supported type. There is no need to add a ! part to the uri if you read the content of the file you always will get the uncompressed version.

Examples

gz:/my/gz/file.gz
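To illustrate the "!" separator and the %21 escape described in the quoted VFS text, here is a small sketch that breaks a layered URI into its pieces without using Commons VFS itself. The LayeredArchiveUri class is made up for this example, and it only undoes the %21 escape, nothing else.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, for illustration only. */
final class LayeredArchiveUri {
    /**
     * Splits at every unescaped '!' and then turns "%21" back into a literal '!'.
     * The first element is the outermost archive reference (with its scheme
     * prefixes); each later element is a path inside the previous layer.
     */
    static List<String> layers(String uri) {
        List<String> out = new ArrayList<>();
        for (String part : uri.split("!", -1)) {
            out.add(part.replace("%21", "!"));
        }
        return out;
    }
}
```

Applied to jar:zip:outer.zip!/nested.jar!/some%21dir this yields three layers: jar:zip:outer.zip, then /nested.jar, then /some!dir.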
[jira] Commented: (CONNECTORS-118) Crawled archive files should be expanded into their constituent files
[ https://issues.apache.org/jira/browse/CONNECTORS-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920609#action_12920609 ]

Karl Wright commented on CONNECTORS-118:

The key question here is how you describe the component of an archive. There must be a URL to describe it, or there is no way the search results are going to mean anything. Since URLs are the connector's job to assemble, this is likely to be connector specific. Also, most connectors will never be dealing with archives.

Can you provide a list of connectors where you believe this is important, and what the URLs to get at the subpieces of the archive look like?