[jira] Commented: (CONNECTORS-118) Crawled archive files should be expanded into their constituent files

2010-10-13 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920827#action_12920827
 ] 

Karl Wright commented on CONNECTORS-118:


Agreed, file system is quite straightforward, although CIFS may be a bit more 
challenging depending on whether the archive processing code accepts an 
InputStream as input.  If so, there would be no need to make a secondary copy 
in either case.
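
For illustration, a minimal sketch of what stream-based unpacking could look like if the
archive code does accept an InputStream (the class and method names are hypothetical,
not existing MCF APIs):

import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Illustrative only: walks a zip archive directly from a repository stream,
// so no secondary copy of the archive is needed on local disk.
public class StreamingZipWalker
{
  public static void walk(InputStream repositoryStream) throws IOException
  {
    ZipInputStream zis = new ZipInputStream(repositoryStream);
    try
    {
      ZipEntry entry;
      while ((entry = zis.getNextEntry()) != null)
      {
        if (!entry.isDirectory())
        {
          // zis is now positioned at the entry's uncompressed bytes; a real
          // implementation would hand them to the output connector here.
          System.out.println("Found constituent file: " + entry.getName());
        }
        zis.closeEntry();
      }
    }
    finally
    {
      zis.close();
    }
  }
}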


> Crawled archive files should be expanded into their constituent files
> -
>
> Key: CONNECTORS-118
> URL: https://issues.apache.org/jira/browse/CONNECTORS-118
> Project: ManifoldCF
>  Issue Type: New Feature
>  Components: Framework crawler agent
>Reporter: Jack Krupansky
>
> Archive files such as zip, mbox, tar, etc. should be expanded into their 
> constituent files during crawling of repositories so that any output 
> connector would output the flattened archive.
> This could be an option, defaulted to ON, since someone may want to implement 
> a "copy" connector that maintains crawled files as-is.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Issue Comment Edited: (CONNECTORS-118) Crawled archive files should be expanded into their constituent files

2010-10-13 Thread Jack Krupansky (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920801#action_12920801
 ] 

Jack Krupansky edited comment on CONNECTORS-118 at 10/13/10 7:35 PM:
-

I have personally written unit tests that generated most of those formats which 
Aperture then extracted.

See:
http://sourceforge.net/apps/trac/aperture/wiki/SubCrawlers

org.apache.tools.bzip2 - BZIP2 archives.
java.util.zip.GZIPInputStream - GZIP archives.
javax.mail   - message/rfc822-style messages and mbox files.
org.apache.tools.tar - tar archives.
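
As a rough illustration of how the JDK piece of that list composes with the rest, a
sketch only (the tar/bzip2 layers would come from the Ant packages listed above and are
not shown here):

import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

// Illustrative only: strips the gzip layer from a stream (e.g. foo.tar.gz);
// the inner tar layer would then be handed to org.apache.tools.tar.
public class GzipLayer
{
  public static InputStream uncompress(InputStream compressed) throws IOException
  {
    return new GZIPInputStream(compressed);
  }
}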



  was (Author: jkrupan):
One of those VFS links points to all the Java packages used to access the 
list of archive formats I listed. I have personally written unit tests that 
generated most of those formats which Aperture then extracted.

  
> Crawled archive files should be expanded into their constituent files
> -
>
> Key: CONNECTORS-118
> URL: https://issues.apache.org/jira/browse/CONNECTORS-118
> Project: ManifoldCF
>  Issue Type: New Feature
>  Components: Framework crawler agent
>Reporter: Jack Krupansky
>
> Archive files such as zip, mbox, tar, etc. should be expanded into their 
> constituent files during crawling of repositories so that any output 
> connector would output the flattened archive.
> This could be an option, defaulted to ON, since someone may want to implement 
> a "copy" connector that maintains crawled files as-is.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (CONNECTORS-118) Crawled archive files should be expanded into their constituent files

2010-10-13 Thread Jack Krupansky (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920805#action_12920805
 ] 

Jack Krupansky commented on CONNECTORS-118:
---

At least for file system crawls we can depend on modification date to decide 
whether to re-crawl an archive file, can't we?

I wouldn't rate efficiently crawling archive files over the web as too high a 
priority.
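
A trivial sketch of that modification-date test for the file system case (the stored
timestamp would come from whatever version information the connector keeps; the names
here are hypothetical):

import java.io.File;

// Illustrative only: re-crawl the archive when its on-disk timestamp is newer
// than the timestamp recorded at the previous crawl.
public class ArchiveChangeCheck
{
  public static boolean needsRecrawl(File archive, long lastCrawledMillis)
  {
    return archive.lastModified() > lastCrawledMillis;
  }
}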


> Crawled archive files should be expanded into their constituent files
> -
>
> Key: CONNECTORS-118
> URL: https://issues.apache.org/jira/browse/CONNECTORS-118
> Project: ManifoldCF
>  Issue Type: New Feature
>  Components: Framework crawler agent
>Reporter: Jack Krupansky
>
> Archive files such as zip, mbox, tar, etc. should be expanded into their 
> constituent files during crawling of repositories so that any output 
> connector would output the flattened archive.
> This could be an option, defaulted to ON, since someone may want to implement 
> a "copy" connector that maintains crawled files as-is.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (CONNECTORS-118) Crawled archive files should be expanded into their constituent files

2010-10-13 Thread Jack Krupansky (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920801#action_12920801
 ] 

Jack Krupansky commented on CONNECTORS-118:
---

One of those VFS links points to all the Java packages used to access the list 
of archive formats I listed. I have personally written unit tests that 
generated most of those formats which Aperture then extracted.


> Crawled archive files should be expanded into their constituent files
> -
>
> Key: CONNECTORS-118
> URL: https://issues.apache.org/jira/browse/CONNECTORS-118
> Project: ManifoldCF
>  Issue Type: New Feature
>  Components: Framework crawler agent
>Reporter: Jack Krupansky
>
> Archive files such as zip, mbox, tar, etc. should be expanded into their 
> constituent files during crawling of repositories so that any output 
> connector would output the flattened archive.
> This could be an option, defaulted to ON, since someone may want to implement 
> a "copy" connector that maintains crawled files as-is.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (CONNECTORS-118) Crawled archive files should be expanded into their constituent files

2010-10-13 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920800#action_12920800
 ] 

Karl Wright commented on CONNECTORS-118:


If Aperture is generating URLs that nobody can use, then we have no reason to 
duplicate that approach.  If Aperture generates URLs that actually work, I'd 
like to know how.

If it turns out that Aperture URLs aren't worth copying, then I think what we 
want is to use the # symbol to separate the archive part of the URL from the 
path within the archive.  There's some trickiness here because every version of 
IE there ever was treats special characters in the file IRI differently, 
unfortunately.  MetaCarta wound up needing to rely on JavaScript to properly 
translate MCF file IRIs in the web client, depending on the browser type.  But 
I'm sure I can figure out a solution to that.
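
A minimal sketch of the anchor idea, assuming the in-archive path is percent-encoded to
sidestep the browser-specific special-character issues mentioned above (the helper name
is hypothetical):

import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Illustrative only: builds archive-url#encoded-path-within-archive.
public class AnchorUrl
{
  public static String build(String archiveUrl, String pathInArchive)
    throws UnsupportedEncodingException
  {
    return archiveUrl + "#" + URLEncoder.encode(pathInArchive, "UTF-8");
  }
}

For example, build("http://foo.bar.com/something/archive.gz", "my/file/path") would
yield http://foo.bar.com/something/archive.gz#my%2Ffile%2Fpath.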

The anchor approach will work with all connectors, but it sounds like the web, 
file system, and CIFS connectors are of the most interest.  There are two 
independent technical challenges left.  First is how to unpack the archive from 
Java.  Unpacking jars and older zips has native support, but I am unaware of 
any packages that do that for the variety of archive types claimed by Aperture. 
 Perhaps, Jack, you can look at the Aperture code and post exactly how they do 
all that.

Second, it's not going to be terribly efficient to download and unpack an 
archive repeatedly to extract its contents one item at a time.   I will have to 
think about some way of transferring the archive to local storage so that it 
does not need to be repeatedly refetched as it is crawled.  The issue would not 
be caching it, but rather knowing when to discard a cached copy.  Maybe we can 
use the appropriate http headers for this purpose to figure out if it has 
changed.  Alternatively, we can keep it around for some period of time before 
discarding it automatically.
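
One way the http-header idea could look, using a conditional HEAD request against the
cached copy's Last-Modified time (a sketch only; the actual web connector uses its own
HTTP client rather than java.net):

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Illustrative only: returns true when the server reports the archive has
// changed since the cached copy was fetched (HTTP 304 means "unchanged").
public class ArchiveFreshnessCheck
{
  public static boolean hasChanged(String archiveUrl, long cachedLastModifiedMillis)
    throws IOException
  {
    HttpURLConnection conn = (HttpURLConnection)new URL(archiveUrl).openConnection();
    conn.setRequestMethod("HEAD");
    conn.setIfModifiedSince(cachedLastModifiedMillis);
    try
    {
      return conn.getResponseCode() != HttpURLConnection.HTTP_NOT_MODIFIED;
    }
    finally
    {
      conn.disconnect();
    }
  }
}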


> Crawled archive files should be expanded into their constituent files
> -
>
> Key: CONNECTORS-118
> URL: https://issues.apache.org/jira/browse/CONNECTORS-118
> Project: ManifoldCF
>  Issue Type: New Feature
>  Components: Framework crawler agent
>Reporter: Jack Krupansky
>
> Archive files such as zip, mbox, tar, etc. should be expanded into their 
> constituent files during crawling of repositories so that any output 
> connector would output the flattened archive.
> This could be an option, defaulted to ON, since someone may want to implement 
> a "copy" connector that maintains crawled files as-is.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (CONNECTORS-118) Crawled archive files should be expanded into their constituent files

2010-10-13 Thread Jack Krupansky (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920787#action_12920787
 ] 

Jack Krupansky commented on CONNECTORS-118:
---

Aperture's approach was just a starting point for discussion of how to form an 
id for a file within an archive. As long as the MCF rules are functionally 
equivalent to the Apache VFS rules, we should be okay.

In short, my proposal does not have a requirement for what an id should look 
like, just a suggestion.


> Crawled archive files should be expanded into their constituent files
> -
>
> Key: CONNECTORS-118
> URL: https://issues.apache.org/jira/browse/CONNECTORS-118
> Project: ManifoldCF
>  Issue Type: New Feature
>  Components: Framework crawler agent
>Reporter: Jack Krupansky
>
> Archive files such as zip, mbox, tar, etc. should be expanded into their 
> constituent files during crawling of repositories so that any output 
> connector would output the flattened archive.
> This could be an option, defaulted to ON, since someone may want to implement 
> a "copy" connector that maintains crawled files as-is.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (CONNECTORS-118) Crawled archive files should be expanded into their constituent files

2010-10-13 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920781#action_12920781
 ] 

Karl Wright commented on CONNECTORS-118:


bq. So, if somebody wants to de-reference one of these pseudo URLS they must:

Ah.  So what you are saying is that the person must either be running a custom 
browser, or must do some kind of URL manipulation before the search results 
would be presented to the user, or - or what, exactly?  If the url is in fact 
meant to be real, then it should refer to a custom proxy of some kind that 
would perform the necessary breakdown.  If there is no such service or proxy, 
those URLs will simply be broken.  This represents a major violation of the 
contract for url generation within ManifoldCF connectors.

If there is no such proxy that you are aware of, then I'd much rather generate 
a real url, which in its raw form would not send you to anything other than the 
archive itself, but which has enough information to be interpreted properly, by 
using the anchor trick I alluded to earlier.  If there *is* such a proxy, then 
that proxy's parameters must be added as part of the repository connection 
configuration.  The only case in which the solution you suggest is valid is if 
you are working on a file system where, when you go to your browser, you enter 
"bz://..." for the url, and it actually does the unpacking for you.  That would 
*not* include CIFS, by the way.

Is this a fair statement of your proposal?  Or am I missing something?

> Crawled archive files should be expanded into their constituent files
> -
>
> Key: CONNECTORS-118
> URL: https://issues.apache.org/jira/browse/CONNECTORS-118
> Project: ManifoldCF
>  Issue Type: New Feature
>  Components: Framework crawler agent
>Reporter: Jack Krupansky
>
> Archive files such as zip, mbox, tar, etc. should be expanded into their 
> constituent files during crawling of repositories so that any output 
> connector would output the flattened archive.
> This could be an option, defaulted to ON, since someone may want to implement 
> a "copy" connector that maintains crawled files as-is.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (CONNECTORS-118) Crawled archive files should be expanded into their constituent files

2010-10-13 Thread Jack Krupansky (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920730#action_12920730
 ] 

Jack Krupansky commented on CONNECTORS-118:
---

Support within the file system connector is obviously the higher priority. 
Windows shares as well. And FTP/SFTP.


> Crawled archive files should be expanded into their constituent files
> -
>
> Key: CONNECTORS-118
> URL: https://issues.apache.org/jira/browse/CONNECTORS-118
> Project: ManifoldCF
>  Issue Type: New Feature
>  Components: Framework crawler agent
>Reporter: Jack Krupansky
>
> Archive files such as zip, mbox, tar, etc. should be expanded into their 
> constituent files during crawling of repositories so that any output 
> connector would output the flattened archive.
> This could be an option, defaulted to ON, since someone may want to implement 
> a "copy" connector that maintains crawled files as-is.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (CONNECTORS-118) Crawled archive files should be expanded into their constituent files

2010-10-13 Thread Jack Krupansky (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920720#action_12920720
 ] 

Jack Krupansky commented on CONNECTORS-118:
---

Just to be clear, this subcrawling proposal does not depend on Apache VFS; 
like Aperture, it simply borrows the naming convention for representing the 
id for each file as a pseudo-URL, not a real URL.

So, if somebody wants to dereference one of these pseudo-URLs they must (see the 
sketch after this list):

1) Separate the prefix, parent-object-uri, and path from the pseudo-URL.
2) Fetch the file from the parent-object-uri.
3) Use an access library based on the prefix to extract the file at the path 
from within the fetched archive.
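
A rough sketch of step 1, assuming a pseudo-URL of the VFS-style form
prefix:parent-uri!/path (the class name is hypothetical):

// Illustrative only: splits e.g. "zip:http://host/file.zip!/dir/doc.txt"
// into prefix "zip", parent-uri "http://host/file.zip", path "dir/doc.txt".
public class PseudoUrlParts
{
  public final String prefix;
  public final String parentUri;
  public final String path;

  public PseudoUrlParts(String pseudoUrl)
  {
    int colon = pseudoUrl.indexOf(':');
    int bang = pseudoUrl.lastIndexOf("!/");
    if (colon < 0 || bang < 0 || bang < colon)
      throw new IllegalArgumentException("Not a recognized pseudo-URL: " + pseudoUrl);
    prefix = pseudoUrl.substring(0, colon);
    parentUri = pseudoUrl.substring(colon + 1, bang);
    path = pseudoUrl.substring(bang + 2);
  }
}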


> Crawled archive files should be expanded into their constituent files
> -
>
> Key: CONNECTORS-118
> URL: https://issues.apache.org/jira/browse/CONNECTORS-118
> Project: ManifoldCF
>  Issue Type: New Feature
>  Components: Framework crawler agent
>Reporter: Jack Krupansky
>
> Archive files such as zip, mbox, tar, etc. should be expanded into their 
> constituent files during crawling of repositories so that any output 
> connector would output the flattened archive.
> This could be an option, defaulted to ON, since someone may want to implement 
> a "copy" connector that maintains crawled files as-is.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (CONNECTORS-118) Crawled archive files should be expanded into their constituent files

2010-10-13 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920712#action_12920712
 ] 

Karl Wright commented on CONNECTORS-118:


 But the URL scheme you provide will not actually work unless the repository 
being crawled is a file system built on Apache VFS.  So there is no point in 
talking about SharePoint or Web.  I can't see what good crawling something is, 
if you cannot locate the actual file when you are done.

The URL that ManifoldCF sends to the index should be a *real* url, one that you 
can click on, that will take you to the document in question.  If you just want 
a placeholder, fine, then just use anchors to do what you want, e.g.:

http://foo.bar.com/something/archive.gz#my/file/path/in/archive.gz

The point here is that the *right* url depends critically on the kind of 
repository you are crawling, because the url must actually *function* in the 
context of that repository.  Furthermore, people who put archives into content 
management systems are usually rounded up and shot, because that completely 
defeats the purpose of such systems.  So I would believe you might find 
archives on the web, or in a file system, but I'd be hard pressed to believe 
anywhere else.



> Crawled archive files should be expanded into their constituent files
> -
>
> Key: CONNECTORS-118
> URL: https://issues.apache.org/jira/browse/CONNECTORS-118
> Project: ManifoldCF
>  Issue Type: New Feature
>  Components: Framework crawler agent
>Reporter: Jack Krupansky
>
> Archive files such as zip, mbox, tar, etc. should be expanded into their 
> constituent files during crawling of repositories so that any output 
> connector would output the flattened archive.
> This could be an option, defaulted to ON, since someone may want to implement 
> a "copy" connector that maintains crawled files as-is.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (CONNECTORS-118) Crawled archive files should be expanded into their constituent files

2010-10-13 Thread Jack Krupansky (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920711#action_12920711
 ] 

Jack Krupansky commented on CONNECTORS-118:
---

Subcrawling is based on the file type (zip, tar, gzip, bzip2, mbox, jar, etc.), 
not the type of repository that contains it. I can't speak about all repository 
types, but subcrawling would apply to web and SharePoint in addition to file 
system and share crawling. Basically, it applies to any repository type that 
returns files, as opposed to, say, the JDBC connector, which returns a row of 
data values rather than a file.
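
A sketch of the kind of file-type dispatch that implies, keyed off the document name
rather than the repository type (the set of extensions is illustrative):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative only: decides whether a crawled document should be subcrawled,
// based purely on its file name, independent of the repository connector.
public class ArchiveDetector
{
  private static final Set<String> ARCHIVE_EXTENSIONS = new HashSet<String>(
    Arrays.asList("zip", "jar", "tar", "gz", "tgz", "bz2", "mbox"));

  public static boolean isArchive(String fileName)
  {
    int dot = fileName.lastIndexOf('.');
    if (dot < 0 || dot == fileName.length() - 1)
      return false;
    return ARCHIVE_EXTENSIONS.contains(fileName.substring(dot + 1).toLowerCase());
  }
}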


> Crawled archive files should be expanded into their constituent files
> -
>
> Key: CONNECTORS-118
> URL: https://issues.apache.org/jira/browse/CONNECTORS-118
> Project: ManifoldCF
>  Issue Type: New Feature
>  Components: Framework crawler agent
>Reporter: Jack Krupansky
>
> Archive files such as zip, mbox, tar, etc. should be expanded into their 
> constituent files during crawling of repositories so that any output 
> connector would output the flattened archive.
> This could be an option, defaulted to ON, since someone may want to implement 
> a "copy" connector that maintains crawled files as-is.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (CONNECTORS-118) Crawled archive files should be expanded into their constituent files

2010-10-13 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920705#action_12920705
 ] 

Karl Wright commented on CONNECTORS-118:


So this scheme is specific to Apache VFS.  What connectors are used to crawl 
Apache VFS file systems?  Just the file system connector, no?


> Crawled archive files should be expanded into their constituent files
> -
>
> Key: CONNECTORS-118
> URL: https://issues.apache.org/jira/browse/CONNECTORS-118
> Project: ManifoldCF
>  Issue Type: New Feature
>  Components: Framework crawler agent
>Reporter: Jack Krupansky
>
> Archive files such as zip, mbox, tar, etc. should be expanded into their 
> constituent files during crawling of repositories so that any output 
> connector would output the flattened archive.
> This could be an option, defaulted to ON, since someone may want to implement 
> a "copy" connector that maintains crawled files as-is.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (CONNECTORS-118) Crawled archive files should be expanded into their constituent files

2010-10-13 Thread Jack Krupansky (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920704#action_12920704
 ] 

Jack Krupansky commented on CONNECTORS-118:
---

Karl correctly points out that "The key question here is how you describe the 
component of an archive.  There must be a URL to describe it..." I am basing my 
request on the subcrawling feature of Aperture, which bases its archive support 
on Apache Commons VFS.

See:
http://sourceforge.net/apps/trac/aperture/wiki/SubCrawlers

Which says:

The uris of the data objects found inside other data objects have a fixed form, 
consisting of three basic parts:

<prefix>:<parent-uri>!/<path>

* <prefix> - the uri prefix, characteristic for a particular SubCrawler, 
returned by the SubCrawlerFactory.getUriPrefix() method
* <parent-uri> - the uri of the parent data object; it is obtained from 
the parentMetadata parameter to the subCrawl() method, by calling 
RDFContainer.getDescribedUri()
* <path> - an internal path of the 'child' data object inside the 'parent' 
data object

This scheme has been inspired by the apache commons VFS project, homepaged 
under http://commons.apache.org/vfs

See:
http://commons.apache.org/vfs/filesystems.html

Which says:

Provides read-only access to the contents of Zip, Jar and Tar files.

URI Format

zip:// arch-file-uri [! absolute-path ]
jar:// arch-file-uri [! absolute-path ]
tar:// arch-file-uri [! absolute-path ]
tgz:// arch-file-uri [! absolute-path ]
tbz2:// arch-file-uri [! absolute-path ]

Where arch-file-uri refers to a file of any supported type, including other zip 
files. Note: if you would like to use the ! as normal character it must be 
escaped using %21.
tgz and tbz2 are convenience for tar:gz and tar:bz2.

Examples

jar:../lib/classes.jar!/META-INF/manifest.mf
zip:http://somehost/downloads/somefile.zip
jar:zip:outer.zip!/nested.jar!/somedir
jar:zip:outer.zip!/nested.jar!/some%21dir
tar:gz:http://anyhost/dir/mytar.tar.gz!/mytar.tar!/path/in/tar/README.txt
tgz:file://anyhost/dir/mytar.tgz!/somepath/somefile



Provides read-only access to the contents of gzip and bzip2 files.

URI Format

gz:// compressed-file-uri
bz2:// compressed-file-uri

Where compressed-file-uri refers to a file of any supported type. There is no 
need to add a ! part to the uri; if you read the content of the file you will 
always get the uncompressed version.

Examples

gz:/my/gz/file.gz
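
For reference, reading a nested entry through Commons VFS itself looks roughly like
this (a sketch based on the VFS 1.0 API; the URL is made up):

import org.apache.commons.vfs.FileObject;
import org.apache.commons.vfs.FileSystemException;
import org.apache.commons.vfs.FileSystemManager;
import org.apache.commons.vfs.VFS;

// Illustrative only: resolves a layered VFS uri and lists the entries inside it.
public class VfsExample
{
  public static void main(String[] args) throws FileSystemException
  {
    FileSystemManager fsManager = VFS.getManager();
    FileObject archive =
      fsManager.resolveFile("tgz:http://anyhost/dir/mytar.tgz!/");
    for (FileObject child : archive.getChildren())
      System.out.println(child.getName().getBaseName());
  }
}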


> Crawled archive files should be expanded into their constituent files
> -
>
> Key: CONNECTORS-118
> URL: https://issues.apache.org/jira/browse/CONNECTORS-118
> Project: ManifoldCF
>  Issue Type: New Feature
>  Components: Framework crawler agent
>Reporter: Jack Krupansky
>
> Archive files such as zip, mbox, tar, etc. should be expanded into their 
> constituent files during crawling of repositories so that any output 
> connector would output the flattened archive.
> This could be an option, defaulted to ON, since someone may want to implement 
> a "copy" connector that maintains crawled files as-is.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (CONNECTORS-118) Crawled archive files should be expanded into their constituent files

2010-10-13 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920609#action_12920609
 ] 

Karl Wright commented on CONNECTORS-118:


The key question here is how you describe the component of an archive.  There 
must be a URL to describe it, or there is no way the search results are going 
to mean anything.

Since URLs are the connector's job to assemble, this is likely to be 
connector-specific.  Also, most connectors will never be dealing with archives.  Can 
you provide a list of connectors where you believe this is important, and what the 
URLs to get at the subpieces of the archive look like?


> Crawled archive files should be expanded into their constituent files
> -
>
> Key: CONNECTORS-118
> URL: https://issues.apache.org/jira/browse/CONNECTORS-118
> Project: ManifoldCF
>  Issue Type: New Feature
>  Components: Framework crawler agent
>Reporter: Jack Krupansky
>
> Archive files such as zip, mbox, tar, etc. should be expanded into their 
> constituent files during crawling of repositories so that any output 
> connector would output the flattened archive.
> This could be an option, defaulted to ON, since someone may want to implement 
> a "copy" connector that maintains crawled files as-is.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (CONNECTORS-118) Crawled archive files should be expanded into their constituent files

2010-10-13 Thread Jack Krupansky (JIRA)
Crawled archive files should be expanded into their constituent files
-

 Key: CONNECTORS-118
 URL: https://issues.apache.org/jira/browse/CONNECTORS-118
 Project: ManifoldCF
  Issue Type: New Feature
  Components: Framework crawler agent
Reporter: Jack Krupansky


Archive files such as zip, mbox, tar, etc. should be expanded into their 
constituent files during crawling of repositories so that any output connector 
would output the flattened archive.

This could be an option, defaulted to ON, since someone may want to implement a 
"copy" connector that maintains crawled files as-is.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (CONNECTORS-116) Possibly remove memex connector depending upon legal resolution

2010-10-13 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920578#action_12920578
 ] 

Karl Wright commented on CONNECTORS-116:


It's still not clear what the nature of the issue is that Memex has with this 
connector.  If it is truly an IP issue, then no amount of clean-room 
implementation will help here.  They have been taking the approach that the 
code was not licensed for this use, but also giving IP as the reason for their 
concern.


> Possibly remove memex connector depending upon legal resolution
> ---
>
> Key: CONNECTORS-116
> URL: https://issues.apache.org/jira/browse/CONNECTORS-116
> Project: ManifoldCF
>  Issue Type: Task
>  Components: Memex connector
>Reporter: Robert Muir
>Assignee: Robert Muir
>
> Apparently there is an IP problem with the memex connector code.
> Depending upon what apache legal says, we will take any action under this 
> issue publicly.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (CONNECTORS-116) Possibly remove memex connector depending upon legal resolution

2010-10-13 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920573#action_12920573
 ] 

Karl Wright commented on CONNECTORS-116:


This is not about client libraries (which we do not include).  This is about 
the connector code itself.  And, yes, only Memex has this problem, because all 
other connectors that were developed with the help of contractors used 
third-party contractors who used the typical arrangement that the code they 
developed belonged to MetaCarta.  Memex was the only connector developed using 
the professional services of the target repository company.


> Possibly remove memex connector depending upon legal resolution
> ---
>
> Key: CONNECTORS-116
> URL: https://issues.apache.org/jira/browse/CONNECTORS-116
> Project: ManifoldCF
>  Issue Type: Task
>  Components: Memex connector
>Reporter: Robert Muir
>Assignee: Robert Muir
>
> Apparently there is an IP problem with the memex connector code.
> Depending upon what apache legal says, we will take any action under this 
> issue publicly.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (CONNECTORS-116) Possibly remove memex connector depending upon legal resolution

2010-10-13 Thread Jack Krupansky (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920568#action_12920568
 ] 

Jack Krupansky commented on CONNECTORS-116:
---

It would be nice to see a comment about what would be required to add Memex 
support back.

I note the following statement in the original incubation submission:

"It is unlikely that EMC, OpenText, Memex, or IBM would grant 
Apache-license-compatible use of these client libraries. Thus, the expectation 
is that users of these connectors obtain the necessary client libraries from 
the owners prior to building or using the corresponding connector. An 
alternative would be to undertake a clean-room implementation of the client 
API's, which may well yield suitable results in some cases (LiveLink, Memex, 
FileNet), while being out of reach in others (Documentum). Conditional 
compilation, for the short term, is thus likely to be a necessity."

Is it only the Memex connector that now has this problem?

Do we need to do a clean-room implementation for Memex? For any of the others?

FWIW, I don't see a Google Connector for Memex.


> Possibly remove memex connector depending upon legal resolution
> ---
>
> Key: CONNECTORS-116
> URL: https://issues.apache.org/jira/browse/CONNECTORS-116
> Project: ManifoldCF
>  Issue Type: Task
>  Components: Memex connector
>Reporter: Robert Muir
>Assignee: Robert Muir
>
> Apparently there is an IP problem with the memex connector code.
> Depending upon what apache legal says, we will take any action under this 
> issue publicly.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (CONNECTORS-116) Possibly remove memex connector depending upon legal resolution

2010-10-13 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920542#action_12920542
 ] 

Mark Miller commented on CONNECTORS-116:


Indeed - my impression is that we are all happy to see this code be pulled if 
that's what the original contributors want (or what they are legally bound to 
want) - we just think that process should be public before the code is silently 
taken out back and shot ;)

> Possibly remove memex connector depending upon legal resolution
> ---
>
> Key: CONNECTORS-116
> URL: https://issues.apache.org/jira/browse/CONNECTORS-116
> Project: ManifoldCF
>  Issue Type: Task
>  Components: Memex connector
>Reporter: Robert Muir
>Assignee: Robert Muir
>
> Apparently there is an IP problem with the memex connector code.
> Depending upon what apache legal says, we will take any action under this 
> issue publicly.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (CONNECTORS-117) Database-specific maintenance activities such as reindexing should have their frequency be under the control of the database driver

2010-10-13 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920489#action_12920489
 ] 

Karl Wright commented on CONNECTORS-117:


There are two kinds of activities that matter: ANALYZE and REINDEX, each on a 
respective table.  The tracking of the number of modifies, inserts, and deletes 
should remain the responsibility of the table manager itself, but a 
notification method for each activity should be implemented.  The database 
driver (only PostgreSQL, so far) will then need to keep track of per-table data 
statically, and make appropriate reindexing decisions.  We can also readily add 
VACUUM FULL maintenance code under this same scheme.

I actually recommend using shared data (as defined within ILockManager) for 
this purpose.  Cross-process statistics can then be tracked, and indexing 
requests can be coordinated.  Eventually this stuff will be in ZooKeeper, so 
performance will be good.  In the interim, we can commit changes to counts 
lazily (every 100 actions or so) to reduce the overhead.
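
A minimal sketch of that lazy-commit idea (the commit target and threshold are
placeholders; the real mechanism would be the ILockManager shared data mentioned
above):

import java.util.concurrent.atomic.AtomicLong;

// Illustrative only: accumulates per-table modification counts in memory and
// flushes them to shared storage every 100 actions to reduce overhead.
public class LazyTableStats
{
  private static final long COMMIT_INTERVAL = 100L;
  private final AtomicLong uncommitted = new AtomicLong(0L);

  public void noteModification()
  {
    if (uncommitted.incrementAndGet() >= COMMIT_INTERVAL)
    {
      long toFlush = uncommitted.getAndSet(0L);
      commitToSharedData(toFlush);
    }
  }

  private void commitToSharedData(long count)
  {
    // Placeholder: a real implementation would write through ILockManager here.
    System.out.println("Flushing " + count + " tracked modifications");
  }
}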

Modification tracking data will not be lost or reset if the agents process is 
restarted in a multi-process system.  This is new.  In a single-process system, 
it WILL be lost upon restart, which is as it always has been.  FWIW.


> Database-specific maintenance activities such as reindexing should have their 
> frequency be under the control of the database driver
> ---
>
> Key: CONNECTORS-117
> URL: https://issues.apache.org/jira/browse/CONNECTORS-117
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: Framework core
>Reporter: Karl Wright
>
> Not all databases will require maintenance activity at the same frequency, 
> and different versions of the same database may also differ in this way.  Two 
> changes should thus be made: (1) Move the database maintenance frequency to 
> be under the control of the database implementation, and (2) Where 
> appropriate, introduce properties.xml properties for each database where this 
> is important.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (CONNECTORS-116) Possibly remove memex connector depending upon legal resolution

2010-10-13 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920488#action_12920488
 ] 

Jukka Zitting commented on CONNECTORS-116:
--

From a somewhat related legal-discuss@ thread: "We're not in the business of 
incorporating code from non-voluntary contributions." -- Justin Erenkrantz (see 
http://markmail.org/message/x5nlmpumncg66zz6)

> Possibly remove memex connector depending upon legal resolution
> ---
>
> Key: CONNECTORS-116
> URL: https://issues.apache.org/jira/browse/CONNECTORS-116
> Project: ManifoldCF
>  Issue Type: Task
>  Components: Memex connector
>Reporter: Robert Muir
>Assignee: Robert Muir
>
> Apparently there is an IP problem with the memex connector code.
> Depending upon what apache legal says, we will take any action under this 
> issue publicly.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (CONNECTORS-117) Database-specific maintenance activities such as reindexing should have their frequency be under the control of the database driver

2010-10-13 Thread Karl Wright (JIRA)
Database-specific maintenance activities such as reindexing should have their 
frequency be under the control of the database driver
---

 Key: CONNECTORS-117
 URL: https://issues.apache.org/jira/browse/CONNECTORS-117
 Project: ManifoldCF
  Issue Type: Improvement
  Components: Framework core
Reporter: Karl Wright


Not all databases will require maintenance activity at the same frequency, and 
different versions of the same database may also differ in this way.  Two 
changes should thus be made: (1) Move the database maintenance frequency to be 
under the control of the database implementation, and (2) Where appropriate, 
introduce properties.xml properties for each database where this is important.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (CONNECTORS-115) Restarting the example fails when db present

2010-10-13 Thread Karl Wright (JIRA)

 [ 
https://issues.apache.org/jira/browse/CONNECTORS-115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-115.


   Resolution: Fixed
Fix Version/s: LCF Release 0.5
 Assignee: Karl Wright

I found the fix that corresponds to this report.


r1006085 | kwright | 2010-10-08 20:43:45 -0400 (Fri, 08 Oct 2010) | 1 line

Fix problem with postgreSQL implementation which causes second run of DBCreate 
to fail.
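
The shape of the guard that fix implies, as a sketch only (not the actual patch):
check pg_database before issuing CREATE DATABASE.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Illustrative only: create the database only if PostgreSQL doesn't already have it.
public class CreateDatabaseIfAbsent
{
  public static void create(Connection adminConnection, String dbName)
    throws SQLException
  {
    PreparedStatement check = adminConnection.prepareStatement(
      "SELECT 1 FROM pg_database WHERE datname = ?");
    try
    {
      check.setString(1, dbName);
      ResultSet rs = check.executeQuery();
      boolean exists = rs.next();
      rs.close();
      if (!exists)
      {
        Statement create = adminConnection.createStatement();
        try
        {
          // Database names cannot be bound as parameters in CREATE DATABASE.
          create.executeUpdate("CREATE DATABASE \"" + dbName + "\"");
        }
        finally
        {
          create.close();
        }
      }
    }
    finally
    {
      check.close();
    }
  }
}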



> Restarting the example fails when db present
> 
>
> Key: CONNECTORS-115
> URL: https://issues.apache.org/jira/browse/CONNECTORS-115
> Project: ManifoldCF
>  Issue Type: Bug
> Environment: Windows XP, Example running with PostgreSQL instead of 
> embedded derby.  Use defaults for dbname, user, and password.
>Reporter: Farzad
>Assignee: Karl Wright
> Fix For: LCF Release 0.5
>
>
> When you restart the example you get the following:
> C:\Program Files\Apache\apache-acf\example>java -jar start.jar
> Configuration file successfully read
> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database 
> exception: Exception doing query: ERROR: database "dbname" already exists
> at 
> org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:421)
> at 
> org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:465)
> at 
> org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1072)
> at 
> org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:144)
> at 
> org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:167)
> at 
> org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.createUserAndDatabase(DBInterfacePostgreSQL.java:508)
> at 
> org.apache.manifoldcf.core.system.ManifoldCF.createSystemDatabase(ManifoldCF.java:638)
> at 
> org.apache.manifoldcf.jettyrunner.ManifoldCFJettyRunner.main(ManifoldCFJettyRunner.java:202)
> Caused by: org.postgresql.util.PSQLException: ERROR: database "dbname" 
> already exists
> at 
> org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:1548)
> at 
> org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1316)
> at 
> org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:191)
> at 
> org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:452)
> at 
> org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:337)
> at 
> org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:329)
> at 
> org.apache.manifoldcf.core.database.Database.execute(Database.java:526)
> at 
> org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.run(Database.java:381)
> C:\Program Files\Apache\apache-acf\example>
> The only way to get it started is dropping the table it created the first 
> time, in this case "dbname".

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (CONNECTORS-116) Possibly remove memex connector depending upon legal resolution

2010-10-13 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920482#action_12920482
 ] 

Karl Wright commented on CONNECTORS-116:


I just want to point out that Apache Legal is not the only stakeholder here.  
The community may also choose to take action regardless of the results of 
Apache Legal's review.  Obviously any such action should be done in accordance 
with procedures laid out by Apache Legal, however.


> Possibly remove memex connector depending upon legal resolution
> ---
>
> Key: CONNECTORS-116
> URL: https://issues.apache.org/jira/browse/CONNECTORS-116
> Project: ManifoldCF
>  Issue Type: Task
>  Components: Memex connector
>Reporter: Robert Muir
>Assignee: Robert Muir
>
> Apparently there is an IP problem with the memex connector code.
> Depending upon what apache legal says, we will take any action under this 
> issue publicly.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.