[jira] [Commented] (CONNECTORS-1233) AmazonS3 Repository Connector

Karl Wright (JIRA) Sun, 30 Aug 2015 23:27:07 -0700

    [ 
https://issues.apache.org/jira/browse/CONNECTORS-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14723035#comment-14723035
 ]


Karl Wright commented on CONNECTORS-1233:
-----------------------------------------

[~kbird]: Found another, more serious, problem.

The Amazon API provides document contents as a stream.  This is good, because 
documents are often quite large.  The ManifoldCF RepositoryDocument object 
accepts documents as a stream, so that's all good, right?  Except that you 
aren't taking advantage of this.  Instead, you are reading the document into 
memory, and then writing it back out, for example:

{code}
        // tika works starts
        InputStream in = null;
        ByteArrayOutputStream bao = new ByteArrayOutputStream();

        String document = null;

        try {
          in = s3Obj.getObjectContent();
          IOUtils.copy(in, bao);
          long fileLength = bao.size();
          if(fileLength < 1)
          {
            Logging.connectors.warn("File length 0");
            continue;
          }
{code}

If you are doing this to obtain the document's size, then by ManifoldCF 
conventions you should be writing to a locally created temporary file (making 
certain to clean up that file in a try/finally block).  But if there's another 
way to get the binary size, you should instead be streaming from the input to 
the output, without going through any intermediate other than a fixed-size 
buffer (64K usually suffices).

This is critical in order to maintain the principle of "bounding": for a given 
configuration, you should be able to determine MCF's memory requirement as a 
fixed number.  The way it is coded now, the memory requirement grows with the 
size of the largest document crawled, so it is completely unbounded.
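
To make the bounding point concrete, here is a minimal sketch of a streaming copy whose memory use is fixed by the buffer size alone. The class name BoundedCopy is hypothetical. Note that the copy loop is not the problem in the snippet quoted earlier, since IOUtils.copy also uses a fixed buffer internally; the unbounded part is the ByteArrayOutputStream destination, which grows to hold the entire document.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper: copies a stream using only a fixed-size buffer.
final class BoundedCopy {
  static final int BUFFER_SIZE = 64 * 1024;

  // Returns the number of bytes copied. Memory held by this method is
  // BUFFER_SIZE bytes regardless of how large the input is, provided the
  // destination itself does not accumulate the data in memory.
  static long copy(InputStream in, OutputStream out) throws IOException {
    byte[] buffer = new byte[BUFFER_SIZE];
    long total = 0L;
    int n;
    while ((n = in.read(buffer)) != -1) {
      out.write(buffer, 0, n);
      total += n;
    }
    return total;
  }
}
```

With a file or network stream as the destination, the per-document memory cost is a constant 64K, so the overall requirement can be computed from the configuration (for example, buffer size times the number of worker threads).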


> AmazonS3 Repository Connector
> -----------------------------
>
>                 Key: CONNECTORS-1233
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1233
>             Project: ManifoldCF
>          Issue Type: New Feature
>            Reporter: Gunaratnam Kuhajeyan
>            Assignee: Karl Wright
>              Labels: features
>             Fix For: ManifoldCF 2.3
>
>         Attachments: amazons3patch.diff, amazons3patchnew1.diff, 
> dependencies.docx, patch-removed-unwanted-dependencies-connector-1233.diff, 
> patch-tikaremoved.diff
>
>   Original Estimate: 240h
>  Remaining Estimate: 240h
>
> Feature Patch 
> AmazonS3 Repository Connector
> A. Overview
> 1. Connects to Amazon S3 buckets and indexes the artifacts. Any buckets to 
> be avoided can be skipped (this can be configured in the job).
> 2. Internally, documents are parsed and metadata is extracted using Tika.
> 3. Supported locale: English US (currently only common_en_US.properties is 
> available; looking for someone to help translate the keys).
> B. Documentation - Work in progress; it will be attached to the issue in the 
> coming days.
> C. Dependencies - (common-lib)
> 1. aws-java-sdk-{version}.jar
> 2. aws-java-sdk-core-{version}.jar
> 3. aws-java-sdk-s3-{version}.jar
> 4. joda-time-2.2.jar
> D. Connectors.xml
>  <!-- Add your authority connectors here -->
> <authorityconnector name="Amazons3" 
> class="org.apache.manifoldcf.authorities.authorities.amazons3.AmazonS3Authority"/>
>   
> <!-- Add your repository connectors here -->
> <repositoryconnector name="AmazonS3" 
> class="org.apache.manifoldcf.crawler.connectors.amazons3.AmazonS3Connector"/>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
