[ 
https://issues.apache.org/jira/browse/CONNECTORS-916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14009613#comment-14009613
 ] 

Karl Wright commented on CONNECTORS-916:
----------------------------------------

Another major point:  Tika, it seems to me, may well process documents 
completely in memory.  (Its use of DOM object models makes me suspicious.)  In a 
massively multithreaded environment, this makes ManifoldCF's memory 
footprint O(N * M), where N is the number of worker threads and M is the 
maximum size of any document being processed.  Since documents are sometimes 
gigabytes in size, ManifoldCF would obviously run out of memory.  Heretofore, 
we've tried *very* hard to keep our connectors bounded in memory for this 
reason.

Takumi, if you can research whether Tika indeed behaves this way, that would be 
great.  If it does, we should take steps to pool Tika instances and limit their 
count, to avoid an unrealistically large memory footprint.
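
To make the idea concrete, here is a rough sketch of the kind of limit I have 
in mind.  It is an illustration only: BoundedExtractorPool is a hypothetical 
helper, not existing ManifoldCF or Tika code, and the extractor function 
stands in for whatever call actually invokes Tika.  A counting semaphore caps 
the number of simultaneous extractions at K, so the worst-case footprint 
becomes roughly O(K * M) instead of O(N * M), no matter how many worker 
threads there are.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Function;

/**
 * Hypothetical helper (not part of ManifoldCF or Tika): caps how many
 * extractions run at the same time, so that at most maxConcurrent
 * documents are held in memory by the extractor at once.
 */
final class BoundedExtractorPool<I, O> {
    private final Semaphore permits;
    private final Function<I, O> extractor;

    BoundedExtractorPool(int maxConcurrent, Function<I, O> extractor) {
        if (maxConcurrent < 1) {
            throw new IllegalArgumentException("maxConcurrent must be >= 1");
        }
        // Fair semaphore: waiting worker threads are served in arrival order.
        this.permits = new Semaphore(maxConcurrent, true);
        this.extractor = extractor;
    }

    /** Blocks until a slot is free, runs the extraction, then frees the slot. */
    O extract(I input) {
        permits.acquireUninterruptibly();
        try {
            return extractor.apply(input);
        } finally {
            // Always give the slot back, even if the extraction throws.
            permits.release();
        }
    }

    /** Number of extraction slots currently unused. */
    int availablePermits() {
        return permits.availablePermits();
    }
}
```

Worker threads would all share one instance and call extract(); threads 
beyond the configured limit simply wait their turn.  Whether a hard cap like 
this is needed, or whether Tika can be made to stream, depends on what the 
research above turns up.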

> Amazon CloudSearch output connector
> -----------------------------------
>
>                 Key: CONNECTORS-916
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-916
>             Project: ManifoldCF
>          Issue Type: New Feature
>          Components: Amazon CloudSearch output connector
>    Affects Versions: ManifoldCF 1.7
>            Reporter: Takumi Yoshida
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 1.7
>
>         Attachments: 0507.diff, 0520.diff, 0520_2.diff, 1.patch, 2.diff, 
> 3.diff, AmazonCloudSearchParam.java, AmazonCloudSearchSpecs.java, 
> exception_handling.diff, exception_handling_2.diff, licenselist.txt
>
>
> I wrote some code snippets for an output connector for Amazon CloudSearch.
> I would like you to review my code. You can crawl a web site and feed HTML pages 
> to Amazon CloudSearch.
> However, it is not yet complete, for the following reasons:
> - There is no code yet for the configuration page.
> - The only supported file type is HTML.
> Thank you for your time,
>  Takumi Yoshida



--
This message was sent by Atlassian JIRA
(v6.2#6252)
