[ https://issues.apache.org/jira/browse/CONNECTORS-916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14009613#comment-14009613 ]
Karl Wright commented on CONNECTORS-916:
----------------------------------------
Another major point: Tika, it seems to me, may well process documents
completely in memory. (Its use of DOM object models makes me suspicious.) In a
massively multithreaded environment, this can make ManifoldCF's memory
footprint O(N * M), where N is the number of worker threads and M is the
maximum size of any document being processed. Since documents are sometimes
gigabytes in size, ManifoldCF would obviously run out of memory. Heretofore,
we've tried *very* hard to keep our connectors bounded in memory for this
reason.
Takumi, if you could research whether Tika indeed behaves this way, it would be
great. If it does, we should take steps to pool Tika instances and limit their
count, to avoid an unrealistically large memory footprint.
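
To make the idea of limiting the count concrete, here is a minimal sketch of
one way it could be done, assuming that each parse may buffer a whole document
in memory. The class BoundedParseGate and its methods are hypothetical; they
are not part of ManifoldCF or Tika, and the parse itself is passed in as a
plain Callable rather than a real Tika call. It uses a semaphore to cap
concurrent parses rather than pooling actual parser objects.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.Semaphore;

/**
 * Hypothetical helper (not a ManifoldCF or Tika class): caps how many
 * potentially memory-hungry parses may run at the same time.
 */
final class BoundedParseGate {
    private final Semaphore permits;

    BoundedParseGate(int maxConcurrentParses) {
        if (maxConcurrentParses < 1) {
            throw new IllegalArgumentException("maxConcurrentParses must be >= 1");
        }
        // Fair semaphore, so waiting worker threads are served in arrival order.
        this.permits = new Semaphore(maxConcurrentParses, true);
    }

    /** Runs the task while holding a permit; blocks while the cap is reached. */
    <T> T parse(Callable<T> parseTask) throws Exception {
        permits.acquire();
        try {
            return parseTask.call();
        } finally {
            // Always give the permit back, even if the parse throws.
            permits.release();
        }
    }

    /** Number of parses that could start right now without blocking. */
    int availablePermits() {
        return permits.availablePermits();
    }

    /** Worst case if every concurrent parse holds one maximum-size document. */
    static long worstCaseBytes(int concurrentParses, long maxDocumentBytes) {
        return Math.multiplyExact((long) concurrentParses, maxDocumentBytes);
    }
}
```

With purely illustrative numbers, 30 worker threads and 2 GiB documents give a
worst case of 60 GiB if parsing is unbounded, while a gate of 4 concurrent
parses would bring that down to 8 GiB regardless of the worker thread count.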
> Amazon CloudSearch output connector
> -----------------------------------
>
> Key: CONNECTORS-916
> URL: https://issues.apache.org/jira/browse/CONNECTORS-916
> Project: ManifoldCF
> Issue Type: New Feature
> Components: Amazon CloudSearch output connector
> Affects Versions: ManifoldCF 1.7
> Reporter: Takumi Yoshida
> Assignee: Karl Wright
> Fix For: ManifoldCF 1.7
>
> Attachments: 0507.diff, 0520.diff, 0520_2.diff, 1.patch, 2.diff,
> 3.diff, AmazonCloudSearchParam.java, AmazonCloudSearchSpecs.java,
> exception_handling.diff, exception_handling_2.diff, licenselist.txt
>
>
> I wrote some code snippets for an output connector for Amazon CloudSearch.
> I would like you to review my code. You can crawl a web site and feed HTML pages
> to Amazon CloudSearch.
> However, it is not yet complete, for the following reasons:
> - No code has been written for the configuration page.
> - The only supported file type is HTML.
> Thank you for your time,
> Takumi Yoshida