[ https://issues.apache.org/jira/browse/CONNECTORS-981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14042626#comment-14042626 ]
Karl Wright commented on CONNECTORS-981:
----------------------------------------

Hi Alessandro,

bq. So the only way is to transform the stream into a String and send it within the SolrInputDocument. I understand your concern, but thinking to the real use case this is what will happen :
bq. 1) The Tika connector will parse the file and get the decoded stream in utf-8.

Yes, and the Tika connector is constructed so that this does not hit memory; it comes out as a character stream from Tika.

bq. 2) If it success, it stores in the RepositoryDocument the textual content in a field, if not, the stream will remain as a stream ( which means that we don't want to index in Solr)

Not exactly; on success there is a stream of UTF-8-encoded characters. The document is not loaded into a string.

bq. 3) Solr will take the SolrInputDocument from the RepositoryDocument and index it .

To do that you will have to read the entire stream into memory.

bq. Of course with enormous textual file, the memory consumption will be more, but at that point the user has to simply configure the JVM properly as he will know that he's going to index big amount of data.

How much memory do you suggest users add to their JVM? Can you give me a fixed value? No, I don't think you can -- and that is the point. You would *have* to put an upper limit on the size of a document in order to guarantee ANY JVM size limit.

I've done some research too -- Solr developers realized they had precisely this problem, so modern versions of Solr (4.7+) have a configurable maximum document size. Beyond that size, the document is rejected. The only way we could use SolrInputDocument would be to do exactly the same thing in ManifoldCF. Not very friendly, but that's about the only answer given Solr's architecture at the moment.
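To make the size-limit argument concrete, here is a minimal sketch of what "do exactly the same thing in ManifoldCF" could look like. It is not ManifoldCF or Solr code: the class name BoundedStreamReader, the method readBounded, the exception DocumentTooLargeException, and the maxChars parameter are all hypothetical names chosen for illustration. It uses only java.io, reads a character stream into a String, and rejects the document as soon as the configured cap would be exceeded, so memory use per document stays bounded.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical helper, for illustration only; not part of ManifoldCF or SolrJ.
class BoundedStreamReader {

    /** Thrown when the stream holds more characters than the configured cap. */
    public static class DocumentTooLargeException extends IOException {
        public DocumentTooLargeException(String message) {
            super(message);
        }
    }

    /**
     * Reads the whole character stream into a String, but gives up as soon as
     * the content would exceed maxChars. At most maxChars characters (plus one
     * read buffer) are ever held in memory.
     */
    public static String readBounded(Reader reader, int maxChars) throws IOException {
        StringBuilder content = new StringBuilder();
        char[] buffer = new char[8192];
        int read;
        while ((read = reader.read(buffer)) != -1) {
            // Check before appending so the builder never grows past the cap.
            if ((long) content.length() + read > maxChars) {
                throw new DocumentTooLargeException(
                    "Document exceeds " + maxChars + " characters");
            }
            content.append(buffer, 0, read);
        }
        return content.toString();
    }

    public static void main(String[] args) throws IOException {
        // A small document fits under the cap and is returned as a String.
        System.out.println(readBounded(new StringReader("small document"), 1024));

        // A document larger than the cap is rejected instead of being buffered.
        try {
            readBounded(new StringReader("this text is longer than the cap"), 8);
        } catch (DocumentTooLargeException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The returned String is the kind of value that would then be set as a field on a SolrInputDocument. Without a cap like maxChars, no fixed JVM heap size can be guaranteed to be enough.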
> Solr Connector - classic Solrj SolrInputDocument support
> --------------------------------------------------------
>
>                 Key: CONNECTORS-981
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-981
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Lucene/SOLR connector
>   Affects Versions: ManifoldCF 1.7
>            Reporter: Alessandro Benedetti
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 1.7
>
>         Attachments: CONNECTORS-981.patch
>
>
> The solr connector, according with the development of the Tika Connector
> processor, should be able to operate in 2 ways :
> 1) as usual
> 2) using the classic Solrj SolrInputDocument approach with already extracted
> metadata
> To allow the choice a flag will be added in the UI in the mapping tab ( as
> it's related with how the fields will be processed)