[ 
https://issues.apache.org/jira/browse/CONNECTORS-981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14042626#comment-14042626
 ] 

Karl Wright commented on CONNECTORS-981:
----------------------------------------

Hi Alessandro,

bq.  So the only way is to transform the stream into a String and send it 
within the SolrInputDocument.
I understand your concern, but thinking to the real use case this is what will 
happen :

bq. 1) The Tika connector will parse the file and get the decoded stream in 
utf-8.

Yes, and the Tika connector is constructed so that this does not hit memory; it 
comes out as a character stream from Tika.

bq. 2) If it succeeds, it stores in the RepositoryDocument the textual content 
in a field; if not, the stream will remain as a stream (which means that we 
don't want to index it in Solr)

Not exactly; on success there is a stream of utf-8-encoded characters.  The 
document is not loaded into a string.

bq.  3) Solr will take the SolrInputDocument from the RepositoryDocument and 
index it .

To do that you will have to read the entire stream into memory.

bq. Of course with enormous textual file, the memory consumption will be more, 
but at that point the user has to simply configure the JVM properly as he will 
know that he's going to index big amount of data.

How much memory do you suggest the user increase their JVM size by?  Can you 
give me a fixed value?  No, I don't think you can -- and that is the point.  
You would *have* to put an upper limit on the size of a document in order to 
guarantee ANY JVM size limit.

I've done some research too -- Solr developers realized they had precisely 
this problem, so modern versions of Solr (4.7+) have a configurable maximum 
document size.  Beyond that size, the document is rejected.  The only way we 
could use SolrInputDocument would be to do exactly the same thing in 
ManifoldCF.  Not very friendly, but that's about the only answer given Solr's 
architecture at the moment.
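
To make the trade-off concrete, here is a minimal Java sketch of what such a 
limit could look like: the character stream is read into a String only up to 
a configured maximum, and the document is rejected beyond that.  The class, 
method, and exception names are hypothetical and are not part of ManifoldCF 
or SolrJ; the limit value would have to come from connector configuration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class BoundedReaderExample {

    /** Hypothetical exception signalling that a document exceeded the limit. */
    public static class DocumentTooLargeException extends IOException {
        public DocumentTooLargeException(String message) {
            super(message);
        }
    }

    /**
     * Reads the whole character stream into a String, but gives up as soon
     * as more than maxChars characters have been seen.  Memory use is then
     * bounded by maxChars plus one read buffer, whatever the document size.
     */
    public static String readBounded(Reader reader, int maxChars) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[8192];
        int n;
        while ((n = reader.read(buffer)) != -1) {
            if (sb.length() + n > maxChars) {
                throw new DocumentTooLargeException(
                    "Document exceeds " + maxChars + " characters; rejecting");
            }
            sb.append(buffer, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Within the limit: the content comes back as a String.
        System.out.println(readBounded(new StringReader("small document"), 1024));

        // Over the limit: the document is rejected instead of being buffered.
        try {
            readBounded(new StringReader("this document is too large"), 10);
        } catch (DocumentTooLargeException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

The resulting String could then be placed in a SolrInputDocument field, but 
any document over the limit would simply not be indexed, which is the 
unfriendly part.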


> Solr Connector - classic Solrj SolrInputDocument support
> --------------------------------------------------------
>
>                 Key: CONNECTORS-981
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-981
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Lucene/SOLR connector
>    Affects Versions: ManifoldCF 1.7
>            Reporter: Alessandro Benedetti
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 1.7
>
>         Attachments: CONNECTORS-981.patch
>
>
> The solr connector, according with the development of the Tika Connector 
> processor, should be able to operate in 2 ways :
> 1) as usual
> 2) using the classic Solrj SolrInputDocument approach with already extracted 
> metadata
> To allow the choice a flag will be added in the UI in the mapping tab ( as 
> it's related with how the fields will be processed)



--
This message was sent by Atlassian JIRA
(v6.2#6252)
