[ https://issues.apache.org/jira/browse/CONNECTORS-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14615489#comment-14615489 ]

Shinichiro Abe commented on CONNECTORS-1219:
--------------------------------------------

Thank you for the review. I added the Maximumdocumentlength parameter and 
field to the branch in r1689479.

It seems to me that the isInteger() function in editconnection.jsp doesn't 
strictly check for an integer value, IIUC. Is that expected? A long value could 
also pass the Solr connector's max length check in the JSP.
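
To illustrate what a strict check could look like, here is a minimal sketch on the Java side (the JSP check itself is JavaScript and is not reproduced here). MaxLengthValidator and parseStrictInt are hypothetical names, not existing ManifoldCF code; the sketch assumes the field should accept only non-negative decimal values that fit in an int.

```java
import java.util.OptionalInt;

// Hypothetical helper, not part of ManifoldCF: accepts only a non-negative
// decimal integer that fits in a Java int, and rejects long-sized values.
final class MaxLengthValidator {

    private MaxLengthValidator() {
    }

    static OptionalInt parseStrictInt(String raw) {
        if (raw == null) {
            return OptionalInt.empty();
        }
        String s = raw.trim();
        // ASCII digits only: no sign, no decimal point, no exponent.
        if (!s.matches("[0-9]{1,10}")) {
            return OptionalInt.empty();
        }
        try {
            return OptionalInt.of(Integer.parseInt(s));
        } catch (NumberFormatException e) {
            // Ten digits but larger than Integer.MAX_VALUE, e.g. 2147483648.
            return OptionalInt.empty();
        }
    }
}
```

With this sketch, parseStrictInt("2147483647") yields a value, while "2147483648", "12.5" and "-1" are all rejected.
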
BTW, if Integer.MAX_VALUE were used for the field, the StringBuilder 
initialization would raise an OOM when adding a big binary through the 
connection, because the char array would exceed its maximum capacity.
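
One way to sidestep the StringBuilder problem is to keep the configured limit separate from the initial capacity. The following is only a sketch under that assumption: BoundedTextReader, readUpTo and the 8192-char capacity cap are made up for illustration and are not the connector's actual code.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

// Hypothetical helper, not part of ManifoldCF: reads at most maxLength chars
// without pre-allocating a buffer of maxLength chars.
final class BoundedTextReader {

    // Assumed cap for the initial capacity; the builder still grows on demand.
    private static final int INITIAL_CAPACITY_CAP = 8192;

    private BoundedTextReader() {
    }

    static String readUpTo(Reader in, int maxLength) {
        if (maxLength < 0) {
            throw new IllegalArgumentException("maxLength must be >= 0");
        }
        // new StringBuilder(Integer.MAX_VALUE) would try to allocate the whole
        // char array up front, so only the initial capacity is capped here.
        StringBuilder sb = new StringBuilder(Math.min(maxLength, INITIAL_CAPACITY_CAP));
        char[] buf = new char[4096];
        try {
            while (sb.length() < maxLength) {
                int want = Math.min(buf.length, maxLength - sb.length());
                int n = in.read(buf, 0, want);
                if (n < 0) {
                    break; // end of input
                }
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }
}
```

For example, readUpTo(new StringReader("abcdefgh"), 3) returns "abc", and passing Integer.MAX_VALUE as the limit no longer forces a huge allocation for a small input.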

Big binaries can now be rejected from ingestion by setting the max length, but 
I found other OOMs that were caused by Lucene.

{noformat}
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
        at org.apache.lucene.codecs.compressing.CompressingTermVectorsWriter$FieldData.<init>(CompressingTermVectorsWriter.java:157)
        at org.apache.lucene.codecs.compressing.CompressingTermVectorsWriter$DocData.addField(CompressingTermVectorsWriter.java:106)
        at org.apache.lucene.codecs.compressing.CompressingTermVectorsWriter.startField(CompressingTermVectorsWriter.java:287)
        at org.apache.lucene.index.TermVectorsConsumerPerField.finishDocument(TermVectorsConsumerPerField.java:81)
        at org.apache.lucene.index.TermVectorsConsumer.finishDocument(TermVectorsConsumer.java:110)
        at org.apache.lucene.index.TermsHash.finishDocument(TermsHash.java:93)
        at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:316)
        at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:232)
        at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:458)
        at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1363)
        at org.apache.manifoldcf.agents.output.lucene.LuceneClient.addOrReplace(LuceneClient.java:321)
        at org.apache.manifoldcf.agents.output.lucene.LuceneConnector.addOrReplaceDocumentWithException(LuceneConnector.java:333)
        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddEntryPoint.addOrReplaceDocumentWithException(IncrementalIngester.java:3221)
{noformat}
To avoid this OOM, I will add a term_vector (true|false) option to the fields.
{noformat}
Caused by: java.lang.OutOfMemoryError: Java heap space
        at org.apache.lucene.util.ArrayUtil.grow(ArrayUtil.java:345)
        at org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.writeField(CompressingStoredFieldsWriter.java:297)
        at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:361)
        at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:300)
        at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:232)
        at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:458)
        at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1363)
        at org.apache.manifoldcf.agents.output.lucene.LuceneClient.addOrReplace(LuceneClient.java:321)
        at org.apache.manifoldcf.agents.output.lucene.LuceneConnector.addOrReplaceDocumentWithException(LuceneConnector.java:333)
        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester
{noformat}
This OOM could be resolved by setting a Tika write limit.
 

> Lucene Output Connector
> -----------------------
>
>                 Key: CONNECTORS-1219
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1219
>             Project: ManifoldCF
>          Issue Type: New Feature
>            Reporter: Shinichiro Abe
>            Assignee: Shinichiro Abe
>         Attachments: CONNECTORS-1219-v0.1patch.patch, 
> CONNECTORS-1219-v0.2.patch
>
>
> An output connector that writes directly to a local Lucene index, not via a 
> remote search engine. It would be nice if we could use the various Lucene 
> APIs on the index directly, even though we could do the same thing with a 
> Solr or Elasticsearch index. I assume we could do things like classification, 
> categorization, and tagging, using e.g. the lucene-classification package.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
