[ https://issues.apache.org/jira/browse/CONNECTORS-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14614953#comment-14614953 ]

Karl Wright commented on CONNECTORS-1219:
-----------------------------------------

Hi Abe-san,

Looking at this code:

{code}
  private LuceneDocument buildDocument(String documentURI, RepositoryDocument document) throws Exception {
    LuceneDocument doc = new LuceneDocument();

    doc = LuceneDocument.addField(doc, client.idField(), documentURI, client.fieldsInfo());

    try
    {
      Reader r = new InputStreamReader(document.getBinaryStream(), StandardCharsets.UTF_8);
      StringBuilder sb = new StringBuilder((int)document.getBinaryLength());
      char[] buffer = new char[65536];
      while (true)
      {
        int amt = r.read(buffer,0,buffer.length);
        if (amt == -1)
          break;
        sb.append(buffer,0,amt);
      }
      doc = LuceneDocument.addField(doc, client.contentField(), sb.toString(), client.fieldsInfo());
    } catch (Exception e) {
      if (e instanceof IOException) {
        Logging.connectors.error("[Parsing Content]Content is not text plain, verify you are properly using Apache Tika Transformer " + documentURI, e);
      } else {
        throw e;
      }
    }

    Iterator<String> it = document.getFields();
    while (it.hasNext()) {
      String rdField = it.next();
      if (client.fieldsInfo().containsKey(rdField)) {
        try
        {
          String[] values = document.getFieldAsStrings(rdField);
          for (String value : values) {
            doc = LuceneDocument.addField(doc, rdField, value, client.fieldsInfo());
          }
        } catch (IOException e) {
          Logging.connectors.error("[Getting Field Values]Impossible to read value for metadata " + rdField + " " + documentURI, e);
        }
      }
    }
    return doc;
  }
{code}

As you can see, there is no limit on the amount of memory required to index a 
single document, because the entire content is buffered into a StringBuilder.  
A 10 GB document would require 10 GB or more of heap.  The potential memory 
use also scales with the number of worker threads -- if all 30 worker threads 
happen to be indexing 10 GB documents at the same time, the requirement 
becomes 300 GB.  Indeed, there is no heap size you could set that would work 
reliably.
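To make the failure mode concrete, here is a minimal sketch of the kind of guard that is missing.  The maxContentBytes limit and the helper itself are hypothetical, not something in the attached patch; the sketch uses only the RepositoryDocument calls already shown above:

{code}
  // Minimal sketch of a pre-check before buffering.  The maxContentBytes limit
  // is a hypothetical configuration value, not part of the attached patch.
  private String readContentOrNull(String documentURI, RepositoryDocument document,
      long maxContentBytes) throws IOException {
    if (document.getBinaryLength() > maxContentBytes) {
      // Refuse to buffer oversized documents rather than allocating unbounded memory.
      Logging.connectors.warn("Skipping content of oversized document " + documentURI);
      return null;
    }
    Reader r = new InputStreamReader(document.getBinaryStream(), StandardCharsets.UTF_8);
    StringBuilder sb = new StringBuilder((int)document.getBinaryLength());
    char[] buffer = new char[65536];
    int amt;
    while ((amt = r.read(buffer, 0, buffer.length)) != -1) {
      sb.append(buffer, 0, amt);
    }
    return sb.toString();
  }
{code}

Even a guard like this only limits the per-document cost; the per-thread multiplication still applies, which is why the limit has to be a required, user-visible setting.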

We have this problem with the Solr connector too, when the extracting update 
handler is not used -- in that case, we *require* the user to set a maximum 
file length value.  Even that is not a good solution, but it is the only one 
possible given the architecture of Solr's standard update handler.  For a 
Lucene connector, we would need similar required constraints.
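For comparison, the Solr-style constraint amounts to rejecting oversized documents before their content is ever streamed to the connector.  A rough sketch of what that could look like here, assuming the checkLengthIndexable() hook from the output connector framework and a hypothetical maxDocumentLength setting (the exact signature should be checked against the framework version in use):

{code}
  // Rough sketch, modeled on the Solr connector's required max-file-length
  // check.  maxDocumentLength is a hypothetical user-configured setting.
  @Override
  public boolean checkLengthIndexable(VersionContext outputDescription, long length,
      IOutputCheckActivity activities) throws ManifoldCFException, ServiceInterruption {
    // Reject oversized documents up front, before their content is streamed
    // to buildDocument() and buffered in memory.
    if (length > maxDocumentLength)
      return false;
    return super.checkLengthIndexable(outputDescription, length, activities);
  }
{code}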




> Lucene Output Connector
> -----------------------
>
>                 Key: CONNECTORS-1219
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1219
>             Project: ManifoldCF
>          Issue Type: New Feature
>            Reporter: Shinichiro Abe
>            Assignee: Shinichiro Abe
>         Attachments: CONNECTORS-1219-v0.1patch.patch, 
> CONNECTORS-1219-v0.2.patch
>
>
> An output connector that writes to a local Lucene index directly, rather 
> than going through a remote search engine. It would be nice if we could use 
> Lucene's various APIs on the index directly, even though we could do the 
> same thing against a Solr or Elasticsearch index. I assume we could do 
> something for classification, categorization, and tagging using e.g. the 
> lucene-classification package.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)