[
https://issues.apache.org/jira/browse/CONNECTORS-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14614953#comment-14614953
]
Karl Wright commented on CONNECTORS-1219:
-----------------------------------------
Hi Abe-san,
Looking at this code:
{code}
private LuceneDocument buildDocument(String documentURI, RepositoryDocument document) throws Exception {
  LuceneDocument doc = new LuceneDocument();
  doc = LuceneDocument.addField(doc, client.idField(), documentURI, client.fieldsInfo());
  try {
    Reader r = new InputStreamReader(document.getBinaryStream(), StandardCharsets.UTF_8);
    StringBuilder sb = new StringBuilder((int)document.getBinaryLength());
    char[] buffer = new char[65536];
    while (true) {
      int amt = r.read(buffer, 0, buffer.length);
      if (amt == -1)
        break;
      sb.append(buffer, 0, amt);
    }
    doc = LuceneDocument.addField(doc, client.contentField(), sb.toString(), client.fieldsInfo());
  } catch (Exception e) {
    if (e instanceof IOException) {
      Logging.connectors.error("[Parsing Content]Content is not text plain, verify you are properly using Apache Tika Transformer " + documentURI, e);
    } else {
      throw e;
    }
  }
  Iterator<String> it = document.getFields();
  while (it.hasNext()) {
    String rdField = it.next();
    if (client.fieldsInfo().containsKey(rdField)) {
      try {
        String[] values = document.getFieldAsStrings(rdField);
        for (String value : values) {
          doc = LuceneDocument.addField(doc, rdField, value, client.fieldsInfo());
        }
      } catch (IOException e) {
        Logging.connectors.error("[Getting Field Values]Impossible to read value for metadata " + rdField + " " + documentURI, e);
      }
    }
  }
  return doc;
}
{code}
As you can see, there is no limit on the amount of memory needed to index a single document: a 10GB document would require 10GB or more of memory. The worst case also scales with the number of worker threads -- if all 30 worker threads happened to be indexing a 10GB document at the same time, the memory requirement would be 300GB. There is simply no heap size you could set that would work reliably.
We have this problem with the Solr connector too, when the extracting update handler is not used -- in that case, we *require* the user to set a maximum file length. Even that is not a good solution, but it is the only one possible given Solr's standard update handler architecture. A Lucene connector would need similar required constraints.
> Lucene Output Connector
> -----------------------
>
> Key: CONNECTORS-1219
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1219
> Project: ManifoldCF
> Issue Type: New Feature
> Reporter: Shinichiro Abe
> Assignee: Shinichiro Abe
> Attachments: CONNECTORS-1219-v0.1patch.patch,
> CONNECTORS-1219-v0.2.patch
>
>
> An output connector that writes to a local Lucene index directly, rather
> than going through a remote search engine. It would be nice if we could use
> Lucene's various APIs on the index directly, even though we could do the
> same thing against a Solr or Elasticsearch index. I assume we can do
> something with classification, categorization, and tagging, using e.g. the
> lucene-classification package.