[
https://issues.apache.org/jira/browse/CONNECTORS-1008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14094158#comment-14094158
]
Karl Wright commented on CONNECTORS-1008:
-----------------------------------------
The problem is definitely in SolrJ. The content is added as follows:
{code}
// Debug dump: append the extracted content to a file so its encoding
// can be inspected before the document is handed to SolrJ.
File f = new File("outputlog.log");
OutputStream os = new FileOutputStream(f, true);
Writer w = new OutputStreamWriter(os, "utf-8");
w.write(sb.toString());
w.close(); // flushes the writer and closes the underlying stream
outputDoc.addField( contentAttributeName, sb.toString() );
{code}
... where outputDoc is a SolrInputDocument.
The outputlog.log file is written in perfectly valid utf-8, but by the time the
SolrInputDocument is handled by SolrJ and unpacked on the Solr side, the content is
corrupted. Looks like another Solr ticket should be created.
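To confirm that the string itself survives intact up to the point it is handed to SolrJ, a utf-8 round-trip check can be used. This is a minimal standalone sketch, not connector code; the class name and sample text are made up for illustration:
{code}
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    // Encodes the string to utf-8 bytes and decodes it back; a lossless
    // round trip shows the corruption is not in the string itself.
    static boolean roundTripsCleanly(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8).equals(s);
    }

    public static void main(String[] args) {
        String sample = "r\u00e9sum\u00e9 \u2013 sm\u00f6rg\u00e5sbord";
        System.out.println(roundTripsCleanly(sample)); // prints true
    }
}
{code}
If the round trip is clean here but the indexed content is garbled, the damage happens inside SolrJ's request serialization or on the Solr side.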
The workaround is to use the extracting update handler even with Tika output.
In this case everything should work properly.
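The extracting-handler workaround can also be exercised directly against a Solr instance for verification. A minimal sketch, assuming a Solr core named collection1 on localhost and a hypothetical document.txt holding the Tika-extracted text (names and port are placeholders, not from the connector):
{code}
# Post raw text to the extracting update handler with an explicit
# utf-8 charset; literal.id supplies the document's unique key.
curl "http://localhost:8983/solr/collection1/update/extract?literal.id=doc1&commit=true" \
  -H "Content-Type: text/plain; charset=UTF-8" \
  --data-binary @document.txt
{code}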
> Solr connector via /update handler doesn't seem to handle special characters correctly
> ---------------------------------------------------------------------------------------
>
> Key: CONNECTORS-1008
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1008
> Project: ManifoldCF
> Issue Type: Bug
> Components: Lucene/SOLR connector
> Affects Versions: ManifoldCF 1.7
> Reporter: Karl Wright
> Assignee: Karl Wright
> Priority: Critical
> Fix For: ManifoldCF 1.7
>
>
> The Solr Connector, when receiving documents that have special characters,
> does not manage to send these appropriately to Solr for indexing. I have
> verified that the upstream Tika extractor does its job perfectly, so the
> problem is either in the Solr Connector or in SolrJ.
--
This message was sent by Atlassian JIRA
(v6.2#6252)