[ https://issues.apache.org/jira/browse/CONNECTORS-1008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Karl Wright updated CONNECTORS-1008:
------------------------------------
Description:
The Solr Connector, when receiving documents that contain special characters,
does not send them to Solr correctly for indexing. I have verified that the
upstream Tika extractor does its job perfectly, so the problem is either in
the Solr Connector or in SolrJ.
was:
The Tika extractor, when extracting content from a PDF (specifically, the en_US
end-user-documentation pdf), does not handle anything other than Latin-1
characters properly. For example, it does not convert the copyright symbol, or
any CJK characters, to utf-8.
Priority: Critical (was: Major)
Summary: Solr connector via /update handler doesn't seem to handle
special characters correctly (was: Tika Extractor doesn't seem to handle
special characters correctly)
> Solr connector via /update handler doesn't seem to handle special characters
> correctly
> --------------------------------------------------------------------------------------
>
> Key: CONNECTORS-1008
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1008
> Project: ManifoldCF
> Issue Type: Bug
> Components: Lucene/SOLR connector
> Affects Versions: ManifoldCF 1.7
> Reporter: Karl Wright
> Assignee: Karl Wright
> Priority: Critical
> Fix For: ManifoldCF 1.7
>
>
> The Solr Connector, when receiving documents that contain special characters,
> does not send them to Solr correctly for indexing. I have verified that the
> upstream Tika extractor does its job perfectly, so the problem is either in
> the Solr Connector or in SolrJ.
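
For illustration only, the sketch below shows two generic ways that correctly
extracted text can be damaged on its way to an indexer: the receiver decoding
UTF-8 bytes as ISO-8859-1, or the sender encoding as ISO-8859-1 and losing
everything outside that range. It is not a diagnosis of this issue and does
not use ManifoldCF, SolrJ, or Solr; the sample string and variable names are
assumptions chosen for the example.

```python
# Illustration only: how a charset mismatch garbles text between two components.
# The sample text is hypothetical: a copyright sign followed by two CJK characters.
text = "\u00a9 \u4e2d\u6587"

# What a correct sender would put on the wire.
utf8_bytes = text.encode("utf-8")

# Correct round trip: decode with the same charset that was used to encode.
round_trip = utf8_bytes.decode("utf-8")

# Failure mode 1: the receiver assumes ISO-8859-1. Every UTF-8 byte becomes
# its own character, so the copyright sign shows up as two characters.
garbled = utf8_bytes.decode("iso-8859-1")

# Failure mode 2: the sender encodes as ISO-8859-1. The copyright sign fits in
# one byte, but the CJK characters cannot be represented and become '?'.
lossy = text.encode("iso-8859-1", errors="replace")

print(repr(round_trip))
print(repr(garbled))
print(lossy)
```

Comparing the bytes actually sent against the UTF-8 encoding of the extracted
text is one way to tell which side of the connection introduces the damage.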