[ 
https://issues.apache.org/jira/browse/CONNECTORS-1008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-1008:
------------------------------------

    Description: 
The Solr Connector, when receiving documents that contain special characters, 
does not send them to Solr correctly for indexing.  I have verified that the 
upstream Tika extractor handles these characters correctly, so the problem is 
either in the Solr Connector or in SolrJ.



  was:
The Tika extractor, when extracting content from a PDF (specifically, the en_US 
end-user-documentation pdf), does not handle anything other than Latin-1 
characters properly.  For example, it does not convert the copyright symbol or 
any CJK characters to UTF-8.


       Priority: Critical  (was: Major)
        Summary: Solr connector via /update handler doesn't seem to handle 
special characters correctly  (was: Tika Extractor doesn't seem to handle 
special characters correctly)

> Solr connector via /update handler doesn't seem to handle special characters 
> correctly
> --------------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1008
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1008
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Lucene/SOLR connector
>    Affects Versions: ManifoldCF 1.7
>            Reporter: Karl Wright
>            Assignee: Karl Wright
>            Priority: Critical
>             Fix For: ManifoldCF 1.7
>
>
> The Solr Connector, when receiving documents that contain special characters, 
> does not send them to Solr correctly for indexing.  I have verified that the 
> upstream Tika extractor handles these characters correctly, so the problem is 
> either in the Solr Connector or in SolrJ.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
