Allow controlling an important Tika PDF processing parameter that affects how
words in text are split and is now supported in version 1.0 of Tika.
-----------------------------------------------------------------------------------------------------------------------------------------
Key: SOLR-2930
URL: https://issues.apache.org/jira/browse/SOLR-2930
Project: Solr
Issue Type: Improvement
Components: contrib - Solr Cell (Tika extraction)
Affects Versions: 3.5
Reporter: Ravish Bhagdev
Tika 1.0 fixes a major issue in the parsing of PDF files that caused words to
be split incorrectly:
https://issues.apache.org/jira/browse/TIKA-724
The incorrect splitting causes text to be indexed incorrectly in Solr, and it
becomes especially visible when using features such as spellcheck.
They have added a special parameter, set using setEnableAutoSpace, that fixes
the problem, but there is currently no way of setting it when using Solr. As
discussed in the thread on the above issue, it would be nice if we could
control this (and, in future, other) parameters via Solr configuration.
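
To make the request concrete, here is a minimal sketch of one way configuration
values could be mapped onto parser setters such as setEnableAutoSpace. It is
not existing Solr or Tika code: StubPdfParser is a hypothetical stand-in for
the real Tika parser, and applyBooleanProperties and the "enableAutoSpace"
property name are assumptions made up for illustration.

```java
import java.lang.reflect.Method;
import java.util.LinkedHashMap;
import java.util.Map;

public class ParserConfigSketch {

    /** Hypothetical stand-in for a Tika parser; NOT the real PDF parser class. */
    public static class StubPdfParser {
        // Default chosen arbitrarily for this stub.
        private boolean enableAutoSpace = true;

        public void setEnableAutoSpace(boolean value) {
            this.enableAutoSpace = value;
        }

        public boolean isEnableAutoSpace() {
            return enableAutoSpace;
        }
    }

    /**
     * Hypothetical helper: for each "name" -> "value" entry, looks up a public
     * setName(boolean) method on the target and invokes it with the parsed value.
     * Throws NoSuchMethodException if the target has no matching setter.
     */
    public static void applyBooleanProperties(Object target, Map<String, String> props)
            throws Exception {
        for (Map.Entry<String, String> entry : props.entrySet()) {
            String name = entry.getKey();
            String setter = "set" + Character.toUpperCase(name.charAt(0)) + name.substring(1);
            Method method = target.getClass().getMethod(setter, boolean.class);
            method.invoke(target, Boolean.parseBoolean(entry.getValue()));
        }
    }

    public static void main(String[] args) throws Exception {
        // Values as they might be read from a configuration file (assumed names).
        Map<String, String> props = new LinkedHashMap<>();
        props.put("enableAutoSpace", "false");

        StubPdfParser parser = new StubPdfParser();
        applyBooleanProperties(parser, props);
        System.out.println("enableAutoSpace = " + parser.isEnableAutoSpace());
    }
}
```

Looking setters up by name would let other boolean parser parameters be
exposed later without further code changes, which is the "in future other"
case mentioned above. Non-boolean parameters would need extra type handling.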