[
https://issues.apache.org/jira/browse/SOLR-2930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Eric Pugh resolved SOLR-2930.
-----------------------------
Resolution: Won't Fix
In Solr 10 we are leveraging either Tika Server (running in its own separate
server process) or maybe Tika Pipes (again, running in a separate JVM), which
means this type of change probably won't happen.
> Allow controlling an important PDF processing parameter in Tika that splits
> the words in text and is now supported in version 1.0 of Tika.
> -----------------------------------------------------------------------------------------------------------------------------------------
>
> Key: SOLR-2930
> URL: https://issues.apache.org/jira/browse/SOLR-2930
> Project: Solr
> Issue Type: Improvement
> Components: contrib - Solr Cell (Tika extraction)
> Affects Versions: 3.5
> Reporter: Ravish Bhagdev
> Priority: Major
> Labels: pdf, text-splitting, tika
>
> Tika 1.0 has fixed a major issue with processing and parsing of PDF files
> that was splitting the words incorrectly:
> https://issues.apache.org/jira/browse/TIKA-724
> This causes text to be indexed incorrectly in Solr, which becomes especially
> visible when using spellcheck features etc.
> They have added a special parameter, set using setEnableAutoSpace, that fixes
> the problem, but there is currently no way of setting this when using Solr.
> As discussed in the thread on the above issue, it would be nice if we could
> control this parameter (and, in future, others) via Solr configuration.