[jira] [Resolved] (SOLR-2930) Allow controlling an important PDF processing parameter in Tika that splits the words in text and is now supported in version 1.0 of Tika.

Eric Pugh (Jira) Mon, 01 Dec 2025 12:31:29 -0800


     [ 
https://issues.apache.org/jira/browse/SOLR-2930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Eric Pugh resolved SOLR-2930.
-----------------------------
    Resolution: Won't Fix

In Solr 10 we are leveraging either Tika Server (running in its own separate 
server process) or maybe Tika Pipes (again, running in a separate JVM), which 
means this type of change probably won't happen.

> Allow controlling an important PDF processing parameter in Tika that splits 
> the words in text and is now supported in version 1.0 of Tika.
> -----------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-2930
>                 URL: https://issues.apache.org/jira/browse/SOLR-2930
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - Solr Cell (Tika extraction)
>    Affects Versions: 3.5
>            Reporter: Ravish Bhagdev
>            Priority: Major
>              Labels: pdf, text-splitting, tika
>
> Tika 1.0 has fixed a major issue with processing and parsing of PDF files 
> that was splitting the words incorrectly: 
> https://issues.apache.org/jira/browse/TIKA-724
> This causes text to be indexed incorrectly in Solr, and it becomes especially 
> visible when using spellcheck features etc.
> They have added a special parameter, set using setEnableAutoSpace, that fixes 
> the problem, but there is currently no way of setting this when using Solr.
> As discussed in the thread on the above issue, it would be nice if we could 
> control this (and in future other) parameters via Solr configuration.
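
Since the resolution above points to Tika Server running in a separate process, the parameter would be set on the request to that server rather than through Solr configuration. The sketch below is only an illustration of that idea, not how Solr does it. It assumes a Tika Server listening on http://localhost:9998 and assumes the server accepts an X-Tika-PDFenableAutoSpace header, following Tika Server's X-Tika-PDF header prefix convention for PDF parser settings. The header name, the class name and the helper name are assumptions to verify against the Tika version in use. The code only builds the request and does not send it.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class TikaAutoSpaceSketch {

    // Hypothetical helper: builds (but does not send) a PUT request to the
    // /tika endpoint of a Tika Server, asking for plain-text extraction.
    static HttpRequest buildExtractRequest(URI serverBase, byte[] pdfBytes, boolean enableAutoSpace) {
        return HttpRequest.newBuilder(serverBase.resolve("/tika"))
                .header("Accept", "text/plain")
                .header("Content-Type", "application/pdf")
                // Assumed header name: "X-Tika-PDF" prefix plus the PDF parser
                // property "enableAutoSpace". Check your Tika Server version.
                .header("X-Tika-PDFenableAutoSpace", Boolean.toString(enableAutoSpace))
                .PUT(HttpRequest.BodyPublishers.ofByteArray(pdfBytes))
                .build();
    }

    public static void main(String[] args) {
        // Assumed server location; an empty body stands in for real PDF bytes.
        HttpRequest request = buildExtractRequest(
                URI.create("http://localhost:9998"), new byte[0], true);
        System.out.println(request.method() + " " + request.uri());
        request.headers().map().forEach((name, values) ->
                System.out.println(name + ": " + values));
    }
}
```

To actually run the extraction, the request would be passed to java.net.http.HttpClient#send with a string body handler, and the returned text would then be indexed.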



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

