[ 
https://issues.apache.org/jira/browse/SOLR-2930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Pugh resolved SOLR-2930.
-----------------------------
    Resolution: Won't Fix

In Solr 10 we are leveraging either Tika Server (running in its own separate 
server process) or maybe Tika Pipes (again, running in a separate JVM), which 
means this type of change probably won't happen.

> Allow controlling an important PDF processing parameter in Tika that splits 
> the words in text and is now supported in version 1.0 of Tika.
> -----------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-2930
>                 URL: https://issues.apache.org/jira/browse/SOLR-2930
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - Solr Cell (Tika extraction)
>    Affects Versions: 3.5
>            Reporter: Ravish Bhagdev
>            Priority: Major
>              Labels: pdf, text-splitting, tika
>
> Tika 1.0 has fixed a major issue with processing and parsing of PDF files 
> that was splitting the words incorrectly: 
> https://issues.apache.org/jira/browse/TIKA-724
> The incorrect splitting causes text to be indexed incorrectly in Solr, which 
> becomes especially visible when using features such as spellcheck.
> They have added a special parameter, set using setEnableAutoSpace, that fixes 
> the problem, but there is currently no way of setting it when using Solr.
> As discussed in the thread on the above issue, it would be nice if we could 
> control this parameter (and, in future, others) via Solr configuration.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

