[ 
https://issues.apache.org/jira/browse/SOLR-7764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14620422#comment-14620422
 ] 

Sorin Gheorghiu commented on SOLR-7764:
---------------------------------------

After more testing, it turns out that this is not a Tika- or XML-related 
issue, and the stack trace is NOT related to the hang.

1) I removed the XLSX file from the index list (actually, I temporarily 
deleted it in MediaWiki). The Tika error still occurred, but indexing did not 
hang at this point. It seems that no error is reported when indexing hangs 
permanently on this file (!).

2) Indexing also hangs on a second XLSX file, but this time with the 
following error:

ERROR
PDCIDFont
Error: Could not parse predefined CMAP file for 'é.5s¢-á.?null³!null¯-UCS2'

After I remove both files, indexing completes successfully.

As you guessed, the information in the files is private; I am allowed to share 
them, but not to post them publicly. 
Could you please provide an email address so that I can send them to you 
directly?

This issue is related to the newer Solr version: the same files were indexed 
properly before the upgrade from 4.5.0 to 4.7.2.

3) It is worth mentioning another difference between the versions. 
A long time ago, the docx and xlsx files were migrated without a proper 
content type, so they were recognized as ZIP files (that's fine).

In 4.5.0, ExtendedSearch.log reports (the log is in German; "Indexiere 
hochgeladene Dateien" means "Indexing uploaded files"):
3940: Indexiere hochgeladene Dateien: 8% - Filetype not allowed: zip (AtyponJR1_2011.xlsx)

In 4.7.2, ExtendedSearchIndex.log (note the different log name) shows that the 
same file is no longer recognized as a ZIP archive, although it should be, 
since the files are identical:
4117: Indexiere hochgeladene Dateien: 9% - AtyponJR1_2011.xlsx
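
To check independently of Solr and Tika what kind of container such a file 
really is, something like the following minimal Python sketch could be used. 
It is not part of Solr, Tika or BlueSpice; the function name 
classify_container is hypothetical, and the test for a [Content_Types].xml 
part is only a heuristic for OOXML packages (docx, xlsx), which are ZIP 
archives underneath.

```python
import io
import zipfile


def classify_container(data: bytes) -> str:
    """Return 'ooxml', 'zip', or 'other' for the given file bytes.

    Hypothetical helper for illustration only; it does not reproduce
    how Solr, Tika or BlueSpice detect file types.
    """
    buffer = io.BytesIO(data)
    if not zipfile.is_zipfile(buffer):
        return "other"
    buffer.seek(0)
    with zipfile.ZipFile(buffer) as archive:
        names = set(archive.namelist())
    # Heuristic: OOXML packages carry a [Content_Types].xml part at the
    # root of the ZIP archive. Other package formats may do so as well.
    if "[Content_Types].xml" in names:
        return "ooxml"
    return "zip"
```

Called with the bytes of a file such as AtyponJR1_2011.xlsx, this would show 
whether the file is a plain ZIP archive or looks like an OOXML package, 
regardless of the content type stored for it in the wiki.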


> Solr indexing hangs if encounters an certain XML parse error
> ------------------------------------------------------------
>
>                 Key: SOLR-7764
>                 URL: https://issues.apache.org/jira/browse/SOLR-7764
>             Project: Solr
>          Issue Type: Bug
>          Components: query parsers
>    Affects Versions: 4.7.2
>         Environment: Ubuntu 12.04.5 LTS
>            Reporter: Sorin Gheorghiu
>              Labels: indexing
>         Attachments: Solr_XML_parse_error_080715.txt
>
>
> BlueSpice (http://bluespice.com/) uses Solr to index documents for the 
> 'Extended search' feature.
> Solr hangs if a certain error occurs during indexing:
> 8.7.2015 15:34:26
> ERROR
> SolrCore
> org.apache.solr.common.SolrException: 
> org.apache.tika.exception.TikaException: XML parse error
> 8.7.2015 15:34:26
> ERROR
> SolrDispatchFilter
> null:org.apache.solr.common.SolrException: 
> org.apache.tika.exception.TikaException: XML parse error



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

