[
https://issues.apache.org/jira/browse/TIKA-1961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16334123#comment-16334123
]
Andrei Rebegea edited comment on TIKA-1961 at 1/22/18 10:47 AM:
----------------------------------------------------------------
Hello again :) !
I tried upgrading Tika to 1.17, and I also upgraded several other required
libraries:
pdfbox to 2.0.8
poi to 3.17
org.json:json to 20171018
The problem reported in this issue is still present:
{code}
//these last few calls repeat for quite a while....
at org.apache.xmlbeans.impl.store.CharUtil$CharJoin.getChars(CharUtil.java:736)
at org.apache.xmlbeans.impl.store.CharUtil$CharJoin.access$100(CharUtil.java:646)
at org.apache.xmlbeans.impl.store.CharUtil.getChars(CharUtil.java:86)
// ...
at org.apache.xmlbeans.impl.store.CharUtil$CharJoin.getChars(CharUtil.java:736)
at org.apache.xmlbeans.impl.store.CharUtil$CharJoin.access$100(CharUtil.java:646)
at org.apache.xmlbeans.impl.store.CharUtil.getChars(CharUtil.java:86)
at org.apache.xmlbeans.impl.store.CharUtil$CharJoin.getChars(CharUtil.java:736)
at org.apache.xmlbeans.impl.store.CharUtil$CharJoin.access$100(CharUtil.java:646)
at org.apache.xmlbeans.impl.store.CharUtil.getChars(CharUtil.java:86)
at org.apache.xmlbeans.impl.store.CharUtil$CharJoin.getChars(CharUtil.java:736)
at org.apache.xmlbeans.impl.store.CharUtil$CharJoin.access$100(CharUtil.java:646)
at org.apache.xmlbeans.impl.store.CharUtil.getChars(CharUtil.java:86)
at org.apache.xmlbeans.impl.store.CharUtil$CharJoin.getChars(CharUtil.java:736)
at org.apache.xmlbeans.impl.store.CharUtil$CharJoin.access$100(CharUtil.java:646)
at org.apache.xmlbeans.impl.store.CharUtil.getChars(CharUtil.java:86)
at org.apache.xmlbeans.impl.store.CharUtil.saveChars(CharUtil.java:513)
at org.apache.xmlbeans.impl.store.CharUtil.saveChars(CharUtil.java:419)
at org.apache.xmlbeans.impl.store.CharUtil.saveChars(CharUtil.java:489)
at org.apache.xmlbeans.impl.store.Cur$CurLoadContext.text(Cur.java:2927)
at org.apache.xmlbeans.impl.store.Cur$CurLoadContext.stripText(Cur.java:3130)
at org.apache.xmlbeans.impl.store.Cur$CurLoadContext.text(Cur.java:3143)
at org.apache.xmlbeans.impl.store.Locale$SaxHandler.characters(Locale.java:3291)
at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.reportCdata(Piccolo.java:992)
at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseXMLNS(PiccoloLexer.java:1290)
at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseXML(PiccoloLexer.java:1261)
at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.yylex(PiccoloLexer.java:4812)
at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.yylex(Piccolo.java:1290)
at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.yyparse(Piccolo.java:1400)
at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.parse(Piccolo.java:714)
at org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Locale.java:3479)
at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1277)
at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1264)
- locked <0x000000001c78344e> (a org.apache.xmlbeans.impl.store.Locale)
at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:345)
at org.openxmlformats.schemas.drawingml.x2006.spreadsheetDrawing.CTDrawing$Factory.parse(null:-1)
at org.apache.poi.xssf.usermodel.XSSFDrawing.<init>(XSSFDrawing.java:116)
at org.apache.poi.xssf.eventusermodel.XSSFReader$SheetIterator.getShapes(XSSFReader.java:372)
at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.buildXHTML(XSSFExcelExtractorDecorator.java:184)
at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:135)
at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.getXHTML(XSSFExcelExtractorDecorator.java:120)
at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:143)
at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106)
at org.alfresco.repo.content.metadata.TikaPoweredMetadataExtracter.extractRaw(TikaPoweredMetadataExtracter.java:409)
{code}
{code}
2018-01-22 12:24:09,670 DEBUG [content.transform.TransformerDebug] [http-bio-8080-exec-25] 229.1 Failed Java heap space
{code}
Apache Tika 1.17 depends on org.apache.poi » poi-ooxml 3.17, which depends on
org.apache.poi » poi-ooxml-schemas 3.17, which in turn depends on
org.apache.xmlbeans » xmlbeans 2.6.0.
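To double-check which jars actually supply these classes at runtime (for example, if another xmlbeans version is also on the classpath), something like the following could be run inside the affected application. This is only a sketch: the class name WhichJar and the helper locate are made up for illustration, and the three class names passed in are taken from the stack trace above.

```java
import java.security.CodeSource;

public class WhichJar {
    // Returns where a class was loaded from, or a short note if that is unknown.
    // The class is loaded without being initialized, so no static code runs.
    static String locate(String className) {
        try {
            Class<?> c = Class.forName(className, false, WhichJar.class.getClassLoader());
            CodeSource src = c.getProtectionDomain().getCodeSource();
            return src == null ? "(bootstrap or unknown source)" : String.valueOf(src.getLocation());
        } catch (ClassNotFoundException | LinkageError e) {
            return "(not on classpath)";
        }
    }

    public static void main(String[] args) {
        System.out.println("xmlbeans:  " + locate("org.apache.xmlbeans.XmlObject"));
        System.out.println("piccolo:   " + locate("org.apache.xmlbeans.impl.piccolo.xml.Piccolo"));
        System.out.println("poi-ooxml: " + locate("org.apache.poi.xssf.usermodel.XSSFDrawing"));
    }
}
```

If the Piccolo line prints a jar location, the bundled Piccolo classes are present on the classpath.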
The so-called fix for this issue was supposed to remove the tight coupling with
the Piccolo parser, but from a quick look at the code, I don't see that
happening for this case (affected file).
Is there any way a volunteer from Tika could have a look at this file?
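For anyone picking this up, the general hazard described in the issue (a read boundary falling inside a 4-byte UTF-8 sequence) can be shown in isolation. The sketch below is not Piccolo's code and does not reproduce the OutOfMemoryError; it only contrasts decoding two chunks independently with carrying the incomplete bytes over to the next chunk. The class and method names are made up for illustration, and the 6-byte input stands in for the 8192-byte buffer, split in the F09D90-BA pattern after a single-byte character.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

public class ChunkBoundaryDemo {
    // "a" + U+1D43A (the character linked in the issue, F0 9D 90 BA in UTF-8) + "b".
    static final String TEXT = "a" + new String(Character.toChars(0x1D43A)) + "b";
    static final byte[] UTF8 = TEXT.getBytes(StandardCharsets.UTF_8);

    // Decodes each chunk on its own: a sequence cut by the boundary is lost
    // and shows up as U+FFFD replacement characters.
    static String decodeIndependently(byte[] data, int split) {
        String first = new String(data, 0, split, StandardCharsets.UTF_8);
        String second = new String(data, split, data.length - split, StandardCharsets.UTF_8);
        return first + second;
    }

    // Uses one CharsetDecoder for both chunks: the incomplete trailing bytes of
    // the first chunk stay in the buffer and are completed by the second chunk.
    static String decodeWithCarryOver(byte[] data, int split) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        ByteBuffer in = ByteBuffer.allocate(data.length);
        CharBuffer out = CharBuffer.allocate(data.length);
        in.put(data, 0, split);
        in.flip();
        decoder.decode(in, out, false); // more input follows; partial bytes remain unread
        in.compact();                   // keep the unread bytes at the start of the buffer
        in.put(data, split, data.length - split);
        in.flip();
        decoder.decode(in, out, true);  // end of input
        decoder.flush(out);
        out.flip();
        return out.toString();
    }

    public static void main(String[] args) {
        int split = 4; // boundary after 61 F0 9D 90, before BA 62
        System.out.println(TEXT.equals(decodeIndependently(UTF8, split)));
        System.out.println(TEXT.equals(decodeWithCarryOver(UTF8, split)));
    }
}
```

A test along these lines, run against the real 8192-byte boundary in xl/drawings/drawing1.xml, might help confirm whether the parser in use handles the split correctly.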
> OutOfMemory when parsing shapes xml from xlsx files with multi-byte Unicode
> characters
> --------------------------------------------------------------------------------------
>
> Key: TIKA-1961
> URL: https://issues.apache.org/jira/browse/TIKA-1961
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.6
> Reporter: Andrei Rebegea
> Priority: Major
> Fix For: 2.0, 1.13
>
> Attachments: dmsu1332-reproduced.xlsx, problem char separation.png
>
>
> The Piccolo parser used by xmlbeans seems to read xml files in portions of
> 8192 bytes. Problems appear when a portion boundary falls inside a multi-byte
> Unicode character.
> I managed to create a problematic file myself, dmsu1332-reproduced.xlsx.
> Some files got fixed just by opening and saving them in Office 2013, but
> this one is not fixed by the open/save trick without modification.
> The file xl/drawings/drawing1.xml within this xlsx contains a formula. The
> border between 1st and 2nd portions (at 0x2000) crosses the same Unicode
> character in the same way: F09D90-BA.
> I noticed that the character before this multi-byte Unicode character has to
> be a single-byte character. Otherwise a different issue occurs (not
> OutOfMemory, but just a failure to parse the xml file within the xlsx).
> I don't know if this can be reproduced with two- or three-byte Unicode
> characters, or if other split patterns would result in issues (i.e. F0-9D90BA
> and F09D-90BA).
> Problematic char: http://unicode.scarfboy.com/?s=U%2B1d43a
> Finally, this is easier to reproduce with formulas, because each symbol in a
> formula that is automatically typed in italic, such as "a", "x" or
> "dx" (these are two symbols), is represented by a 4-byte Unicode character.
> stack trace:
> java.lang.OutOfMemoryError: Java heap space
> at org.apache.xmlbeans.impl.store.CharUtil.allocate(CharUtil.java:397)
> at org.apache.xmlbeans.impl.store.CharUtil.saveChars(CharUtil.java:506)
> at org.apache.xmlbeans.impl.store.CharUtil.saveChars(CharUtil.java:419)
> at org.apache.xmlbeans.impl.store.CharUtil.saveChars(CharUtil.java:489)
> at org.apache.xmlbeans.impl.store.Cur$CurLoadContext.text(Cur.java:2927)
> at org.apache.xmlbeans.impl.store.Cur$CurLoadContext.stripText(Cur.java:3130)
> at org.apache.xmlbeans.impl.store.Cur$CurLoadContext.text(Cur.java:3143)
> at org.apache.xmlbeans.impl.store.Locale$SaxHandler.characters(Locale.java:3291)
> at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.reportCdata(Piccolo.java:992)
> at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseXMLNS(PiccoloLexer.java:1290)
> at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseXML(PiccoloLexer.java:1261)
> at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.yylex(PiccoloLexer.java:4812)
> at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.yylex(Piccolo.java:1290)
> at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.yyparse(Piccolo.java:1400)
> at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.parse(Piccolo.java:714)
> at org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Locale.java:3479)
> at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1277)
> at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1264)
> at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:345)
> at org.openxmlformats.schemas.drawingml.x2006.spreadsheetDrawing.CTDrawing$Factory.parse(Unknown Source)
> at org.apache.poi.xssf.usermodel.XSSFDrawing.<init>(XSSFDrawing.java:84)
> at org.apache.poi.xssf.eventusermodel.XSSFReader$SheetIterator.getShapes(XSSFReader.java:294)
> at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.buildXHTML(XSSFExcelExtractorDecorator.java:148)
> at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:114)
> at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.getXHTML(XSSFExcelExtractorDecorator.java:94)
> at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
> at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)