[jira] [Commented] (TIKA-1961) OutOfMemory when parsing shapes xml from xlsx files with multi-byte Unicode characters

Tim Allison (JIRA) Wed, 27 Apr 2016 09:04:02 -0700

    [ https://issues.apache.org/jira/browse/TIKA-1961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15260372#comment-15260372 ]


Tim Allison commented on TIKA-1961:
-----------------------------------

Got it.  Will add to our unit tests.  Thank you.  

bq. that we patched locally to not parse the shapes xml anymore
Beware: as you know, since you found the source of the problem, this can 
happen with other xml within docx, pptx, xlsx.  It might happen more often in 
shapes, but it can still happen anywhere the piccolo parser is used. Yes, 
definitely add "upgrade to >= 1.13" to your roadmap in Alfresco. 

Thank you, again.
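
To check whether other xml parts of a given docx, pptx or xlsx hit the same condition, something like the following sketch could be used. It is not part of Tika or POI; the class name ChunkBoundaryScanner and its methods are hypothetical, the 8192-byte portion size is taken from the report below, and it assumes each xml part is valid UTF-8. It only flags offsets where a portion boundary lands on a UTF-8 continuation byte; it does not predict whether the parser will actually fail there.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of Tika, POI or xmlbeans.
final class ChunkBoundaryScanner {
    // Portion size reported in the issue; treated as an assumption here.
    static final int CHUNK = 8192;

    /** Returns offsets that are multiples of chunk and fall inside a multi-byte UTF-8 sequence. */
    static List<Integer> splitOffsets(byte[] xml, int chunk) {
        List<Integer> hits = new ArrayList<>();
        for (int off = chunk; off < xml.length; off += chunk) {
            // A UTF-8 continuation byte has the bit pattern 10xxxxxx, so the
            // character it belongs to started before this boundary.
            if ((xml[off] & 0xC0) == 0x80) {
                hits.add(off);
            }
        }
        return hits;
    }

    /** Scans every *.xml entry of an OOXML (zip) stream and lists the boundaries that split a character. */
    static List<String> scan(InputStream ooxml, int chunk) throws IOException {
        List<String> report = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(ooxml)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.getName().endsWith(".xml")) {
                    continue;
                }
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                byte[] tmp = new byte[4096];
                int n;
                while ((n = zip.read(tmp)) != -1) {
                    buf.write(tmp, 0, n);
                }
                for (int off : splitOffsets(buf.toByteArray(), chunk)) {
                    report.add(entry.getName() + " @ 0x" + Integer.toHexString(off));
                }
            }
        }
        return report;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to a docx, pptx or xlsx file.
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            scan(in, CHUNK).forEach(System.out::println);
        }
    }
}
```

Run against a suspect file, it prints one line per xml part and boundary offset where a character is split; an empty result only means no character straddles a multiple of 8192 bytes.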

> OutOfMemory when parsing shapes xml from xlsx files with multi-byte Unicode 
> characters
> --------------------------------------------------------------------------------------
>
>                 Key: TIKA-1961
>                 URL: https://issues.apache.org/jira/browse/TIKA-1961
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.6
>            Reporter: Andrei Rebegea
>         Attachments: dmsu1332-reproduced.xlsx, problem char separation.png
>
>
> The Piccolo parser used by xmlbeans seems to read xml files in portions of 
> 8192 bytes. Problems appear when a portion boundary falls inside a 
> multi-byte Unicode character.
> I managed to create a problematic file myself, dmsu1332-reproduced.xlsx. 
> Some files got fixed just by opening and saving the files in Office 2013 but 
> this one doesn't get fixed by the trick with open/save without modification. 
> The file xl/drawings/drawing1.xml within this xlsx contains a formula. The 
> border between 1st and 2nd portions (at 0x2000) crosses the same Unicode 
> character in the same way: F09D90-BA.
> I noticed that the character before this multi-byte Unicode character must 
> be a single-byte character. Otherwise a different issue occurs (not 
> OutOfMemory, but just a failure to parse the xml file within the xlsx).
> I don't know if this can be reproduced with two- or three-byte Unicode 
> characters, or if other split patterns would also cause issues (e.g. 
> F0-9D90BA and F09D-90BA).
> Problematic char: http://unicode.scarfboy.com/?s=U%2B1d43a
> Finally, it is easier to reproduce with formulas, because each symbol in a 
> formula that is automatically typed in italic, such as "a", "x" or "dx" 
> (these are two symbols), is represented by a 4-byte Unicode character.
> stack trace:
> java.lang.OutOfMemoryError: Java heap space
>         at org.apache.xmlbeans.impl.store.CharUtil.allocate(CharUtil.java:397)
>         at org.apache.xmlbeans.impl.store.CharUtil.saveChars(CharUtil.java:506)
>         at org.apache.xmlbeans.impl.store.CharUtil.saveChars(CharUtil.java:419)
>         at org.apache.xmlbeans.impl.store.CharUtil.saveChars(CharUtil.java:489)
>         at org.apache.xmlbeans.impl.store.Cur$CurLoadContext.text(Cur.java:2927)
>         at org.apache.xmlbeans.impl.store.Cur$CurLoadContext.stripText(Cur.java:3130)
>         at org.apache.xmlbeans.impl.store.Cur$CurLoadContext.text(Cur.java:3143)
>         at org.apache.xmlbeans.impl.store.Locale$SaxHandler.characters(Locale.java:3291)
>         at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.reportCdata(Piccolo.java:992)
>         at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseXMLNS(PiccoloLexer.java:1290)
>         at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseXML(PiccoloLexer.java:1261)
>         at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.yylex(PiccoloLexer.java:4812)
>         at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.yylex(Piccolo.java:1290)
>         at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.yyparse(Piccolo.java:1400)
>         at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.parse(Piccolo.java:714)
>         at org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Locale.java:3479)
>         at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1277)
>         at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1264)
>         at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:345)
>         at org.openxmlformats.schemas.drawingml.x2006.spreadsheetDrawing.CTDrawing$Factory.parse(Unknown Source)
>         at org.apache.poi.xssf.usermodel.XSSFDrawing.<init>(XSSFDrawing.java:84)
>         at org.apache.poi.xssf.eventusermodel.XSSFReader$SheetIterator.getShapes(XSSFReader.java:294)
>         at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.buildXHTML(XSSFExcelExtractorDecorator.java:148)
>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:114)
>         at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.getXHTML(XSSFExcelExtractorDecorator.java:94)
>         at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>         at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)
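
The split described in the quoted report (F09D90-BA at a portion boundary) can be illustrated with a small standalone sketch. It does not reproduce Piccolo's code or the OutOfMemoryError; it only shows, under the assumption that a reader decodes each portion on its own, why U+1D43A gets corrupted, and how a single CharsetDecoder carried across the boundary avoids that. The class name SplitUtf8Demo and its methods are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it does not use or imitate xmlbeans internals.
final class SplitUtf8Demo {

    /** Decodes each portion independently, ignoring that a character may straddle the boundary. */
    static String decodeNaively(byte[] data, int split) {
        String head = new String(data, 0, split, StandardCharsets.UTF_8);
        String tail = new String(data, split, data.length - split, StandardCharsets.UTF_8);
        return head + tail;
    }

    /** Decodes the same two portions with one decoder that keeps leftover bytes between calls. */
    static String decodeStreaming(byte[] data, int split) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        CharBuffer out = CharBuffer.allocate(data.length);
        ByteBuffer in = ByteBuffer.allocate(data.length);

        in.put(data, 0, split).flip();
        decoder.decode(in, out, false); // an incomplete trailing sequence stays unread in 'in'
        in.compact();                   // move the unread bytes to the front of the buffer

        in.put(data, split, data.length - split).flip();
        decoder.decode(in, out, true);
        decoder.flush(out);

        out.flip();
        return out.toString();
    }

    public static void main(String[] args) {
        // "x", then U+1D43A encoded as F0 9D 90 BA, then "y".
        byte[] data = {'x', (byte) 0xF0, (byte) 0x9D, (byte) 0x90, (byte) 0xBA, 'y'};
        int split = 4; // boundary between 0x90 and 0xBA, as in F09D90-BA

        String expected = new String(data, StandardCharsets.UTF_8);
        System.out.println("naive matches:     " + decodeNaively(data, split).equals(expected));
        System.out.println("streaming matches: " + decodeStreaming(data, split).equals(expected));
    }
}
```

With the naive variant the bytes on either side of the boundary are each replaced by U+FFFD, so the original character is lost; the streaming variant returns the original text.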



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
