[jira] [Comment Edited] (TIKA-1961) OutOfMemory when parsing shapes xml from xlsx files with multi-byte Unicode characters

Tim Allison (JIRA) Wed, 27 Apr 2016 08:36:36 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-1961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15260325#comment-15260325
 ]


Tim Allison edited comment on TIKA-1961 at 4/27/16 3:35 PM:
------------------------------------------------------------

I _think_ [~kiwiwings] recently fixed this in POI by swapping out the piccolo 
parser.  If you could try with a nightly build of Tika (from, say, 
[jenkins|https://builds.apache.org/job/tika-trunk-jdk1.7/973/org.apache.tika$tika-app/])
 and let us know if this is still happening, that'd be great!

Thank you for opening this issue and tracking down the cause of the problem.  I 
just gave up and tried a different parser. :)


was (Author: [email protected]):
I _think_ [~kiwiwings] recently fixed this in POI by swapping out the piccolo 
parser.  If you could try with a nightly build of Tika (from, say, 
[jenkins|https://builds.apache.org/job/tika-trunk-jdk1.7/973/org.apache.tika$tika-app/]
 and let us know if this is still happening, that'd be great!

Thank you for opening this issue and tracking down the source.

> OutOfMemory when parsing shapes xml from xlsx files with multi-byte Unicode 
> characters
> --------------------------------------------------------------------------------------
>
>                 Key: TIKA-1961
>                 URL: https://issues.apache.org/jira/browse/TIKA-1961
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.6
>            Reporter: Andrei Rebegea
>         Attachments: dmsu1332-reproduced.xlsx, problem char separation.png
>
>
> The Piccolo parser used by xmlbeans seems to read xml files in chunks of 
> 8192 bytes. Problems appear when a chunk boundary falls inside a multi-byte 
> Unicode character.
> I managed to create a problematic file myself, dmsu1332-reproduced.xlsx. 
> Some files were fixed just by opening and saving them in Office 2013, but 
> this one is not fixed by opening and saving it without modification. 
> The file xl/drawings/drawing1.xml within this xlsx contains a formula. The 
> boundary between the 1st and 2nd chunks (at 0x2000) splits the same Unicode 
> character in the same way: F09D90-BA.
> I noticed that the character before this multi-byte Unicode character must 
> be a single-byte character. Otherwise a different issue occurs (not 
> OutOfMemory, just a failure to parse the xml file within the xlsx).
> I don't know whether this can be reproduced with two- or three-byte Unicode 
> characters, or whether other split patterns would cause issues (e.g. 
> F0-9D90BA and F09D-90BA).
> Problematic character: http://unicode.scarfboy.com/?s=U%2B1d43a
> Finally, this is easier to reproduce with formulas, because each symbol in a 
> formula that is automatically set in italic, such as "a", "x" or "dx" (which 
> is two symbols), is represented by a 4-byte Unicode character.
> Stack trace:
> java.lang.OutOfMemoryError: Java heap space
>         at org.apache.xmlbeans.impl.store.CharUtil.allocate(CharUtil.java:397)
>         at org.apache.xmlbeans.impl.store.CharUtil.saveChars(CharUtil.java:506)
>         at org.apache.xmlbeans.impl.store.CharUtil.saveChars(CharUtil.java:419)
>         at org.apache.xmlbeans.impl.store.CharUtil.saveChars(CharUtil.java:489)
>         at org.apache.xmlbeans.impl.store.Cur$CurLoadContext.text(Cur.java:2927)
>         at org.apache.xmlbeans.impl.store.Cur$CurLoadContext.stripText(Cur.java:3130)
>         at org.apache.xmlbeans.impl.store.Cur$CurLoadContext.text(Cur.java:3143)
>         at org.apache.xmlbeans.impl.store.Locale$SaxHandler.characters(Locale.java:3291)
>         at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.reportCdata(Piccolo.java:992)
>         at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseXMLNS(PiccoloLexer.java:1290)
>         at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseXML(PiccoloLexer.java:1261)
>         at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.yylex(PiccoloLexer.java:4812)
>         at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.yylex(Piccolo.java:1290)
>         at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.yyparse(Piccolo.java:1400)
>         at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.parse(Piccolo.java:714)
>         at org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Locale.java:3479)
>         at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1277)
>         at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1264)
>         at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:345)
>         at org.openxmlformats.schemas.drawingml.x2006.spreadsheetDrawing.CTDrawing$Factory.parse(Unknown Source)
>         at org.apache.poi.xssf.usermodel.XSSFDrawing.<init>(XSSFDrawing.java:84)
>         at org.apache.poi.xssf.eventusermodel.XSSFReader$SheetIterator.getShapes(XSSFReader.java:294)
>         at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.buildXHTML(XSSFExcelExtractorDecorator.java:148)
>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:114)
>         at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.getXHTML(XSSFExcelExtractorDecorator.java:94)
>         at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>         at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)
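
The chunk-boundary split described in the quoted report can be illustrated with a minimal Java sketch. It assumes a reader that decodes each 8192-byte chunk on its own; this is not Piccolo's actual code, and it does not reproduce the OutOfMemoryError. It only shows how a boundary at 0x2000 that splits U+1D43A as F09D90-BA corrupts the decoded text, while a streaming decoder carries the partial sequence across the boundary. The class and method names (ChunkBoundaryDemo, buildInput, decodePerChunk, decodeStreaming) are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ChunkBoundaryDemo {
    // 0x2000, the chunk size reported in the issue.
    static final int CHUNK = 8192;

    // 8189 ASCII bytes followed by U+1D43A (UTF-8: F0 9D 90 BA), so the
    // boundary at offset 0x2000 falls between F0 9D 90 and BA.
    static byte[] buildInput() {
        byte[] data = new byte[CHUNK + 1];
        Arrays.fill(data, 0, CHUNK - 3, (byte) 'a');
        byte[] ch = new String(Character.toChars(0x1D43A))
                .getBytes(StandardCharsets.UTF_8);
        System.arraycopy(ch, 0, data, CHUNK - 3, ch.length);
        return data;
    }

    // Assumed faulty strategy: decode every chunk independently, so a
    // multi-byte sequence cut by the boundary is malformed on both sides.
    static String decodePerChunk(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (int off = 0; off < data.length; off += CHUNK) {
            int len = Math.min(CHUNK, data.length - off);
            sb.append(new String(data, off, len, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    // Streaming strategy: InputStreamReader keeps undecoded trailing bytes
    // between reads, so the split character is reassembled.
    static String decodeStreaming(byte[] data) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8)) {
            char[] buf = new char[CHUNK];
            int n;
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] data = buildInput();
        String perChunk = decodePerChunk(data);
        String streaming = decodeStreaming(data);
        System.out.println("per-chunk contains U+FFFD: "
                + (perChunk.indexOf('\uFFFD') >= 0));
        System.out.println("streaming code point at 8189: U+"
                + Integer.toHexString(streaming.codePointAt(8189)).toUpperCase());
    }
}
```

Moving the fill length in buildInput by one or two bytes gives the other split patterns mentioned in the report (F0-9D90BA and F09D-90BA), which may help when checking a nightly build against variations of the attached file.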



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
