[jira] [Commented] (TIKA-1132) Parsing some XLS documents hangs entire JVM, requires kill -9
[ https://issues.apache.org/jira/browse/TIKA-1132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13682628#comment-13682628 ]

Tim Allison commented on TIKA-1132:
-----------------------------------

The Tika GUI took longer than I was willing to wait, too. tika.parseToString() returned a value in about 30 seconds. As you both suggested, the fraction formatter was likely the culprit. I just submitted a patch to POI bug 54686.

Parsing some XLS documents hangs entire JVM, requires kill -9
-------------------------------------------------------------

                Key: TIKA-1132
                URL: https://issues.apache.org/jira/browse/TIKA-1132
            Project: Tika
         Issue Type: Bug
         Components: parser
  Affects Versions: 1.2, 1.3
        Environment: Linux Suse:
                     java version 1.7.0
                     Java(TM) SE Runtime Environment (build 1.7.0-b147)
                     Java HotSpot(TM) 64-Bit Server VM (build 21.0-b17, mixed mode)
                     OSX 10.8.3:
                     java version 1.7.0_06
                     Java(TM) SE Runtime Environment (build 1.7.0_06-b24)
                     Java HotSpot(TM) 64-Bit Server VM (build 23.2-b09, mixed mode)
           Reporter: Ryan Krueger
            Fix For: 1.1
        Attachments: mod3.xlsx, mod.xls

Some XLS documents hang the entire JVM. A Control-C or regular kill won't stop the JVM; a kill -9 is required.

We're running within an email server application, parsing documents to extract the text of all attachments. When we hit a message with the affected attachment, the entire JVM hangs, and we mark the message so that text extraction is skipped for it on the next attempt. Unfortunately, it kills all email processing on the server until the internal watchdogs kill -9 the application.

We have seen the issue for several months with different documents, but they are always Excel files. Some get complaints from Excel when opening, but not all.

In addition to experiencing the problem on our Linux servers, I have tested on OSX and experienced the same problems. Whether I run the Tika UI and select the affected file or run the CLI, the problem is the same. Tested with:

    java -jar /path/to/tika-app-1.3.jar -t /path/to/file.xls

When running on multi-CPU machines there are two threads running at 100% every time.

I have attached a document that triggers the error. I have tested with 1.2 and 1.3 with the same result. With 1.1, the text is extracted accurately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
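The report above describes one stuck extraction blocking all email processing. As a rough illustration only, the following sketch bounds a single extraction with a timeout so the caller can skip the attachment and move on. The class name BoundedExtraction, the method extractWithTimeout, and the fallback string are hypothetical; the Callable is a placeholder for whatever performs the parse (for example a wrapper around tika.parseToString()). One important caveat: cancelling only interrupts the worker thread, and a CPU-bound loop that never checks for interruption will keep running and keep a core busy. Only running the parse in a separate process lets it be killed outright.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper: lets the caller give up on an extraction after a timeout. */
class BoundedExtraction {

    /**
     * Runs the extraction on a daemon worker thread and waits at most timeoutMillis.
     * Returns the fallback if the deadline passes; exceptions from the task propagate.
     */
    static String extractWithTimeout(Callable<String> extraction,
                                     long timeoutMillis,
                                     String fallback) throws Exception {
        // Daemon thread, so a stuck worker does not by itself keep the JVM alive.
        ExecutorService executor = Executors.newSingleThreadExecutor(runnable -> {
            Thread worker = new Thread(runnable, "extraction-worker");
            worker.setDaemon(true);
            return worker;
        });
        Future<String> future = executor.submit(extraction);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Best effort: this interrupts the worker but cannot stop a loop
            // that ignores interruption.
            future.cancel(true);
            return fallback;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a parse that takes far too long.
        Callable<String> stuck = () -> {
            Thread.sleep(60_000);
            return "extracted text";
        };
        System.out.println(extractWithTimeout(stuck, 200, "<skipped: extraction timed out>"));
    }
}
```

The caller gets control back after the timeout, which is enough to mark the message as skipped, but the design does not reclaim the CPU from a runaway worker.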
[jira] [Commented] (TIKA-1132) Parsing some XLS documents hangs entire JVM, requires kill -9
[ https://issues.apache.org/jira/browse/TIKA-1132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13680747#comment-13680747 ]

Ryan Krueger commented on TIKA-1132:
------------------------------------

Running jvisualvm and pulling a thread dump I get the same trace each time:

main prio=10 tid=0x00606800 nid=0x7799 runnable [0x7fe26bf1d000]
   java.lang.Thread.State: RUNNABLE
	at org.apache.poi.ss.usermodel.DataFormatter$FractionFormat.format(DataFormatter.java:1009)
	at org.apache.poi.ss.usermodel.DataFormatter$FractionFormat.format(DataFormatter.java:1033)
	at java.text.Format.format(Format.java:157)
	at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents(DataFormatter.java:699)
	at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents(DataFormatter.java:669)
	at org.apache.poi.hssf.eventusermodel.FormatTrackingHSSFListener.formatNumberDateCell(FormatTrackingHSSFListener.java:129)
	at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.internalProcessRecord(ExcelExtractor.java:419)
	at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processRecord(ExcelExtractor.java:323)
	at org.apache.poi.hssf.eventusermodel.FormatTrackingHSSFListener.processRecord(FormatTrackingHSSFListener.java:82)
	at org.apache.poi.hssf.eventusermodel.HSSFRequest.processRecord(HSSFRequest.java:112)
	at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents(HSSFEventFactory.java:147)
	at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents(HSSFEventFactory.java:106)
	at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile(ExcelExtractor.java:299)
	at org.apache.tika.parser.microsoft.ExcelExtractor.parse(ExcelExtractor.java:151)
	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:194)
	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
	at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:139)
	at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:415)
	at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:109)

Looking at POI 3.8 in grepcode I see the affected code. The methods appear to be unchanged in 3.9. I don't know what's causing the issue as it doesn't immediately appear to me to be an infinite loop. Here is the apparent section from org.apache.poi.ss.usermodel.DataFormatter.

1005    double minVal = 1.0;
1006    double currDenom = Math.pow(10, fractParts[1].length()) - 1d;
1007    double currNeum = 0;
1008    for (int i = (int)(Math.pow(10, fractParts[1].length()) - 1d); i > 0; i--) {
1009        for (int i2 = (int)(Math.pow(10, fractParts[1].length()) - 1d); i2 > 0; i2--) {
1010            if (minVal >= Math.abs((double)i2/(double)i - decPart)) {
1011                currDenom = i;
1012                currNeum = i2;
1013                minVal = Math.abs((double)i2/(double)i - decPart);
1014            }
1015        }
1016    }
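The nested loops in the excerpt are finite, but their cost grows very quickly. A minimal sketch of the arithmetic, assuming fractParts[1] is the denominator part of the cell's fraction format, so that its length is the number of denominator digits: both loops count down from 10^n - 1 to 1, so the inner body runs (10^n - 1)^2 times. The class and method names below are hypothetical and the code is not part of POI; it only reproduces the loop bound.

```java
import java.math.BigInteger;

/** Hypothetical helper: estimates how often the inner loop body of the excerpt runs. */
class FractionLoopCost {

    /** Mirrors the loop bound in the excerpt: (int)(Math.pow(10, n) - 1d). */
    static int loopBound(int denomDigits) {
        // A double-to-int cast in Java saturates at Integer.MAX_VALUE rather than wrapping.
        return (int) (Math.pow(10, denomDigits) - 1d);
    }

    /** Both loops start at the same bound and stop at 1, so the body runs bound * bound times. */
    static BigInteger innerIterations(int denomDigits) {
        BigInteger bound = BigInteger.valueOf(loopBound(denomDigits));
        return bound.multiply(bound);
    }

    public static void main(String[] args) {
        for (int digits : new int[] {1, 2, 3, 5, 10}) {
            System.out.printf("%2d denominator digits -> bound %d, %s inner iterations%n",
                    digits, loopBound(digits), innerIterations(digits));
        }
    }
}
```

With one denominator digit the body runs 81 times. With five digits it is on the order of 10^10 iterations, each doing a floating-point division. With ten digits the bound saturates at 2147483647, giving roughly 4.6 * 10^18 iterations, which would look like an infinite loop in practice.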
[jira] [Commented] (TIKA-1132) Parsing some XLS documents hangs entire JVM, requires kill -9
[ https://issues.apache.org/jira/browse/TIKA-1132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13680800#comment-13680800 ]

Nick Burch commented on TIKA-1132:
----------------------------------

Thanks for the test file. There's an open bug in POI about fraction formatting; it might be the same thing. I'll hopefully be able to take a look in the next few days, other work permitting.
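For context on the fraction formatting problem, here is a hedged sketch of one well-known way to find a nearby fraction without an exhaustive search: continued-fraction convergents, which need only a handful of steps. This is not taken from POI and is not a claim about what any POI patch does; the class FractionApprox, the method approximate, and the stopping tolerance are all assumptions made for illustration. It expects a value in [0, 1) and a modest denominator limit.

```java
/** Hypothetical illustration: approximates a decimal part by a fraction using convergents. */
class FractionApprox {

    /**
     * Returns {numerator, denominator} of the last continued-fraction convergent
     * of x whose denominator does not exceed maxDenom. Assumes 0 <= x < 1 and a
     * maxDenom small enough that the long arithmetic cannot overflow.
     */
    static long[] approximate(double x, long maxDenom) {
        long prevNum = 0, prevDen = 1;   // convergent h(k-2) / k(k-2)
        long num = 1, den = 0;           // convergent h(k-1) / k(k-1)
        double rest = x;
        for (int step = 0; step < 64; step++) {
            long term = (long) Math.floor(rest);
            long nextNum = term * num + prevNum;
            long nextDen = term * den + prevDen;
            if (nextDen > maxDenom) {
                break;                   // the next convergent would be too fine
            }
            prevNum = num; prevDen = den;
            num = nextNum; den = nextDen;
            double frac = rest - term;
            if (frac < 1e-12) {
                break;                   // x is represented (almost) exactly
            }
            rest = 1.0 / frac;
        }
        return new long[] {num, den};
    }

    public static void main(String[] args) {
        long[] result = approximate(0.75, 99);
        System.out.println(result[0] + "/" + result[1]);
    }
}
```

Stopping at the last convergent that fits does not always give the closest possible fraction under the limit, so a production formatter would need more care, but the work per cell stays small regardless of how many denominator digits the format allows.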
[jira] [Commented] (TIKA-1132) Parsing some XLS documents hangs entire JVM, requires kill -9
[ https://issues.apache.org/jira/browse/TIKA-1132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13680021#comment-13680021 ]

Nick Burch commented on TIKA-1132:
----------------------------------

I can confirm that it goes into an infinite loop for me too.

Any chance that you could run it in a profiler or similar, and track down where the loop is happening? (My hunch is it'll be an edge case in POI / POI not handling a subtle form of corruption.)
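When attaching a profiler is not convenient, a thread dump taken from inside the process can also show where a loop is spinning. A minimal sketch, assuming some other thread in the application (for example a watchdog) is still able to run; the class and method names are hypothetical. It prints the stack of every thread currently in the RUNNABLE state using the standard Thread.getAllStackTraces() call. Taking several dumps a few seconds apart and seeing the same top frames each time points at the hot loop.

```java
import java.util.Map;

/** Hypothetical helper: collects the stacks of all RUNNABLE threads in this JVM. */
class RunnableThreadDump {

    static String dumpRunnableThreads() {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
            Thread thread = entry.getKey();
            if (thread.getState() != Thread.State.RUNNABLE) {
                continue; // blocked or waiting threads are not burning CPU
            }
            out.append('"').append(thread.getName()).append("\" state=")
               .append(thread.getState()).append('\n');
            for (StackTraceElement frame : entry.getValue()) {
                out.append("    at ").append(frame).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(dumpRunnableThreads());
    }
}
```

From outside the process, the JDK's jstack tool run against the process id gives a comparable dump.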