[jira] [Commented] (TIKA-1132) Parsing some XLS documents hangs entire JVM, requires kill -9

2013-06-13 Thread Tim Allison (JIRA)

[ https://issues.apache.org/jira/browse/TIKA-1132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13682628#comment-13682628 ]

Tim Allison commented on TIKA-1132:
---

The Tika GUI took longer than I was willing to wait, too.  tika.parseToString() 
returned a value in about 30 seconds.  As you both suggested, the fraction 
formatter was likely the culprit.  I just submitted a patch to POI issue 54686.

 Parsing some XLS documents hangs entire JVM, requires kill -9
 -

 Key: TIKA-1132
 URL: https://issues.apache.org/jira/browse/TIKA-1132
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.2, 1.3
 Environment: Linux Suse:
 java version 1.7.0
 Java(TM) SE Runtime Environment (build 1.7.0-b147)
 Java HotSpot(TM) 64-Bit Server VM (build 21.0-b17, mixed mode)
 OSX 10.8.3:
 java version 1.7.0_06
 Java(TM) SE Runtime Environment (build 1.7.0_06-b24)
 Java HotSpot(TM) 64-Bit Server VM (build 23.2-b09, mixed mode)
Reporter: Ryan Krueger
 Fix For: 1.1

 Attachments: mod3.xlsx, mod.xls


 Some XLS documents hang the entire JVM.  A control-C or regular kill won't 
 stop the JVM; a kill -9 is required.
 We're running within an email server application, parsing documents to extract 
 the text of all attachments.  When we hit a message with an affected attachment 
 the entire JVM hangs, and we mark the message so that text extraction is 
 skipped for it on the next attempt.  Unfortunately, the hang stops all email 
 processing on the server until the internal watchdogs kill -9 the application.
 We have seen the issue for several months with different documents, but they 
 are always Excel files.  Some, but not all, of them draw complaints from Excel 
 when opened.
 In addition to experiencing the problem on our Linux servers, I have tested on 
 OSX and experienced the same problem.  It occurs whether I select the 
 affected file in the Tika UI or run the CLI.
 Tested with java -jar /path/to/tika-app-1.3.jar -t /path/to/file.xls
 When running on multi-CPU machines there are two threads running at 100% 
 every time.
 I have attached a document that triggers the error.
 I have tested with 1.2 and 1.3 with the same result.  With 1.1 the text is 
 extracted accurately.
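
One way to keep a runaway parse from stalling the host application is to run the extraction in a child process and kill it after a deadline. The following is a minimal sketch of that idea, not something proposed in this thread. It assumes Java 8 or later (Process.waitFor(long, TimeUnit) and destroyForcibly() are not available on the Java 7 builds listed above), and the class name, method name, and 60-second limit are hypothetical. The command line is the one quoted in the description.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class IsolatedExtract {

    /**
     * Runs the command in a child process and returns its combined
     * stdout/stderr, or null if it had to be killed after the timeout.
     * Hypothetical helper, for illustration only.
     */
    public static String runWithTimeout(List<String> command, long timeoutSeconds) {
        File out = null;
        try {
            out = File.createTempFile("extract", ".txt");
            Process p = new ProcessBuilder(command)
                    .redirectErrorStream(true)   // merge stderr into stdout
                    .redirectOutput(out)         // write to a file so the child never blocks on a full pipe
                    .start();
            if (!p.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                // Deadline passed: forcibly terminate the child and reap it.
                p.destroyForcibly().waitFor();
                return null;
            }
            return new String(Files.readAllBytes(out.toPath()), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for child process", e);
        } finally {
            if (out != null) {
                out.delete();
            }
        }
    }

    public static void main(String[] args) {
        // Command taken from the issue description; the paths are placeholders.
        String text = runWithTimeout(
                Arrays.asList("java", "-jar", "/path/to/tika-app-1.3.jar", "-t", "/path/to/file.xls"),
                60);
        System.out.println(text == null ? "timed out; attachment skipped" : text);
    }
}
```

With this arrangement only the child JVM is lost when a document triggers the hang; the caller sees null and can mark the attachment as skipped.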

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (TIKA-1132) Parsing some XLS documents hangs entire JVM, requires kill -9

2013-06-11 Thread Ryan Krueger (JIRA)

[ https://issues.apache.org/jira/browse/TIKA-1132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13680747#comment-13680747 ]

Ryan Krueger commented on TIKA-1132:


Running jvisualvm and pulling a thread dump I get the same trace each time:

main prio=10 tid=0x00606800 nid=0x7799 runnable [0x7fe26bf1d000]
   java.lang.Thread.State: RUNNABLE
    at org.apache.poi.ss.usermodel.DataFormatter$FractionFormat.format(DataFormatter.java:1009)
    at org.apache.poi.ss.usermodel.DataFormatter$FractionFormat.format(DataFormatter.java:1033)
    at java.text.Format.format(Format.java:157)
    at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents(DataFormatter.java:699)
    at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents(DataFormatter.java:669)
    at org.apache.poi.hssf.eventusermodel.FormatTrackingHSSFListener.formatNumberDateCell(FormatTrackingHSSFListener.java:129)
    at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.internalProcessRecord(ExcelExtractor.java:419)
    at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processRecord(ExcelExtractor.java:323)
    at org.apache.poi.hssf.eventusermodel.FormatTrackingHSSFListener.processRecord(FormatTrackingHSSFListener.java:82)
    at org.apache.poi.hssf.eventusermodel.HSSFRequest.processRecord(HSSFRequest.java:112)
    at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents(HSSFEventFactory.java:147)
    at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents(HSSFEventFactory.java:106)
    at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile(ExcelExtractor.java:299)
    at org.apache.tika.parser.microsoft.ExcelExtractor.parse(ExcelExtractor.java:151)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:194)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:139)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:415)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:109)




Looking at POI 3.8 in grepcode I see the affected code.  The methods appear to 
be unchanged in 3.9.

I don't know what's causing the issue as it doesn't immediately appear to me to 
be an infinite loop.

Here is the relevant section from org.apache.poi.ss.usermodel.DataFormatter.

1005  double minVal = 1.0;
1006  double currDenom = Math.pow(10, fractParts[1].length()) - 1d;
1007  double currNeum = 0;
1008  for (int i = (int)(Math.pow(10, fractParts[1].length()) - 1d); i > 0; i--) {
1009      for (int i2 = (int)(Math.pow(10, fractParts[1].length()) - 1d); i2 > 0; i2--) {
1010          if (minVal >= Math.abs((double)i2/(double)i - decPart)) {
1011              currDenom = i;
1012              currNeum = i2;
1013              minVal = Math.abs((double)i2/(double)i - decPart);
1014          }
1015      }
1016  }
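
The quoted loop is finite but quadratic: both counters start at 10^n - 1, where n is fractParts[1].length() (the number of digits in the denominator part of the format), so the body runs (10^n - 1)^2 times for each formatted value. The sketch below only counts those iterations to show how quickly that grows. The class and method names are made up for illustration and this is not POI code; that the attached file uses a long denominator pattern is an assumption, not something confirmed in this thread.

```java
public class FractionLoopCost {

    /**
     * Number of inner-loop iterations the quoted nested search performs
     * for a denominator pattern with the given number of digits.
     * Hypothetical helper, for illustration only.
     */
    public static long iterations(int denominatorDigits) {
        // Same bound as the quoted code: 10^n - 1 for both loops.
        long max = (long) Math.pow(10, denominatorDigits) - 1;
        return max * max;
    }

    public static void main(String[] args) {
        for (int d = 1; d <= 6; d++) {
            System.out.println(d + " digit(s): " + iterations(d) + " iterations");
        }
    }
}
```

Two denominator digits cost 9801 iterations, while five cost 9999800001, which would look like a hang in a thread dump without being an infinite loop.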


[jira] [Commented] (TIKA-1132) Parsing some XLS documents hangs entire JVM, requires kill -9

2013-06-11 Thread Nick Burch (JIRA)

[ https://issues.apache.org/jira/browse/TIKA-1132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13680800#comment-13680800 ]

Nick Burch commented on TIKA-1132:
--

Thanks for the test file. There's an open bug in POI about fraction formatting; 
it might be the same thing. I'll hopefully be able to take a look in the next 
few days, other work permitting.



[jira] [Commented] (TIKA-1132) Parsing some XLS documents hangs entire JVM, requires kill -9

2013-06-10 Thread Nick Burch (JIRA)

[ https://issues.apache.org/jira/browse/TIKA-1132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13680021#comment-13680021 ]

Nick Burch commented on TIKA-1132:
--

I can confirm that it goes into an infinite loop for me too.

Any chance that you could run it in a profiler or similar, and track down where 
the loop is happening? (My hunch is that it'll be an edge case in POI, or POI not 
handling a subtle form of corruption.)
