[jira] [Updated] (TIKA-2519) Issue parsing multiple CHM files concurrently
[ https://issues.apache.org/jira/browse/TIKA-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-2519:
------------------------------
    Priority: Blocker  (was: Minor)

> Issue parsing multiple CHM files concurrently
> ---------------------------------------------
>
>                 Key: TIKA-2519
>                 URL: https://issues.apache.org/jira/browse/TIKA-2519
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.16
>            Reporter: Eamonn Saunders
>            Priority: Blocker
>
> Should I expect to be able to parse multiple CHM files concurrently in multiple threads?
> What I'm noticing when attempting to parse 2 different CHM files in different threads is that:
> - ChmExtractor.extractChmEntry() gets a ChmBlockInfo as follows:
> {code}
> ChmBlockInfo bb = ChmBlockInfo.getChmBlockInfoInstance(
>         directoryListingEntry, (int) getChmLzxcResetTable()
>                 .getBlockLen(), getChmLzxcControlData());
> {code}
> - ChmBlockInfo.getChmBlockInfoInstance() is a static method that appears to limit the number of ChmBlockInfo instances to 1.
> {code}
> public static ChmBlockInfo getChmBlockInfoInstance(
>         DirectoryListingEntry dle, int bytesPerBlock,
>         ChmLzxcControlData clcd) {
>     setChmBlockInfo(new ChmBlockInfo());
>     getChmBlockInfo().setStartBlock(dle.getOffset() / bytesPerBlock);
>     getChmBlockInfo().setEndBlock(
>             (dle.getOffset() + dle.getLength()) / bytesPerBlock);
>     getChmBlockInfo().setStartOffset(dle.getOffset() % bytesPerBlock);
>     getChmBlockInfo().setEndOffset(
>             (dle.getOffset() + dle.getLength()) % bytesPerBlock);
>     // potential problem with casting long to int
>     getChmBlockInfo().setIniBlock(
>             getChmBlockInfo().startBlock - getChmBlockInfo().startBlock
>                     % (int) clcd.getResetInterval());
>     //(getChmBlockInfo().startBlock - getChmBlockInfo().startBlock)
>     //        % (int) clcd.getResetInterval());
>     return getChmBlockInfo();
> }
> {code}
> Is there a good reason why there should only ever be one instance of ChmBlockInfo?
> Should we forget about attempting to process CHM files in parallel and instead queue them up to be processed sequentially?

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
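One way the factory could be made thread-safe is to compute every field into a fresh local instance and drop the static `setChmBlockInfo()`/`getChmBlockInfo()` state entirely. The sketch below is only an illustration, not Tika's actual fix: it uses plain `int` parameters in place of `DirectoryListingEntry` and `ChmLzxcControlData`, and the `create` name is hypothetical; the arithmetic is the same as in the quoted method.

```java
public final class ChmBlockInfo {

    public final int startBlock;
    public final int endBlock;
    public final int startOffset;
    public final int endOffset;
    public final int iniBlock;

    private ChmBlockInfo(int startBlock, int endBlock,
                         int startOffset, int endOffset, int iniBlock) {
        this.startBlock = startBlock;
        this.endBlock = endBlock;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
        this.iniBlock = iniBlock;
    }

    // Same arithmetic as getChmBlockInfoInstance(), but computed into a fresh,
    // immutable instance with no static field, so concurrent callers cannot
    // overwrite each other's result.
    public static ChmBlockInfo create(int entryOffset, int entryLength,
                                      int bytesPerBlock, int resetInterval) {
        int startBlock = entryOffset / bytesPerBlock;
        int endBlock = (entryOffset + entryLength) / bytesPerBlock;
        int startOffset = entryOffset % bytesPerBlock;
        int endOffset = (entryOffset + entryLength) % bytesPerBlock;
        int iniBlock = startBlock - startBlock % resetInterval;
        return new ChmBlockInfo(startBlock, endBlock,
                startOffset, endOffset, iniBlock);
    }

    public static void main(String[] args) {
        // entry at offset 0x10000, length 0x100, 0x8000-byte blocks, reset interval 2
        ChmBlockInfo info = create(0x10000, 0x100, 0x8000, 2);
        System.out.println("startBlock=" + info.startBlock
                + " endBlock=" + info.endBlock
                + " iniBlock=" + info.iniBlock);
    }
}
```

With the fields final and no shared state, the result is safe to publish across threads, so queueing CHM files for sequential processing would no longer be necessary for this reason.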
[jira] [Commented] (TIKA-2519) Issue parsing multiple CHM files concurrently
[ https://issues.apache.org/jira/browse/TIKA-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16281225#comment-16281225 ]

Tim Allison commented on TIKA-2519:
-----------------------------------
Thank you for opening this issue. That's definitely a bug. Parsers should be multi-threadable.
[jira] [Created] (TIKA-2519) Issue parsing multiple CHM files concurrently
Eamonn Saunders created TIKA-2519:
----------------------------------

             Summary: Issue parsing multiple CHM files concurrently
                 Key: TIKA-2519
                 URL: https://issues.apache.org/jira/browse/TIKA-2519
             Project: Tika
          Issue Type: Bug
    Affects Versions: 1.16
            Reporter: Eamonn Saunders
            Priority: Minor
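The clobbering the reporter describes can be reproduced outside Tika with a minimal model of the same install-then-compute-through-a-static-field pattern. Everything below is a demo-only stand-in (the class names and latch hooks are hypothetical, not Tika code); the latches force the unlucky interleaving deterministically instead of waiting for it to happen by chance.

```java
import java.util.concurrent.CountDownLatch;

public class SingletonRaceDemo {

    // Minimal stand-in for ChmBlockInfo: one computed field is enough.
    static class BlockInfo {
        int startBlock;
    }

    // The shared static field, mirroring setChmBlockInfo()/getChmBlockInfo().
    static BlockInfo shared;

    // Mirrors the getChmBlockInfoInstance() pattern: install a new instance in
    // the static field, then compute through that field. The two latches are
    // demo-only hooks that freeze a caller between "install" and "compute".
    static BlockInfo getInstance(int offset, int bytesPerBlock,
                                 CountDownLatch pause, CountDownLatch resume)
            throws InterruptedException {
        shared = new BlockInfo();                    // step 1: replace the singleton
        if (pause != null) {
            pause.countDown();                       // tell main we are installed
            resume.await();                          // ...and wait to be resumed
        }
        shared.startBlock = offset / bytesPerBlock;  // step 2: write via static field
        return shared;                               // step 3: return the static field
    }

    // Returns what caller B observes in its "own" result after caller A finishes.
    static int demo() throws Exception {
        CountDownLatch aInstalled = new CountDownLatch(1);
        CountDownLatch aResume = new CountDownLatch(1);

        // Caller A wants block 1 (offset 32768, block size 32768).
        Thread a = new Thread(() -> {
            try {
                getInstance(32768, 32768, aInstalled, aResume);
            } catch (InterruptedException ignored) { }
        });
        a.start();
        aInstalled.await();

        // Caller B wants block 0; it overwrites the shared field and returns
        // an object whose startBlock is (correctly, for now) 0.
        BlockInfo bResult = getInstance(0, 32768, null, null);

        // A resumes and writes ITS startBlock through the static field,
        // which now points at B's instance.
        aResume.countDown();
        a.join();

        return bResult.startBlock;  // A's value, not the 0 that B computed
    }

    public static void main(String[] args) throws Exception {
        System.out.println("B's result after the race: startBlock = " + demo());
    }
}
```

B ends up holding block 1's coordinates while decompressing what it believes is block 0, which is consistent with the corrupted extraction seen when two CHM files are parsed concurrently.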
Re: Tika 1.17?
Hi Tim,

I've had a brief look at the exceptions folder. It seems we are much better off with ppt (4677 fixed exceptions) and pdf (7798), but there are 208 new exceptions with ppt. I did not check the files to see if they are corrupted, but some common tokens were lost. Below is the most common new stack trace:

org.apache.poi.hslf.exceptions.HSLFException: Couldn't instantiate the class for type with id 1000 on class class org.apache.poi.hslf.record.Document : java.lang.reflect.InvocationTargetException
Cause was : org.apache.poi.hslf.exceptions.HSLFException: Couldn't instantiate the class for type with id 1010 on class class org.apache.poi.hslf.record.Environment : java.lang.reflect.InvocationTargetException
Cause was : org.apache.poi.hslf.exceptions.HSLFException: Couldn't instantiate the class for type with id 2005 on class class org.apache.poi.hslf.record.FontCollection : java.lang.reflect.InvocationTargetException
Cause was : java.lang.IllegalArgumentException: typeface can't be null nor empty
	at org.apache.poi.hslf.record.Record.createRecordForType(Record.java:186)
	at org.apache.poi.hslf.record.Record.buildRecordAtOffset(Record.java:104)
	at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.read(HSLFSlideShowImpl.java:279)
	at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.buildRecords(HSLFSlideShowImpl.java:260)
	at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.<init>(HSLFSlideShowImpl.java:166)
	at org.apache.poi.hslf.usermodel.HSLFSlideShow.<init>(HSLFSlideShow.java:181)
	at org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:78)
	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:179)
	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:132)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
	at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
	at org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:84)
	at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:158)
	at org.apache.tika.batch.FileResourceConsumer.parse(FileResourceConsumer.java:406)
	at org.apache.tika.batch.fs.RecursiveParserWrapperFSConsumer.processFileResource(RecursiveParserWrapperFSConsumer.java:104)
	at org.apache.tika.batch.FileResourceConsumer._processFileResource(FileResourceConsumer.java:181)
	at org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:115)
	at org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:50)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.reflect.InvocationTargetException
	at sun.reflect.GeneratedConstructorAccessor283.newInstance(Unknown Source)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
	at org.apache.poi.hslf.record.Record.createRecordForType(Record.java:182)
	... 25 more
Caused by: org.apache.poi.hslf.exceptions.HSLFException: Couldn't instantiate the class for type with id 1010 on class class org.apache.poi.hslf.record.Environment : java.lang.reflect.InvocationTargetException
Cause was : org.apache.poi.hslf.exceptions.HSLFException: Couldn't instantiate the class for type with id 2005 on class class org.apache.poi.hslf.record.FontCollection : java.lang.reflect.InvocationTargetException
Cause was : java.lang.IllegalArgumentException: typeface can't be null nor empty
	at org.apache.poi.hslf.record.Record.createRecordForType(Record.java:186)
	at org.apache.poi.hslf.record.Record.findChildRecords(Record.java:129)
	at org.apache.poi.hslf.record.Document.<init>(Document.java:133)
	... 29 more
Caused by: java.lang.reflect.InvocationTargetException
	at sun.reflect.GeneratedConstructorAccessor285.newInstance(Unknown Source)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
	at org.apache.poi.hslf.record.Record.createRecordForType(Record.java:182)
	... 31 more
Caused by: org.apache.poi.hslf.exceptions.HSLFException: Couldn't instantiate the class for type with id 2005 on class class org.apache.poi.hslf.record.FontCollection : java.lang.reflect.InvocationTargetException
Cause was : java.lang.IllegalArgumentException: typeface can't be null nor empty
	at org.apache.poi.hslf.record.Record.createRecordForType(Record.java:186)
	at org.apache.poi.hslf.record.Record.findChildRecords(Record.java:129)
	at