[ https://issues.apache.org/jira/browse/TIKA-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16281915#comment-16281915 ]
Tim Allison commented on TIKA-2519:
-----------------------------------

I'm seeing this when I run the code against chm multithreaded:

{noformat}
Caused by: org.apache.tika.exception.TikaException: can't copy beyond array length
    at org.apache.tika.parser.chm.core.ChmCommons.copyOfRange(ChmCommons.java:347)
    at org.apache.tika.parser.chm.accessor.ChmDirectoryListingSet.enumerateChmDirectoryListingList(ChmDirectoryListingSet.java:144)
    at org.apache.tika.parser.chm.accessor.ChmDirectoryListingSet.<init>(ChmDirectoryListingSet.java:63)
    at org.apache.tika.parser.chm.core.ChmExtractor.<init>(ChmExtractor.java:181)
    at org.apache.tika.parser.chm.ChmParser.parse(ChmParser.java:63)
{noformat}

This is a problem.

> Issue parsing multiple CHM files concurrently
> ---------------------------------------------
>
>                 Key: TIKA-2519
>                 URL: https://issues.apache.org/jira/browse/TIKA-2519
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.16
>            Reporter: Eamonn Saunders
>            Priority: Blocker
>
> Should I expect to be able to parse multiple CHM files concurrently in multiple threads?
> What I'm noticing when attempting to parse 2 different CHM files in different threads is that:
> - ChmExtractor.extractChmEntry() gets a ChmBlockInfo as follows:
> {code}
> ChmBlockInfo bb = ChmBlockInfo.getChmBlockInfoInstance(
>         directoryListingEntry, (int) getChmLzxcResetTable()
>                 .getBlockLen(), getChmLzxcControlData());
> {code}
> - ChmBlockInfo.getChmBlockInfoInstance() is a static method that appears to limit the number of ChmBlockInfo instances to 1.
> {code}
> public static ChmBlockInfo getChmBlockInfoInstance(
>         DirectoryListingEntry dle, int bytesPerBlock,
>         ChmLzxcControlData clcd) {
>     setChmBlockInfo(new ChmBlockInfo());
>     getChmBlockInfo().setStartBlock(dle.getOffset() / bytesPerBlock);
>     getChmBlockInfo().setEndBlock(
>             (dle.getOffset() + dle.getLength()) / bytesPerBlock);
>     getChmBlockInfo().setStartOffset(dle.getOffset() % bytesPerBlock);
>     getChmBlockInfo().setEndOffset(
>             (dle.getOffset() + dle.getLength()) % bytesPerBlock);
>     // potential problem with casting long to int
>     getChmBlockInfo().setIniBlock(
>             getChmBlockInfo().startBlock - getChmBlockInfo().startBlock
>                     % (int) clcd.getResetInterval());
>     // (getChmBlockInfo().startBlock - getChmBlockInfo().startBlock)
>     // % (int) clcd.getResetInterval());
>     return getChmBlockInfo();
> }
> {code}
> Is there a good reason why there should only ever be one instance of ChmBlockInfo?
> Should we forget about attempting to process CHM files in parallel and instead queue them up to be processed sequentially?
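
For illustration only, here is a minimal sketch of one way to avoid the shared state the report describes. The quoted factory appears to store each new object through a static setter and then read it back through a static getter, so two threads can overwrite each other's instance halfway through initialization. The sketch instead computes every value in local variables and returns a fresh immutable object. BlockInfoSketch and the parameters of create(...) are hypothetical stand-ins, not Tika's API, and this is not necessarily the fix adopted in Tika. The arithmetic mirrors the quoted method.

```java
// Hypothetical stand-in for ChmBlockInfo; NOT the actual Tika class.
final class BlockInfoSketch {
    final int startBlock;
    final int endBlock;
    final int startOffset;
    final int endOffset;
    final int iniBlock;

    private BlockInfoSketch(int startBlock, int endBlock,
                            int startOffset, int endOffset, int iniBlock) {
        this.startBlock = startBlock;
        this.endBlock = endBlock;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
        this.iniBlock = iniBlock;
    }

    // Assumed parameters: offset and length stand in for dle.getOffset() and
    // dle.getLength(); resetInterval stands in for clcd.getResetInterval().
    // No static field is read or written here, so concurrent callers cannot
    // observe or clobber each other's partially initialized instance.
    static BlockInfoSketch create(int offset, int length,
                                  int bytesPerBlock, int resetInterval) {
        int startBlock = offset / bytesPerBlock;
        int endBlock = (offset + length) / bytesPerBlock;
        int startOffset = offset % bytesPerBlock;
        int endOffset = (offset + length) % bytesPerBlock;
        // Same expression as the quoted code: round startBlock down to a
        // multiple of the reset interval.
        int iniBlock = startBlock - startBlock % resetInterval;
        return new BlockInfoSketch(startBlock, endBlock,
                startOffset, endOffset, iniBlock);
    }
}
```

With a factory of this shape, each call such as BlockInfoSketch.create(100, 50, 32, 2) yields its own object, so parsing two CHM files in two threads would not share any block-info state through this path. Whether other static state exists elsewhere in the CHM parser is not something this sketch addresses.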