[jira] [Updated] (TIKA-2519) Issue parsing multiple CHM files concurrently

2017-12-06 Thread Tim Allison (JIRA)

 [ https://issues.apache.org/jira/browse/TIKA-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-2519:
--
Priority: Blocker  (was: Minor)




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (TIKA-2519) Issue parsing multiple CHM files concurrently

2017-12-06 Thread Tim Allison (JIRA)

 [ https://issues.apache.org/jira/browse/TIKA-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16281225#comment-16281225 ]

Tim Allison commented on TIKA-2519:
---

Thank you for opening this issue. That’s definitely a bug. Parsers should be 
multi-threadable.






[jira] [Created] (TIKA-2519) Issue parsing multiple CHM files concurrently

2017-12-06 Thread Eamonn Saunders (JIRA)
Eamonn Saunders created TIKA-2519:
-

 Summary: Issue parsing multiple CHM files concurrently
 Key: TIKA-2519
 URL: https://issues.apache.org/jira/browse/TIKA-2519
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.16
Reporter: Eamonn Saunders
Priority: Minor


Should I expect to be able to parse multiple CHM files concurrently from multiple threads?
When attempting to parse two different CHM files in two different threads, I see the following:

- ChmExtractor.extractChmEntry() gets a ChmBlockInfo as follows:
{code}
ChmBlockInfo bb = ChmBlockInfo.getChmBlockInfoInstance(
        directoryListingEntry, (int) getChmLzxcResetTable()
                .getBlockLen(), getChmLzxcControlData());
{code}
- ChmBlockInfo.getChmBlockInfoInstance() is a static method that appears to 
limit the number of ChmBlockInfo instances to 1.
{code}
public static ChmBlockInfo getChmBlockInfoInstance(
        DirectoryListingEntry dle, int bytesPerBlock,
        ChmLzxcControlData clcd) {
    setChmBlockInfo(new ChmBlockInfo());
    getChmBlockInfo().setStartBlock(dle.getOffset() / bytesPerBlock);
    getChmBlockInfo().setEndBlock(
            (dle.getOffset() + dle.getLength()) / bytesPerBlock);
    getChmBlockInfo().setStartOffset(dle.getOffset() % bytesPerBlock);
    getChmBlockInfo().setEndOffset(
            (dle.getOffset() + dle.getLength()) % bytesPerBlock);
    // potential problem with casting long to int
    getChmBlockInfo().setIniBlock(
            getChmBlockInfo().startBlock - getChmBlockInfo().startBlock
                    % (int) clcd.getResetInterval());
    //(getChmBlockInfo().startBlock - getChmBlockInfo().startBlock)
    //        % (int) clcd.getResetInterval());
    return getChmBlockInfo();
}
{code}
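The hazard is that every call routes its state through one static field, so an interleaving of two threads lets one thread's getters and return value observe the other thread's instance. The pattern can be reproduced deterministically by replaying one possible schedule by hand; the sketch below uses hypothetical names (Holder, begin/fill/finish), not Tika's classes:

{code}
// Deterministic replay of one unlucky two-thread schedule against a
// static-singleton builder, mirroring the setChmBlockInfo()/getChmBlockInfo()
// pattern above. All names here are illustrative.
public class StaticSingletonRace {
    static class Holder {
        int owner;
        int value;
        Holder(int owner) { this.owner = owner; }
    }

    static Holder shared;                                         // the lone shared slot

    static void begin(int owner) { shared = new Holder(owner); }  // like setChmBlockInfo(new ...)
    static void fill(int value)  { shared.value = value; }        // like the setXxx() calls
    static Holder finish()       { return shared; }               // like getChmBlockInfo()

    public static void main(String[] args) {
        begin(1);              // thread A enters the method
        begin(2);              // thread B enters before A finishes, replacing the slot
        fill(111);             // A's writes now land in B's instance
        Holder a = finish();   // A returns B's instance, not its own
        System.out.println(a.owner);  // prints 2: thread A got thread B's data
    }
}
{code}

With a real scheduler the interleaving is nondeterministic, which is why the bug shows up only intermittently when two CHM files are parsed at once.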

Is there a good reason why there should only ever be one instance of 
ChmBlockInfo?

Should we forget about attempting to process CHM files in parallel and instead 
queue them up to be processed sequentially?
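One way out, sketched below on the assumption that the computed fields are all a caller needs (the class and method names are hypothetical, and this is not the actual Tika patch), is to build the block info in a local variable and return it, so each call, and therefore each thread, gets its own instance:

{code}
// Sketch of a per-call variant of getChmBlockInfoInstance(): all state is
// local to the call, so concurrent threads cannot observe each other's data.
public class ChmBlockInfoSketch {
    int startBlock, endBlock, startOffset, endOffset, iniBlock;

    static ChmBlockInfoSketch of(long offset, long length,
                                 int bytesPerBlock, long resetInterval) {
        ChmBlockInfoSketch info = new ChmBlockInfoSketch();  // fresh per call
        info.startBlock = (int) (offset / bytesPerBlock);
        info.endBlock = (int) ((offset + length) / bytesPerBlock);
        info.startOffset = (int) (offset % bytesPerBlock);
        info.endOffset = (int) ((offset + length) % bytesPerBlock);
        // round startBlock down to the previous LZX reset point
        info.iniBlock = info.startBlock - info.startBlock % (int) resetInterval;
        return info;
    }
}
{code}

Because nothing escapes the call except the returned object, no locking is needed. Synchronizing getChmBlockInfoInstance() would also work, but at the cost of serializing all CHM parsing.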





Re: Tika 1.17?

2017-12-06 Thread Luís Filipe Nassif
Hi Tim,

I've had a brief look at the exceptions folder. It seems we are doing much
better with ppt (4677 fixed exceptions) and pdf (7798), but there are 208 new
exceptions with ppt. I have not checked whether the files are corrupted, but
some common tokens were lost. Below is the most common new stack trace:

org.apache.poi.hslf.exceptions.HSLFException: Couldn't instantiate the class for type with id 1000 on class class org.apache.poi.hslf.record.Document : java.lang.reflect.InvocationTargetException
Cause was : org.apache.poi.hslf.exceptions.HSLFException: Couldn't instantiate the class for type with id 1010 on class class org.apache.poi.hslf.record.Environment : java.lang.reflect.InvocationTargetException
Cause was : org.apache.poi.hslf.exceptions.HSLFException: Couldn't instantiate the class for type with id 2005 on class class org.apache.poi.hslf.record.FontCollection : java.lang.reflect.InvocationTargetException
Cause was : java.lang.IllegalArgumentException: typeface can't be null nor empty
    at org.apache.poi.hslf.record.Record.createRecordForType(Record.java:186)
    at org.apache.poi.hslf.record.Record.buildRecordAtOffset(Record.java:104)
    at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.read(HSLFSlideShowImpl.java:279)
    at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.buildRecords(HSLFSlideShowImpl.java:260)
    at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.<init>(HSLFSlideShowImpl.java:166)
    at org.apache.poi.hslf.usermodel.HSLFSlideShow.<init>(HSLFSlideShow.java:181)
    at org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:78)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:179)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:132)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
    at org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:84)
    at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:158)
    at org.apache.tika.batch.FileResourceConsumer.parse(FileResourceConsumer.java:406)
    at org.apache.tika.batch.fs.RecursiveParserWrapperFSConsumer.processFileResource(RecursiveParserWrapperFSConsumer.java:104)
    at org.apache.tika.batch.FileResourceConsumer._processFileResource(FileResourceConsumer.java:181)
    at org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:115)
    at org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:50)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.GeneratedConstructorAccessor283.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
    at org.apache.poi.hslf.record.Record.createRecordForType(Record.java:182)
    ... 25 more
Caused by: org.apache.poi.hslf.exceptions.HSLFException: Couldn't instantiate the class for type with id 1010 on class class org.apache.poi.hslf.record.Environment : java.lang.reflect.InvocationTargetException
Cause was : org.apache.poi.hslf.exceptions.HSLFException: Couldn't instantiate the class for type with id 2005 on class class org.apache.poi.hslf.record.FontCollection : java.lang.reflect.InvocationTargetException
Cause was : java.lang.IllegalArgumentException: typeface can't be null nor empty
    at org.apache.poi.hslf.record.Record.createRecordForType(Record.java:186)
    at org.apache.poi.hslf.record.Record.findChildRecords(Record.java:129)
    at org.apache.poi.hslf.record.Document.<init>(Document.java:133)
    ... 29 more
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.GeneratedConstructorAccessor285.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
    at org.apache.poi.hslf.record.Record.createRecordForType(Record.java:182)
    ... 31 more
Caused by: org.apache.poi.hslf.exceptions.HSLFException: Couldn't instantiate the class for type with id 2005 on class class org.apache.poi.hslf.record.FontCollection : java.lang.reflect.InvocationTargetException
Cause was : java.lang.IllegalArgumentException: typeface can't be null nor empty
    at org.apache.poi.hslf.record.Record.createRecordForType(Record.java:186)
    at org.apache.poi.hslf.record.Record.findChildRecords(Record.java:129)
    at