We have had a similar experience. I also generated a new dictionary from the wiki texts, but this made no difference. I would therefore like to switch text extraction off, as I have noticed many others have done. Which file needs to be modified to comment it out?
Leslaw

On 30 Nov 2011, at 08:31, [email protected] wrote:

> we also have very poor results on text extraction; but otoh we did not invest
> any time yet into that issue, since other issues are/were of more importance
> currently (get the system running smoothly ;).
>
> just to indicate: you are not alone ;)
>
> Cheers, Andreas
>
> Kristof Keppens wrote on Wed, 30 Nov 2011, subject "Re:
> [Matterhorn-users]...":
>> We uncommented this, but assumed this is not necessary since our tesseract
>> is installed in the default location. Even with this uncommented the results
>> stay the same. Very few slides get the text extracted ( but text extraction
>> works fine when it does ), most of them don't have any text due to the same
>> issues previously stated.
>>
>> It seems strange to me that no one else has this same issue so maybe we have
>> an error somewhere with ffmpeg or something alike since I suspect the issue
>> being ffmpeg unable to correctly encode the slides to tif ( most of the time
>> ).
>>
>> On 2011-11-28 18:17, Jack Vant wrote:
>>> We had this same problem. I found a file in
>>> /opt/matterhorn/felix/conf/services that seems to serve as a pointer
>>> to the text extraction utility that causes the error. The file is
>>> org.opencastproject.textextractor.tesseract.TesseractTextExtractor.properties.
>>> I got rid of the # symbol and restarted my matterhorn services and we
>>> were in business. Hope this helps.
>>>
>>> On Tue, Nov 22, 2011 at 6:29 AM, Kristof Keppens<[email protected]> wrote:
>>>> Hi,
>>>>
>>>> We are getting further with the setup of our matterhorn infrastructure, and
>>>> so far most things work and we are almost ready to launch the 1.2 version.
>>>> However the problem with the text extraction is still there and I haven't
>>>> found a solution so far. I did find the reason why the text extraction
>>>> fails, the tif file generated for text extraction is most of the time a
>>>> blank grey image, always the same file size and solid grey. Once in a while
>>>> there is a correct tif file generated and the text extraction is fine then.
>>>> I don't see a clear connection between the successful tif files and the
>>>> failed ( it's a ratio of about 1/10 tif's are correct ) ones.
>>>>
>>>> Is anyone else experiencing these problems and found a solution ?
>>>>
>>>> Thanks
>>>> Kristof Keppens
>>>> Ghent University
>>>>
>>>> On 2011-10-13 14:56, Kristof Keppens wrote:
>>>>> Hi,
>>>>>
>>>>> I'm having some issues with the text extraction with our fresh 1.2
>>>>> installation.
>>>>> I keep getting the following error:
>>>>>
>>>>> 2011-10-13 13:03:31 WARN (TextAnalyzerServiceImpl:229) - Error
>>>>> extracting text from
>>>>> http://ic**.ugent.be:8080/files/collection/composer/550.tif
>>>>> java.lang.IllegalArgumentException: The text cannot be empty
>>>>>     at org.opencastproject.metadata.mpeg7.TextualImpl.<init>(TextualImpl.java:81)
>>>>>     at org.opencastproject.textanalyzer.impl.TextAnalyzerServiceImpl.analyze(TextAnalyzerServiceImpl.java:324)
>>>>>     at org.opencastproject.textanalyzer.impl.TextAnalyzerServiceImpl.extract(TextAnalyzerServiceImpl.java:194)
>>>>>     at org.opencastproject.textanalyzer.impl.TextAnalyzerServiceImpl.process(TextAnalyzerServiceImpl.java:253)
>>>>>     at org.opencastproject.job.api.AbstractJobProducer$JobRunner.call(AbstractJobProducer.java:184)
>>>>>     at org.opencastproject.job.api.AbstractJobProducer$JobRunner.call(AbstractJobProducer.java:156)
>>>>>     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>>>>     at java.lang.Thread.run(Thread.java:662)
>>>>>
>>>>> This error is repeated a number of times in the log. The text extraction
>>>>> does not fail for every image, just for some images, but as a result the
>>>>> recording
>>>>> has the status failed with following error :
>>>>>
>>>>> org.opencastproject.workflow.api.WorkflowOperationException:
>>>>> org.opencastproject.workflow.api.WorkflowOperationException: Text
>>>>> extraction failed on images from
>>>>> http://ic**.ugent.be:8080/files/mediapackage/5952f751-e8f9-41e5-b55d-7002ca31a67b/8fd9ca3d-cfbc-429a-a035-2ddcbf608412/logica_trimmed.avi
>>>>>
>>>>> These are tests with manually uploaded files, not sure if this could be
>>>>> a factor why it fails?
>>>>>
>>>>> Thanks
>>>>> Kristof Keppens
>>>>> _______________________________________________
>>>>> Matterhorn-users mailing list
>>>>> [email protected]
>>>>> http://lists.opencastproject.org/mailman/listinfo/matterhorn-users
>
> -----------------------
> [email protected]
> 01/58801 DW 41523
> mobil: 0664/60 588 4523
> TU Wien
> DVR-Nummer: 0005886
> -----------------------

Dr Leslaw Zieleznik
OBIS (Oxford Brookes Information Solutions)
Oxford Brookes University
[email protected]
Tel: +44 (0)1865 483973
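
For anyone who wants to check whether they see the same symptom Kristof describes (blank grey tif files that are "always the same file size"), here is a minimal Python sketch. It is only a heuristic built on that one observation: it lists the .tif files in a directory whose byte size matches the most common size. The function name `suspect_blank_tifs`, the `min_share` threshold and the directory argument are all assumptions made for illustration; nothing here is part of Matterhorn, and it does not inspect pixel data.

```python
from collections import Counter
from pathlib import Path


def suspect_blank_tifs(directory, min_share=0.5):
    """Return .tif files whose byte size equals the most common size.

    Heuristic only: it assumes blank frames all share one file size,
    as reported in the thread. min_share is a hypothetical threshold:
    the most common size must cover at least this fraction of the
    files before anything is flagged.
    """
    files = sorted(Path(directory).glob("*.tif"))
    if not files:
        return []

    # Map each file to its size in bytes.
    sizes = {f: f.stat().st_size for f in files}

    # Find the size that occurs most often and how many files have it.
    common_size, count = Counter(sizes.values()).most_common(1)[0]

    # A size seen only once, or by a minority of files, is not suspicious.
    if count < 2 or count / len(files) < min_share:
        return []

    return [f for f in files if sizes[f] == common_size]
```

Pointing it at the directory where the composer writes its tif files and printing the returned paths should give a quick list of candidates to open in an image viewer. Real slides can of course also happen to share a file size, so treat the result as a hint rather than proof.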
