Re: [Matterhorn-users] Error extracting text

Dr Leslaw Zieleznik Wed, 30 Nov 2011 00:59:45 -0800

We have had a similar experience. I also generated a new dictionary from the
wiki texts, but it made no difference. So I would like to switch text
extraction off, as I have noticed many others did.
Which file needs to be modified to comment it out?


Leslaw


On 30 Nov 2011, at 08:31, [email protected] wrote:

> We also get very poor results from text extraction; on the other hand, we
> have not yet invested any time in that issue, since other issues are/were
> more important currently (getting the system running smoothly ;).
> 
> Just to say: you are not alone ;)
> 
> Cheers, Andreas
> 
> Kristof Keppens wrote on Wed, 30 Nov 2011, subject "Re: 
> [Matterhorn-users]...":
>> We uncommented this, but assumed it was not necessary since our tesseract 
>> is installed in the default location. Even with this uncommented the results 
>> stay the same. Very few slides get their text extracted (text extraction 
>> works fine when it does); most of them don't have any text due to the same 
>> issues previously stated.
>> 
>> It seems strange to me that no one else has this same issue, so maybe we have 
>> an error somewhere with ffmpeg or something similar, since I suspect the issue 
>> is ffmpeg being unable to correctly encode the slides to tif (most of the 
>> time).
>> 
>> 
>> 
>> On 2011-11-28 18:17, Jack Vant wrote:
>>> We had this same problem.  I found a file in
>>> /opt/matterhorn/felix/conf/services that seems to serve as a pointer
>>> to the text extraction utility that causes the error.  The file is
>>> org.opencastproject.textextractor.tesseract.TesseractTextExtractor.properties.
>>>  I got rid of the # symbol and restarted my matterhorn services and we
>>> were in business.  Hope this helps.
>>> On Tue, Nov 22, 2011 at 6:29 AM, Kristof Keppens<[email protected]>  wrote:
>>>> Hi,
>>>> We are getting further with the setup of our matterhorn infrastructure, and
>>>> so far most things work and we are almost ready to launch the 1.2 version.
>>>> However, the problem with the text extraction is still there and I haven't
>>>> found a solution so far. I did find the reason why the text extraction
>>>> fails: the tif file generated for text extraction is most of the time a
>>>> blank grey image, always the same file size and solid grey. Once in a while
>>>> a correct tif file is generated and the text extraction is fine then.
>>>> I don't see a clear connection between the successful tif files and the
>>>> failed ones (about 1 in 10 tif files is correct).
>>>> Has anyone else experienced these problems and found a solution?
>>>> Thanks
>>>> Kristof Keppens
>>>> Ghent University
>>>> On 2011-10-13 14:56, Kristof Keppens wrote:
>>>>> Hi,
>>>>> I'm having some issues with the text extraction with our fresh 1.2
>>>>> installation.
>>>>> I keep getting the following error:
>>>>> 2011-10-13 13:03:31 WARN (TextAnalyzerServiceImpl:229) - Error extracting text from http://ic**.ugent.be:8080/files/collection/composer/550.tif
>>>>> java.lang.IllegalArgumentException: The text cannot be empty
>>>>> at org.opencastproject.metadata.mpeg7.TextualImpl.<init>(TextualImpl.java:81)
>>>>> at org.opencastproject.textanalyzer.impl.TextAnalyzerServiceImpl.analyze(TextAnalyzerServiceImpl.java:324)
>>>>> at org.opencastproject.textanalyzer.impl.TextAnalyzerServiceImpl.extract(TextAnalyzerServiceImpl.java:194)
>>>>> at org.opencastproject.textanalyzer.impl.TextAnalyzerServiceImpl.process(TextAnalyzerServiceImpl.java:253)
>>>>> at org.opencastproject.job.api.AbstractJobProducer$JobRunner.call(AbstractJobProducer.java:184)
>>>>> at org.opencastproject.job.api.AbstractJobProducer$JobRunner.call(AbstractJobProducer.java:156)
>>>>> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>>>> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>>> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>>>> at java.lang.Thread.run(Thread.java:662)
>>>>> This error is repeated a number of times in the log. Text extraction
>>>>> does not fail for every image, just for some, but as a result the
>>>>> recording has the status failed with the following error:
>>>>> org.opencastproject.workflow.api.WorkflowOperationException:
>>>>> org.opencastproject.workflow.api.WorkflowOperationException: Text
>>>>> extraction failed on images from
>>>>> http://ic**.ugent.be:8080/files/mediapackage/5952f751-e8f9-41e5-b55d-7002ca31a67b/8fd9ca3d-cfbc-429a-a035-2ddcbf608412/logica_trimmed.avi
>>>>> These are tests with manually uploaded files; could this be a factor
>>>>> in why it fails?
>>>>> Thanks
>>>>> Kristof Keppens
>>>>> _______________________________________________
>>>>> Matterhorn-users mailing list
>>>>> [email protected]
>>>>> http://lists.opencastproject.org/mailman/listinfo/matterhorn-users
> 
> -----------------------
> [email protected]
> 01/58801 DW 41523
> mobil: 0664/60 588 4523
> TU Wien
> DVR-Nummer: 0005886
> -----------------------

Dr Leslaw Zieleznik
OBIS (Oxford Brookes Information Solutions)
Oxford Brookes University
[email protected]
Tel:  +44 (0)1865 483973
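
One possible way to check the symptom reported earlier in this thread (most
generated tif files being a blank grey image of always the same file size) is
sketched below. It groups the tif files in a directory by size and SHA-256
checksum and returns the groups with more than one member. This assumes the
blank frames are byte-identical, which the thread does not confirm; the
function name and the directory argument are made up for illustration and are
not part of Matterhorn.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_identical_tiffs(directory):
    """Group *.tif files under `directory` by (size, SHA-256 of contents).

    Returns the groups that contain more than one file, largest group
    first; each group is a sorted list of pathlib.Path objects.
    Hypothetical helper, not part of Matterhorn.
    """
    groups = defaultdict(list)
    for path in Path(directory).rglob("*.tif"):
        if not path.is_file():
            continue
        data = path.read_bytes()
        # Files with equal size and equal digest are treated as identical.
        key = (len(data), hashlib.sha256(data).hexdigest())
        groups[key].append(path)
    duplicates = [sorted(paths) for paths in groups.values() if len(paths) > 1]
    duplicates.sort(key=len, reverse=True)
    return duplicates
```

Calling find_identical_tiffs() on the directory that holds the generated tif
files and printing the largest group would show whether one repeated image
accounts for most of the failures. If the blank frames differ by a few bytes,
comparing sizes alone may be the more useful first check.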

