Re: [Matterhorn-users] Error extracting text

Andreas . Krieger Wed, 30 Nov 2011 02:22:05 -0800

A similar experience here. I have also generated a new dictionary from the wiki 
texts, but this does not make any difference. So I would like to switch it off, 
as I have noticed many others did.
Which file needs to be modified to comment it out?


Non-authoritative answer:

in $FELIX_HOME/conf/workflow/*.xml
remove the following lines:
------------------------------------------
    <!-- Run text analysis -->
    <operation
      id="extract-text"
      fail-on-error="false"
      exception-handler-workflow="error"
      description="Extracting text from presentation segments">
      <configurations>
        <configuration key="source-flavor">presentation/trimmed</configuration>
        <configuration key="source-tags"></configuration>
        <configuration key="target-tags">engage</configuration>
      </configurations>
    </operation>
-------------------------------------------
Regards, Andreas
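
A minimal sketch of the removal step above, for anyone who prefers to script it across several workflow files. It assumes the operation is written as an opening `<operation ... id="extract-text" ...>` tag with a separate `</operation>` closing tag, as in the snippet quoted above; the function name `remove_extract_text` is made up for this example and is not part of Matterhorn.

```python
import re

# One <operation ... id="extract-text" ...> ... </operation> block, plus the
# optional "Run text analysis" comment directly above it.
# Assumption: the operation is not written as a self-closing tag.
EXTRACT_TEXT_OP = re.compile(
    r'[ \t]*(?:<!--\s*Run text analysis\s*-->\s*)?'
    r'<operation\b[^>]*\bid="extract-text"[^>]*>.*?</operation>[ \t]*\n?',
    re.DOTALL,
)


def remove_extract_text(workflow_xml: str) -> str:
    """Return the workflow definition text without any extract-text operation."""
    return EXTRACT_TEXT_OP.sub("", workflow_xml)
```

To use it, read each file under $FELIX_HOME/conf/workflow/ as text, pass it through `remove_extract_text`, and write the result back after keeping a copy of the original. Working on the raw text rather than a parsed tree leaves the rest of the file, including comments and indentation, untouched.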

Dr Leslaw Zieleznik wrote on Wed, 30 Nov 2011, subject "Re:...":

A similar experience here. I have also generated a new dictionary from the wiki 
texts, but this does not make any difference. So I would like to switch it off, 
as I have noticed many others did.
Which file needs to be modified to comment it out?

Leslaw


On 30 Nov 2011, at 08:31, [email protected] wrote:

We also get very poor results from text extraction; on the other hand, we have 
not yet invested any time in that issue, since other issues are/were more 
important at the moment (getting the system running smoothly ;).

Just to indicate: you are not alone ;)

Cheers, Andreas

Kristof Keppens wrote on Wed, 30 Nov 2011, subject "Re: [Matterhorn-users]...":

We uncommented this, although we assumed it was not necessary since our tesseract 
is installed in the default location. Even with this uncommented, the results stay 
the same: very few slides get their text extracted (text extraction works fine 
when it does), and most of them have no text because of the issues stated 
previously.

It seems strange to me that no one else has this same issue, so maybe we have an 
error somewhere with ffmpeg or something similar; I suspect that ffmpeg is unable 
to encode the slides to tif correctly (most of the time).



On 2011-11-28 18:17, Jack Vant wrote:

We had this same problem.  I found a file in
/opt/matterhorn/felix/conf/services that seems to serve as a pointer
to the text extraction utility that causes the error.  The file is
org.opencastproject.textextractor.tesseract.TesseractTextExtractor.properties.
I got rid of the # symbol and restarted my matterhorn services and we
were in business.  Hope this helps.
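
A rough sketch of the "got rid of the # symbol" step, in case it has to be repeated on several nodes. The thread does not name the property inside the file, so the key is passed in by the caller; `uncomment_property` is a made-up helper, and it only assumes the usual `#key=value` layout of a Java properties file.

```python
import re


def uncomment_property(properties_text: str, key: str) -> str:
    """Strip the leading '#' from lines of the form '#key=value' or '#key: value'.

    Lines that are already active, and comment lines that merely mention
    other words, are left unchanged.
    """
    pattern = re.compile(
        r'^[ \t]*#[ \t]*(' + re.escape(key) + r'[ \t]*[=:])',
        re.MULTILINE,
    )
    return pattern.sub(r'\1', properties_text)
```

As in the message above, the matterhorn services still need a restart after the file is changed.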
On Tue, Nov 22, 2011 at 6:29 AM, Kristof Keppens <[email protected]> wrote:

Hi,
We are getting further with the setup of our matterhorn infrastructure, and
so far most things work and we are almost ready to launch the 1.2 version.
However, the problem with the text extraction is still there and I haven't
found a solution so far. I did find the reason why the text extraction
fails: the tif file generated for text extraction is, most of the time, a
blank, solid grey image, always with the same file size. Once in a while
a correct tif file is generated, and the text extraction is fine then.
I don't see a clear connection between the successful tif files and the
failed ones (roughly 1 in 10 tif files is correct).
Is anyone else experiencing these problems, and has anyone found a solution?
Thanks
Kristof Keppens
Ghent University
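
One way to quantify the blank-frame symptom described above is to group the generated tif files by content hash. This is only a sketch: it assumes the blank grey frames are byte-identical (the message only says they share a file size), and both the function name and the directory to scan are placeholders chosen by the caller.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def group_identical_files(directory, pattern="*.tif"):
    """Map SHA-256 digest -> list of paths whose contents are byte-identical.

    Only groups with more than one file are returned, so a large group is a
    hint that the same (possibly blank) frame was written repeatedly.
    """
    groups = defaultdict(list)
    for path in sorted(Path(directory).glob(pattern)):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        groups[digest].append(path)
    return {d: paths for d, paths in groups.items() if len(paths) > 1}
```

Hashing is used instead of comparing file sizes because uncompressed tif frames of the same resolution can share a size even when their contents differ.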
On 2011-10-13 14:56, Kristof Keppens wrote:

Hi,
I'm having some issues with the text extraction with our fresh 1.2
installation.
I keep getting the following error:
2011-10-13 13:03:31 WARN (TextAnalyzerServiceImpl:229) - Error extracting text from http://ic**.ugent.be:8080/files/collection/composer/550.tif
java.lang.IllegalArgumentException: The text cannot be empty
    at org.opencastproject.metadata.mpeg7.TextualImpl.<init>(TextualImpl.java:81)
    at org.opencastproject.textanalyzer.impl.TextAnalyzerServiceImpl.analyze(TextAnalyzerServiceImpl.java:324)
    at org.opencastproject.textanalyzer.impl.TextAnalyzerServiceImpl.extract(TextAnalyzerServiceImpl.java:194)
    at org.opencastproject.textanalyzer.impl.TextAnalyzerServiceImpl.process(TextAnalyzerServiceImpl.java:253)
    at org.opencastproject.job.api.AbstractJobProducer$JobRunner.call(AbstractJobProducer.java:184)
    at org.opencastproject.job.api.AbstractJobProducer$JobRunner.call(AbstractJobProducer.java:156)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
This error is repeated a number of times in the log. The text extraction
does not fail for every image, just for some images, but as a result the
recording has the status failed with the following error:
org.opencastproject.workflow.api.WorkflowOperationException: org.opencastproject.workflow.api.WorkflowOperationException: Text extraction failed on images from
http://ic**.ugent.be:8080/files/mediapackage/5952f751-e8f9-41e5-b55d-7002ca31a67b/8fd9ca3d-cfbc-429a-a035-2ddcbf608412/logica_trimmed.avi
These are tests with manually uploaded files; I am not sure whether this could be
a factor in why it fails.
Thanks
Kristof Keppens
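
Since the warning is repeated many times in the log, a small sketch like the following can list which images failed and how often. It relies only on the "Error extracting text from <url>" wording shown in the log excerpt above; the function name `failed_images` is invented for this example, and whitespace is matched loosely so that wrapped lines are still recognised.

```python
import re
from collections import Counter

# Matches the warning text from the excerpt above and captures the image URL.
FAILED_IMAGE = re.compile(r'Error\s+extracting\s+text\s+from\s+(\S+)')


def failed_images(log_text: str) -> Counter:
    """Count how often each image URL appears in an extraction warning."""
    return Counter(FAILED_IMAGE.findall(log_text))
```

Comparing the resulting URLs with the tif files that actually exist on disk would show whether the failures line up with the blank frames.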
_______________________________________________
Matterhorn-users mailing list
[email protected]
http://lists.opencastproject.org/mailman/listinfo/matterhorn-users



-----------------------
[email protected]
01/58801 DW 41523
mobil: 0664/60 588 4523
TU Wien
DVR-Nummer: 0005886
-----------------------


Dr Leslaw Zieleznik
OBIS (Oxford Brookes Information Solutions)
Oxford Brookes University
[email protected]
Tel:  +44 (0)1865 483973

