[
https://issues.apache.org/jira/browse/TIKA-3272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Peter Kronenberg updated TIKA-3272:
-----------------------------------
Description:
* Discussed with Tim on the mailing list about avoiding the call to rotate if
the angle of rotation is 0.
* Also, allowing just rotation, without doing the other pre-processing, which
has a lot more overhead.
* Replace rotation.py (and Python dependency) with calls to Tess4j classes (2
classes are extracted from the Tess4j package to avoid importing the entire
package)
*ApplyRotation* and *EnableImageProcessing* have been separated so that
*ApplyRotation* does not depend on *EnableImageProcessing*. Doing all the
pre-processing with _ImageMagick_ adds a lot of overhead. Doing the rotation
by itself is much quicker. It’s not clear whether rotation is the fastest of
these operations, or whether any one of them run on its own would also be
faster than the full pre-processing. But with rotation, there is an easy way
to figure out if the document needs it. That is not so for the other operations.
If *ApplyRotation*=True and *EnableImageProcessing*=False, then _ImageMagick_
will be called *just* to fix the rotation. But only if the rotation angle is not 0.
If the angle is 0, then we don’t call _ImageMagick_ at all.
If both *ApplyRotation* and *EnableImageProcessing* are True, then we call
_ImageMagick_ to do all the pre-processing, but we only include rotation if the
angle is not 0.
When determining if the current angle of rotation is 0, we treat any angle
where -1.0 < angle < 1.0 as 0. The code that determines the angle appears to
return 0 anyway for anything in this range. This does not affect the accuracy
of the OCR result.
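The rules above can be sketched roughly as follows. This is only an illustration, not the code in the patch: the class, enum, and method names are hypothetical, and the case where *ApplyRotation*=False and *EnableImageProcessing*=True is assumed to run the pre-processing without rotation.

```java
// Hypothetical sketch of the decision described above; not the Tika implementation.
public class RotationDecision {

    // What the (hypothetical) caller would ask ImageMagick to do.
    enum Action { NONE, ROTATE_ONLY, PREPROCESS_ONLY, PREPROCESS_AND_ROTATE }

    // Any angle with -1.0 < angle < 1.0 is treated as 0.
    static final double ZERO_ANGLE_THRESHOLD = 1.0;

    static boolean needsRotation(double angle) {
        return Math.abs(angle) >= ZERO_ANGLE_THRESHOLD;
    }

    static Action decide(boolean applyRotation, boolean enableImageProcessing, double angle) {
        // Rotation is only requested when it is enabled and the angle is not "0".
        boolean rotate = applyRotation && needsRotation(angle);
        if (enableImageProcessing) {
            // Assumption: with ApplyRotation=False we still do the other pre-processing.
            return rotate ? Action.PREPROCESS_AND_ROTATE : Action.PREPROCESS_ONLY;
        }
        // Without pre-processing, ImageMagick is called only if rotation is needed.
        return rotate ? Action.ROTATE_ONLY : Action.NONE;
    }

    public static void main(String[] args) {
        System.out.println(decide(true, false, 0.4));   // angle treated as 0
        System.out.println(decide(true, false, -90.0)); // rotation only
        System.out.println(decide(true, true, 0.0));    // pre-processing without rotation
        System.out.println(decide(true, true, 12.5));   // pre-processing plus rotation
    }
}
```

Comparing the absolute value of the angle keeps negative angles in scope, which matches the -1.0 < angle < 1.0 range used for the zero check.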
The dependency on Python has been removed. This includes:
* pythonPath in TesseractOCRConfig
* The tests that check whether Python is on the system, can be run, and has
all the prerequisites.
was:
* Discussed with Tim on the mailing list about avoiding the call to rotate if
the angle of rotation is 0.
* Also, allowing just rotation, without doing the other pre-processing, which
has a lot more overhead.
* Replace rotation.py (and Python dependency) with calls to Tess4j classes (2
classes are extracted from the Tess4j package to avoid importing the entire
package)
> Improve Rotation handling
> -------------------------
>
> Key: TIKA-3272
> URL: https://issues.apache.org/jira/browse/TIKA-3272
> Project: Tika
> Issue Type: Improvement
> Reporter: Peter Kronenberg
> Priority: Major