Copilot commented on code in PR #2885:
URL: https://github.com/apache/tika/pull/2885#discussion_r3383111877
##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java:
##########
@@ -614,7 +614,14 @@ private void collectPictureSlides(ShapeContainer container, int slideNum,
}
for (HSLFShape shape : shapes) {
if (shape instanceof HSLFPictureShape) {
- HSLFPictureData pd = ((HSLFPictureShape) shape).getPictureData();
+ HSLFPictureData pd;
+ try {
+ pd = ((HSLFPictureShape) shape).getPictureData();
+ } catch (IndexOutOfBoundsException e) {
+ // corrupt Escher BSE record -- skip page anchoring for this shape
+ EmbeddedDocumentUtil.recordEmbeddedStreamException(e, parentMetadata);
+ continue;
Review Comment:
The new error-handling path, which catches IndexOutOfBoundsException from
HSLFPictureShape#getPictureData, isn’t covered by existing unit tests. Adding a
regression test with a minimal corrupt .ppt that triggers this exception would
confirm that parsing continues past the bad shape and that the exception is
recorded in the parent metadata as intended.
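
As a rough illustration of the behavior such a regression test would need to
pin down, here is a minimal, dependency-free sketch of the skip-and-record
pattern. It does not use Tika or POI: the names SkipCorruptPictureSketch,
PictureShape, Result, and collect are hypothetical stand-ins, and the throwing
supplier only imitates what a corrupt Escher BSE record might cause. A real
test would presumably parse a corrupt .ppt fixture and inspect the parent
metadata instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

public class SkipCorruptPictureSketch {

    /** Hypothetical stand-in for a picture shape; the supplier may throw, imitating a corrupt record lookup. */
    static final class PictureShape {
        final String name;
        final Supplier<byte[]> pictureData;

        PictureShape(String name, Supplier<byte[]> pictureData) {
            this.name = name;
            this.pictureData = pictureData;
        }
    }

    /** Hypothetical result holder: shapes that were anchored, plus exceptions recorded along the way. */
    static final class Result {
        final List<String> anchored = new ArrayList<>();
        final List<Exception> recorded = new ArrayList<>();
    }

    /** Walks all shapes, recording and skipping any whose picture data lookup fails. */
    static Result collect(List<PictureShape> shapes) {
        Result result = new Result();
        for (PictureShape shape : shapes) {
            byte[] pd;
            try {
                pd = shape.pictureData.get();
            } catch (IndexOutOfBoundsException e) {
                // Record the failure (the PR records it in parent metadata) and move on.
                result.recorded.add(e);
                continue;
            }
            if (pd != null) {
                result.anchored.add(shape.name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<PictureShape> shapes = List.of(
                new PictureShape("ok-1", () -> new byte[] {1}),
                new PictureShape("corrupt", () -> {
                    throw new IndexOutOfBoundsException("simulated bad BSE index");
                }),
                new PictureShape("ok-2", () -> new byte[] {2}));

        Result r = collect(shapes);
        System.out.println(r.anchored + " recorded=" + r.recorded.size());
    }
}
```

The two properties worth asserting are the ones named above: shapes after the
corrupt one are still processed, and exactly one exception is recorded for it.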