Copilot commented on code in PR #2863:
URL: https://github.com/apache/tika/pull/2863#discussion_r3351189626
##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pkg-module/src/main/java/org/apache/tika/parser/pkg/CompressorParser.java:
##########
@@ -328,6 +358,22 @@ private String getStreamName(Metadata metadata) {
return MIMES_TO_NAME.get(mimeString);
}
+ /**
+ * Peeks at the stream signature to determine whether it is a pack200 archive, without
+ * consuming the stream. Used so pack200 can be routed through the COMPRESS-721 workaround in
+ * {@link #parse}.
+ *
+ * @param tis the input, which must support mark/reset (a TikaInputStream always does)
+ * @return {@code true} if the signature matches pack200
+ */
+ private static boolean isPack200(TikaInputStream tis) {
+ try {
+ return CompressorStreamFactory.PACK200.equals(CompressorStreamFactory.detect(tis));
+ } catch (CompressorException e) {
+ return false;
+ }
+ }
Review Comment:
`isPack200()` currently calls `CompressorStreamFactory.detect(tis)`. For
inputs without a content-type hint, detection therefore runs twice (once here
and again in `factory.createCompressorInputStream(tis)`), which adds avoidable
overhead on every non-pack200 stream. Since pack200 has a fixed 4-byte
signature (`0xCAFED00D`), the check can be implemented as a cheap peek at the
first four bytes without invoking commons-compress detection at all.
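
A minimal sketch of such a peek, offered as an illustration rather than a drop-in patch. The class name `Pack200Peek` and its `isPack200(InputStream)` helper are hypothetical. The sketch uses plain `java.io` mark/reset instead of any Tika-specific API so that it stands alone, and it assumes Java 9 or later for `readNBytes`.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical helper, not part of Tika or commons-compress.
final class Pack200Peek {

    // The fixed 4-byte pack200 signature 0xCAFED00D cited in the comment above.
    private static final byte[] PACK200_MAGIC = {
        (byte) 0xCA, (byte) 0xFE, (byte) 0xD0, (byte) 0x0D
    };

    private Pack200Peek() {
    }

    // Reads the first four bytes, compares them with the signature, and
    // resets the stream so the caller sees it unconsumed.
    static boolean isPack200(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(PACK200_MAGIC.length);
        try {
            byte[] header = new byte[PACK200_MAGIC.length];
            // readNBytes keeps reading until the buffer is full or EOF is hit,
            // so a short return value means the stream has fewer than 4 bytes.
            int n = in.readNBytes(header, 0, header.length);
            return n == header.length && Arrays.equals(header, PACK200_MAGIC);
        } finally {
            in.reset();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] pack200 = {(byte) 0xCA, (byte) 0xFE, (byte) 0xD0, (byte) 0x0D, 7, 0};
        byte[] classFile = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE};
        System.out.println(isPack200(new ByteArrayInputStream(pack200)));
        System.out.println(isPack200(new ByteArrayInputStream(classFile)));
    }
}
```

In `CompressorParser` the same comparison could be applied to the `TikaInputStream` directly, since the Javadoc in the hunk states that it always supports mark/reset. An `IOException` from the peek would need to be handled or propagated in place of the current `CompressorException` catch.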