[jira] [Commented] (TIKA-4251) [DISCUSS] move to cosium's git-code-format-maven-plugin with google-java-format
[ https://issues.apache.org/jira/browse/TIKA-4251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17859718#comment-17859718 ] Tilman Hausherr commented on TIKA-4251: --- I'm wondering if this means lots of changes to check at the beginning. This is the kind of plugin that would be ideal for a supply chain attack. > [DISCUSS] move to cosium's git-code-format-maven-plugin with > google-java-format > --- > > Key: TIKA-4251 > URL: https://issues.apache.org/jira/browse/TIKA-4251 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > > I was recently working a bit on incubator-stormcrawler, and I noticed that > they are using cosium's git-code-format-maven-plugin: > https://github.com/Cosium/git-code-format-maven-plugin > I was initially annoyed that I couldn't quickly figure out what I had to fix > to make the linter happy, but then I realized there was a magic command: > {{mvn git-code-format:format-code}} which just fixed the code so that the > linter passed. > The one drawback I found is that it neither fixes nor alerts on > wildcard imports. We could still use checkstyle for that but only have one > rule for checkstyle. > The other drawback is that there is not a lot of room for variation from > google's style. This may actually be a benefit, too, of course. > I just ran this on {{tika-core}} here: > https://github.com/apache/tika/tree/google-java-format > What would you think about making this change for 3.x? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4270) wrong skew angle in tika-parser-ocr-module
[ https://issues.apache.org/jira/browse/TIKA-4270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4270: -- Description: We use Tika to extract text from different sources, including images with text that is rotated at a certain angle. To extract text from an image with OCR, Tika first deskews the image. The skew angle is not calculated correctly. In the example [^for_issue] (PNG file), the text is rotated at an angle of ~40 degrees, but the skew angle function (org.apache.tika.parser.ocr.tess4j.ImageDeskew#getSkewAngle) returns an angle of about 15 degrees. The slope angle calculation flag is enabled. The documentation (https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#:~:text=To%20identify%20rotation) does not have sufficient information for this version of Tika: there is a todo box and some relevant information for Tika 1 (which requires Python and its libraries, whereas in the version of Tika we use, angle calculations are implemented in Java only). was: We use Tika to extract text from different sources, including images with text that is rotated at a certain angle. To extract text from an image with OCR, Tika first deskews the image. The skew angle is not calculated correctly. In the example [^for_issue], the text is rotated at an angle of ~40 degrees, but the skew angle function (org.apache.tika.parser.ocr.tess4j.ImageDeskew#getSkewAngle) returns an angle of about 15 degrees. The slope angle calculation flag is enabled. 
The documentation (https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#:~:text=To%20identify%20rotation) does not have sufficient information for this version of Tika: there is a todo box and some relevant information for Tika 1 (which requires Python and its libraries, whereas in the version of Tika we use, angle calculations are implemented in Java only). > wrong skew angle in tika-parser-ocr-module > -- > > Key: TIKA-4270 > URL: https://issues.apache.org/jira/browse/TIKA-4270 > Project: Tika > Issue Type: Bug >Affects Versions: 2.9.1 >Reporter: Roman >Priority: Major > Attachments: for_issue > > > We use Tika to extract text from different sources, including images with > text that is rotated at a certain angle. To extract text from an image with OCR, > Tika first deskews the image. The skew angle is not calculated correctly. In > the example [^for_issue] (PNG file), the text is rotated at an angle of ~40 > degrees, but the skew angle function > (org.apache.tika.parser.ocr.tess4j.ImageDeskew#getSkewAngle) returns an angle > of about 15 degrees. The slope angle calculation flag is enabled. > The documentation > (https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#:~:text=To%20identify%20rotation) > does not have sufficient information for this version of Tika: there is a > todo box and some relevant information for Tika 1 (which requires Python and its > libraries, whereas in the version of Tika we use, angle calculations are > implemented in Java only). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (TIKA-4267) Not getting correct mime type for a few file extensions. example: csv
[ https://issues.apache.org/jira/browse/TIKA-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr closed TIKA-4267. - Resolution: Invalid Closing for now, please comment and/or reopen if needed. > Not getting correct mime type for a few file extensions. example: csv > - > > Key: TIKA-4267 > URL: https://issues.apache.org/jira/browse/TIKA-4267 > Project: Tika > Issue Type: Bug >Affects Versions: 1.28.4 >Reporter: niv >Priority: Major > > The mime type for CSV files is always incorrectly detected as text/plain. > Using method > {{detector.detect(stream, metadata);}} > jar file used - Tika 1.28.4 > How can I get the correct mime type in a Java application? > Please point me to a list of the extensions currently supported by the latest > jar, with their mime types, if any. -- This message was sent by Atlassian Jira (v8.20.10#820010)
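Editorial note on the question in TIKA-4267: the thread does not record why the reporter saw text/plain, but one plausible reading is that CSV has no signature bytes, so a detector that only sees the stream has little to go on unless the metadata also carries a file name. The following is a toy model of that idea in plain Java, not Tika's API; the class `CsvDetectionSketch`, its methods, and the single ".csv" rule are hypothetical and exist only for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Toy model only: NOT Tika's detector. All names here are hypothetical.
final class CsvDetectionSketch {

    /** Content-only guess: CSV has no magic bytes, so it looks like plain text. */
    static String detectFromContent(byte[] head) {
        if (head.length >= 4 && head[0] == '%' && head[1] == 'P'
                && head[2] == 'D' && head[3] == 'F') {
            return "application/pdf";
        }
        return "text/plain";
    }

    /** Refines a text/plain guess with an optional file name hint. */
    static String detect(byte[] head, String fileNameHint) {
        String fromContent = detectFromContent(head);
        if (fileNameHint != null
                && fromContent.equals("text/plain")
                && fileNameHint.toLowerCase(Locale.ROOT).endsWith(".csv")) {
            return "text/csv";
        }
        return fromContent;
    }

    public static void main(String[] args) {
        byte[] csv = "id,name\n1,Ada\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(csv, null));       // text/plain
        System.out.println(detect(csv, "data.csv")); // text/csv
    }
}
```

If this reading applies, the practical step in a real application would be to put the file name into the `Metadata` object before calling `detect`; the exact metadata key depends on the Tika version, so check the API documentation for the release in use.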
[jira] [Updated] (TIKA-4267) Not getting correct mime type for a few file extensions. example: csv
[ https://issues.apache.org/jira/browse/TIKA-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4267: -- Summary: Not getting correct mime type for a few file extensions. example: csv (was: Not getting correct mimet type for few file extensions. example :csv) > Not getting correct mime type for a few file extensions. example: csv > - > > Key: TIKA-4267 > URL: https://issues.apache.org/jira/browse/TIKA-4267 > Project: Tika > Issue Type: Bug >Affects Versions: 1.28.4 >Reporter: niv >Priority: Major > > The mime type for CSV files is always incorrectly detected as text/plain. > Using method > {{detector.detect(stream, metadata);}} > jar file used - Tika 1.28.4 > How can I get the correct mime type in a Java application? > Please point me to a list of the extensions currently supported by the latest > jar, with their mime types, if any. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4267) Not getting correct mimet type for few file extensions. example :csv
[ https://issues.apache.org/jira/browse/TIKA-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4267: -- Affects Version/s: 1.28.4 > Not getting correct mimet type for few file extensions. example :csv > > > Key: TIKA-4267 > URL: https://issues.apache.org/jira/browse/TIKA-4267 > Project: Tika > Issue Type: Bug >Affects Versions: 1.28.4 >Reporter: niv >Priority: Major > > The mime type for CSV files is always incorrectly detected as text/plain. > Using method > {{detector.detect(stream, metadata);}} > jar file used - Tika 1.28.4 > How can I get the correct mime type in a Java application? > Please point me to a list of the extensions currently supported by the latest > jar, with their mime types, if any. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (TIKA-4267) Not getting correct mimet type for few file extensions. example :csv
[ https://issues.apache.org/jira/browse/TIKA-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17851598#comment-17851598 ] Tilman Hausherr edited comment on TIKA-4267 at 6/3/24 12:06 PM: The current version is 2.9.2, please retry with that one. Get the list of parsers with this code:
{code:java}
AutoDetectParser parser = new AutoDetectParser();
ParseContext context = new ParseContext();
// getParsers() maps each supported media type to the parser that handles it
Map<MediaType, Parser> parsers = parser.getParsers(context);
Tika tika = new Tika();
System.out.println(tika.toString());
System.out.println("List of parsers: ");
int idx = 0;
for (Map.Entry<MediaType, Parser> p : parsers.entrySet()) {
    MediaType t = p.getKey();
    System.out.println((idx + 1) + ".- " + t.getType() + "/" + t.getSubtype());
    ++idx;
}
{code}
was (Author: tilman): The current version is 2.9.2, please retry with that one. > Not getting correct mimet type for few file extensions. example :csv > > > Key: TIKA-4267 > URL: https://issues.apache.org/jira/browse/TIKA-4267 > Project: Tika > Issue Type: Bug >Reporter: niv >Priority: Major > > The mime type for CSV files is always incorrectly detected as text/plain. > Using method > {{detector.detect(stream, metadata);}} > jar file used - Tika 1.28.4 > How can I get the correct mime type in a Java application? > Please point me to a list of the extensions currently supported by the latest > jar, with their mime types, if any. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (TIKA-4267) Not getting correct mime type for a few file extensions. example: csv
[ https://issues.apache.org/jira/browse/TIKA-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17851598#comment-17851598 ] Tilman Hausherr edited comment on TIKA-4267 at 6/3/24 12:07 PM: The current version is 2.9.2, please retry with that one; if it still doesn't work, please attach your csv file. Get the list of parsers with this code:
{code:java}
AutoDetectParser parser = new AutoDetectParser();
ParseContext context = new ParseContext();
// getParsers() maps each supported media type to the parser that handles it
Map<MediaType, Parser> parsers = parser.getParsers(context);
Tika tika = new Tika();
System.out.println(tika.toString());
System.out.println("List of parsers: ");
int idx = 0;
for (Map.Entry<MediaType, Parser> p : parsers.entrySet()) {
    MediaType t = p.getKey();
    System.out.println((idx + 1) + ".- " + t.getType() + "/" + t.getSubtype());
    ++idx;
}
{code}
was (Author: tilman): The current version is 2.9.2, please retry with that one. Get the list of parsers with this code:
{code:java}
AutoDetectParser parser = new AutoDetectParser();
ParseContext context = new ParseContext();
Map<MediaType, Parser> parsers = parser.getParsers(context);
Tika tika = new Tika();
System.out.println(tika.toString());
System.out.println("List of parsers: ");
int idx = 0;
for (Map.Entry<MediaType, Parser> p : parsers.entrySet()) {
    MediaType t = p.getKey();
    System.out.println((idx + 1) + ".- " + t.getType() + "/" + t.getSubtype());
    ++idx;
}
{code}
> Not getting correct mime type for a few file extensions. example: csv > - > > Key: TIKA-4267 > URL: https://issues.apache.org/jira/browse/TIKA-4267 > Project: Tika > Issue Type: Bug >Affects Versions: 1.28.4 >Reporter: niv >Priority: Major > > The mime type for CSV files is always incorrectly detected as text/plain. > Using method > {{detector.detect(stream, metadata);}} > jar file used - Tika 1.28.4 > How can I get the correct mime type in a Java application? 
> Please point me to a list of the extensions currently supported by the latest > jar, with their mime types, if any. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4267) Not getting correct mimet type for few file extensions. example :csv
[ https://issues.apache.org/jira/browse/TIKA-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17851598#comment-17851598 ] Tilman Hausherr commented on TIKA-4267: --- The current version is 2.9.2, please retry with that one. > Not getting correct mimet type for few file extensions. example :csv > > > Key: TIKA-4267 > URL: https://issues.apache.org/jira/browse/TIKA-4267 > Project: Tika > Issue Type: Bug >Reporter: niv >Priority: Major > > The mime type for CSV files is always incorrectly detected as text/plain. > Using method > {{detector.detect(stream, metadata);}} > jar file used - Tika 1.28.4 > How can I get the correct mime type in a Java application? > Please point me to a list of the extensions currently supported by the latest > jar, with their mime types, if any. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-1907) Big Pdf parsing to text - Out of memory
[ https://issues.apache.org/jira/browse/TIKA-1907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-1907: -- Fix Version/s: 3.0.0 > Big Pdf parsing to text - Out of memory > --- > > Key: TIKA-1907 > URL: https://issues.apache.org/jira/browse/TIKA-1907 > Project: Tika > Issue Type: Bug >Affects Versions: 1.12 >Reporter: Nicolas Daniels >Priority: Major > Fix For: 3.0.0 > > > Linked to PDFBox issue: [https://issues.apache.org/jira/browse/PDFBOX-3284] > I'm duplicating it here to make sure it will be fixed in Tika as well. Maybe > PDFBox is not the appropriate lib to use in such a case. > Trying to read the same PDF using Tika leads to the same problem:
> {code:title=Test.java|borderStyle=solid}
> @Test
> public void testParsePdf_Content_Memory() throws Exception {
>     InputStream inputStream = new FileInputStream("c:/tmp/sr2015_mx_clearing_3dot0_mdr2_solution.pdf");
>     try {
>         StringWriter writer = new StringWriter();
>         FileWriter fileWriter = new FileWriter(new File("c:/tmp/test.txt"));
>         BodyContentHandler handler = new BodyContentHandler(fileWriter);
>         Metadata metadata = new Metadata();
>         new PDFParser().parse(inputStream, handler, metadata, new ParseContext());
>         fileWriter.close();
>     } finally {
>         inputStream.close();
>     }
> }
> {code}
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (TIKA-4254) The test `TestMimeTypes#testJavaRegex` is not idempotent, as it passes in the first run and fails in repeated runs in the same environment.
[ https://issues.apache.org/jira/browse/TIKA-4254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845590#comment-17845590 ] Tilman Hausherr edited comment on TIKA-4254 at 5/12/24 9:40 AM: THausherr commented on PR #1754: URL: https://github.com/apache/tika/pull/1754#issuecomment-2105679546 Maybe I get it: {{repo = config.getMimeRepository();}} isn't creating anything new, it's retrieving something that is changed later by the test? If my understanding is correct then it's a deeper problem. was (Author: githubbot): THausherr commented on PR #1754: URL: https://github.com/apache/tika/pull/1754#issuecomment-2105679546 Maybe I get it: `repo = config.getMimeRepository();` isn't creating anything new, it's retrieving something that is changed later by the test? If my understanding is correct then it's a deeper problem.
> The test `TestMimeTypes#testJavaRegex` is not idempotent, as it passes in the first run and fails in repeated runs in the same environment.
>
> Key: TIKA-4254 > URL: https://issues.apache.org/jira/browse/TIKA-4254 > Project: Tika > Issue Type: Bug >Reporter: Kaiyao Ke >Priority: Major
>
> ### Brief Description of the Bug
> The test `TestMimeTypes#testJavaRegex` is non-idempotent, as it passes in the first run but fails in the second run in the same environment. The source of the problem is that each test execution initializes a new media type (`MimeType`) instance `testType` (same problem for `testType2`), and all media types across different test executions attempt to use the same name pattern `"rtg_sst_grb_0\\.5\\.\\d{8}"`. Therefore, in the second execution of the test, the line `this.repo.addPattern(testType, pattern, true);` will throw an error, since the name pattern is already used by the `testType` instance created in the first test execution.
> Specifically, in the second run, the `addGlob()` method of the `Patterns` class detects the conflicting pattern and throws a `MimeTypeException` (line 123 in `Patterns.java`).
> ### Failure Message in the 2nd Test Run:
> ```
> org.apache.tika.mime.MimeTypeException: Conflicting glob pattern: rtg_sst_grb_0\.5\.\d{8}
>     at org.apache.tika.mime.Patterns.addGlob(Patterns.java:123)
>     at org.apache.tika.mime.Patterns.add(Patterns.java:71)
>     at org.apache.tika.mime.MimeTypes.addPattern(MimeTypes.java:450)
>     at org.apache.tika.mime.TestMimeTypes.testJavaRegex(TestMimeTypes.java:851)
>     at java.base/java.lang.reflect.Method.invoke(Method.java:568)
>     at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
>     at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
> ```
> ### Reproduce
> Use the `NIOInspector` plugin that supports rerunning individual tests in the same environment:
> ```
> cd tika-parsers/tika-parsers-standard/tika-parsers-standard-package
> mvn edu.illinois:NIOInspector:rerun -Dtest=org.apache.tika.mime.TestMimeTypes#testJavaRegex
> ```
> ### Proposed Fix
> Declare `testType` and `testType2` as static variables and initialize them at class loading time. Therefore, repeated runs of `testJavaRegex()` will not conflict with each other. All tests pass and are idempotent after the fix.
> ### Necessity of Fix
> A fix is recommended as unit tests should be idempotent, and state pollution should be mitigated so that newly introduced tests do not fail in the future due to polluted shared state.
-- This message was sent by Atlassian Jira (v8.20.10#820010)
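Editorial note: the mechanism discussed in TIKA-4254 (a shared repository that remembers the glob from the first run, so a rerun with a new type instance conflicts) can be sketched in plain Java. This is a simplified model under stated assumptions, not Tika's `MimeTypes` or `Patterns` code: the `Repo` class is hypothetical, types are modeled as bare `Object` instances, and the conflict rule is reduced to an identity comparison.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of the reported behavior; NOT Tika's implementation.
final class SharedRepoDemo {

    /** Hypothetical stand-in for a repository that rejects conflicting globs. */
    static final class Repo {
        private final Map<String, Object> globs = new HashMap<>();

        void addPattern(Object type, String glob) {
            Object previous = globs.putIfAbsent(glob, type);
            // Same glob registered for a different type instance: conflict.
            if (previous != null && previous != type) {
                throw new IllegalStateException("Conflicting glob pattern: " + glob);
            }
        }
    }

    static final String GLOB = "rtg_sst_grb_0\\.5\\.\\d{8}";

    /** Simulates one test execution; returns false if registration conflicts. */
    static boolean runOnce(Repo repo, Object type) {
        try {
            repo.addPattern(type, GLOB);
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        Repo shared = new Repo();
        System.out.println(runOnce(shared, new Object()));     // true: first run
        System.out.println(runOnce(shared, new Object()));     // false: rerun with a new type
        System.out.println(runOnce(new Repo(), new Object())); // true: fresh repository
    }
}
```

In this model both remedies raised in the thread behave as expected: reusing one type instance across runs (the proposed static fields) avoids the conflict, and so does giving each run its own repository. The second option corresponds to the "deeper problem" raised in the comment, namely that the test mutates a repository it did not create.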
[jira] [Commented] (TIKA-4254) The test `TestMimeTypes#testJavaRegex` is not idempotent, as it passes in the first run and fails in repeated runs in the same environment.
[ https://issues.apache.org/jira/browse/TIKA-4254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845566#comment-17845566 ] Tilman Hausherr commented on TIKA-4254: --- Why would we ever run the test twice in the same environment?
> The test `TestMimeTypes#testJavaRegex` is not idempotent, as it passes in the first run and fails in repeated runs in the same environment.
>
> Key: TIKA-4254 > URL: https://issues.apache.org/jira/browse/TIKA-4254 > Project: Tika > Issue Type: Bug >Reporter: Kaiyao Ke >Priority: Major
>
> ### Brief Description of the Bug
> The test `TestMimeTypes#testJavaRegex` is non-idempotent, as it passes in the first run but fails in the second run in the same environment. The source of the problem is that each test execution initializes a new media type (`MimeType`) instance `testType` (same problem for `testType2`), and all media types across different test executions attempt to use the same name pattern `"rtg_sst_grb_0\\.5\\.\\d{8}"`. Therefore, in the second execution of the test, the line `this.repo.addPattern(testType, pattern, true);` will throw an error, since the name pattern is already used by the `testType` instance created in the first test execution. Specifically, in the second run, the `addGlob()` method of the `Patterns` class detects the conflicting pattern and throws a `MimeTypeException` (line 123 in `Patterns.java`).
> ### Failure Message in the 2nd Test Run:
> ```
> org.apache.tika.mime.MimeTypeException: Conflicting glob pattern: rtg_sst_grb_0\.5\.\d{8}
>     at org.apache.tika.mime.Patterns.addGlob(Patterns.java:123)
>     at org.apache.tika.mime.Patterns.add(Patterns.java:71)
>     at org.apache.tika.mime.MimeTypes.addPattern(MimeTypes.java:450)
>     at org.apache.tika.mime.TestMimeTypes.testJavaRegex(TestMimeTypes.java:851)
>     at java.base/java.lang.reflect.Method.invoke(Method.java:568)
>     at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
>     at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
> ```
> ### Reproduce
> Use the `NIOInspector` plugin that supports rerunning individual tests in the same environment:
> ```
> cd tika-parsers/tika-parsers-standard/tika-parsers-standard-package
> mvn edu.illinois:NIOInspector:rerun -Dtest=org.apache.tika.mime.TestMimeTypes#testJavaRegex
> ```
> ### Proposed Fix
> Declare `testType` and `testType2` as static variables and initialize them at class loading time. Therefore, repeated runs of `testJavaRegex()` will not conflict with each other. All tests pass and are idempotent after the fix.
> ### Necessity of Fix
> A fix is recommended as unit tests should be idempotent, and state pollution should be mitigated so that newly introduced tests do not fail in the future due to polluted shared state.
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4245) Tika does not get html content properly
[ https://issues.apache.org/jira/browse/TIKA-4245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840922#comment-17840922 ] Tilman Hausherr commented on TIKA-4245: --- The file claims to be utf-16 but it isn't. If I change it to utf-8 in the editor then I get an NPE in the GUI. > Tika does not get html content properly > > > Key: TIKA-4245 > URL: https://issues.apache.org/jira/browse/TIKA-4245 > Project: Tika > Issue Type: Bug >Reporter: Xiaohong Yang >Priority: Major > Attachments: Sample html file and tika config xml.zip > > > We use org.apache.tika.parser.AutoDetectParser to get the content of html > files. And we found out that it does not get the content of the sample file > properly. > Following is the sample code and attached is the tika-config.xml and the > sample html file. The content extracted with Tika reads > "㱨瑭氠硭汮猺景㴢桴瑰㨯⽷睷㌮潲术ㄹ㤹⽘卌⽆潲浡琢㸍ਉ़桥慤㸼䵅呁瑴瀭敱畩瘽≃潮瑥湴ⵔ祰攢潮瑥湴㴢瑥硴…". That is different > from the native file. > > > The operating system is Ubuntu 20.04. Java version is 21. Tika version is > 2.9.2.
> {code:java}
> import org.apache.commons.io.FileUtils;
> import org.apache.tika.config.TikaConfig;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.apache.tika.sax.BodyContentHandler;
>
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.PrintWriter;
> import java.nio.file.Files;
> import java.nio.file.Path;
> import java.nio.file.Paths;
>
> public class ExtractTxtFromHtml {
>     private static final Path inputFile = new File("/home/ubuntu/testdirs/testdir_html/451434.html").toPath();
>
>     public static void main(String args[]) {
>         extactText(false);
>         extactText(true);
>     }
>
>     static void extactText(boolean largeFile) {
>         PrintWriter outputFileWriter = null;
>         try {
>             BodyContentHandler handler;
>             Path outputFilePath = null;
>
>             if (largeFile) {
>                 // write tika output to disk
>                 outputFilePath = Paths.get("/home/ubuntu/testdirs/testdir_html/tika_parse_output.txt");
>                 outputFileWriter = new PrintWriter(Files.newOutputStream(outputFilePath));
>                 handler = new BodyContentHandler(outputFileWriter);
>             } else {
>                 // stream it in memory
>                 handler = new BodyContentHandler(-1);
>             }
>
>             Metadata metadata = new Metadata();
>             FileInputStream inputData = new FileInputStream(inputFile.toFile());
>             TikaConfig config = new TikaConfig("/home/ubuntu/testdirs/testdir_html/tika-config.xml");
>             Parser autoDetectParser = new AutoDetectParser(config);
>             ParseContext context = new ParseContext();
>             context.set(TikaConfig.class, config);
>             autoDetectParser.parse(inputData, handler, metadata, context);
>
>             String content;
>             if (largeFile) {
>                 content = FileUtils.readFileToString(outputFilePath.toFile());
>             } else {
>                 content = handler.toString();
>             }
>             System.out.println("content = " + content);
>         } catch (Exception ex) {
>             ex.printStackTrace();
>         } finally {
>             if (outputFileWriter != null) {
>                 outputFileWriter.close();
>             }
>         }
>     }
> }
> {code}
-- This message was sent by Atlassian Jira (v8.20.10#820010)
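Editorial note: the garbled output quoted in TIKA-4245 is consistent with the comment that the file "claims to be utf-16 but it isn't". Single-byte HTML text read as big-endian UTF-16 pairs up adjacent bytes, and pairs such as `<h` (0x3C 0x68) land in the CJK ranges. The sketch below reproduces the first three characters of the reported output from the ASCII text `<html ` using only the JDK; the class name `Utf16MislabelDemo` is made up for illustration, and the sketch says nothing about how Tika itself picks the charset.

```java
import java.nio.charset.StandardCharsets;

// Illustration only: shows the byte-pairing effect, not Tika's charset handling.
final class Utf16MislabelDemo {

    /** Decodes single-byte ASCII text as if it were big-endian UTF-16. */
    static String decodeAsUtf16(String asciiText) {
        byte[] bytes = asciiText.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.UTF_16BE);
    }

    /** Reverses the damage: re-encode as UTF-16BE, then decode as ASCII. */
    static String recover(String garbled) {
        byte[] bytes = garbled.getBytes(StandardCharsets.UTF_16BE);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // "<h" -> U+3C68, "tm" -> U+746D, "l " -> U+6C20
        String garbled = decodeAsUtf16("<html ");
        System.out.println(garbled);
        System.out.println(recover(garbled)); // prints the original "<html "
    }
}
```

Applying the same pairing to the rest of the reported string yields text beginning with `<html xmlns:`, which suggests the sample is ordinary single-byte HTML carrying a wrong UTF-16 declaration.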
[jira] [Commented] (TIKA-4245) Tika does not get html content properly
[ https://issues.apache.org/jira/browse/TIKA-4245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840908#comment-17840908 ] Tilman Hausherr commented on TIKA-4245: --- Happens also with the tika app GUI. > Tika does not get html content properly > > > Key: TIKA-4245 > URL: https://issues.apache.org/jira/browse/TIKA-4245 > Project: Tika > Issue Type: Bug >Reporter: Xiaohong Yang >Priority: Major > Attachments: Sample html file and tika config xml.zip > > > We use org.apache.tika.parser.AutoDetectParser to get the content of html > files. And we found out that it does not get the content of the sample file > properly. > Following is the sample code and attached is the tika-config.xml and the > sample html file. The content extracted with Tika reads > "㱨瑭氠硭汮猺景㴢桴瑰㨯⽷睷㌮潲术ㄹ㤹⽘卌⽆潲浡琢㸍ਉ़桥慤㸼䵅呁瑴瀭敱畩瘽≃潮瑥湴ⵔ祰攢潮瑥湴㴢瑥硴…". That is different > from the native file. > > > The operating system is Ubuntu 20.04. Java version is 21. Tika version is > 2.9.2.
> {code:java}
> import org.apache.commons.io.FileUtils;
> import org.apache.tika.config.TikaConfig;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.apache.tika.sax.BodyContentHandler;
>
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.PrintWriter;
> import java.nio.file.Files;
> import java.nio.file.Path;
> import java.nio.file.Paths;
>
> public class ExtractTxtFromHtml {
>     private static final Path inputFile = new File("/home/ubuntu/testdirs/testdir_html/451434.html").toPath();
>
>     public static void main(String args[]) {
>         extactText(false);
>         extactText(true);
>     }
>
>     static void extactText(boolean largeFile) {
>         PrintWriter outputFileWriter = null;
>         try {
>             BodyContentHandler handler;
>             Path outputFilePath = null;
>
>             if (largeFile) {
>                 // write tika output to disk
>                 outputFilePath = Paths.get("/home/ubuntu/testdirs/testdir_html/tika_parse_output.txt");
>                 outputFileWriter = new PrintWriter(Files.newOutputStream(outputFilePath));
>                 handler = new BodyContentHandler(outputFileWriter);
>             } else {
>                 // stream it in memory
>                 handler = new BodyContentHandler(-1);
>             }
>
>             Metadata metadata = new Metadata();
>             FileInputStream inputData = new FileInputStream(inputFile.toFile());
>             TikaConfig config = new TikaConfig("/home/ubuntu/testdirs/testdir_html/tika-config.xml");
>             Parser autoDetectParser = new AutoDetectParser(config);
>             ParseContext context = new ParseContext();
>             context.set(TikaConfig.class, config);
>             autoDetectParser.parse(inputData, handler, metadata, context);
>
>             String content;
>             if (largeFile) {
>                 content = FileUtils.readFileToString(outputFilePath.toFile());
>             } else {
>                 content = handler.toString();
>             }
>             System.out.println("content = " + content);
>         } catch (Exception ex) {
>             ex.printStackTrace();
>         } finally {
>             if (outputFileWriter != null) {
>                 outputFileWriter.close();
>             }
>         }
>     }
> }
> {code}
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4245) Tika does not get html content properly
[ https://issues.apache.org/jira/browse/TIKA-4245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4245: -- Description: We use org.apache.tika.parser.AutoDetectParser to get the content of html files. And we found out that it does not get the content of the sample file properly. Following is the sample code and attached is the tika-config.xml and the sample html file. The content extracted with Tika reads "㱨瑭氠硭汮猺景㴢桴瑰㨯⽷睷㌮潲术ㄹ㤹⽘卌⽆潲浡琢㸍ਉ़桥慤㸼䵅呁瑴瀭敱畩瘽≃潮瑥湴ⵔ祰攢潮瑥湴㴢瑥硴…". That is different from the native file. The operating system is Ubuntu 20.04. Java version is 21. Tika version is 2.9.2.
{code:java}
import org.apache.commons.io.FileUtils;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

import java.io.File;
import java.io.FileInputStream;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ExtractTxtFromHtml {
    private static final Path inputFile = new File("/home/ubuntu/testdirs/testdir_html/451434.html").toPath();

    public static void main(String args[]) {
        extactText(false);
        extactText(true);
    }

    static void extactText(boolean largeFile) {
        PrintWriter outputFileWriter = null;
        try {
            BodyContentHandler handler;
            Path outputFilePath = null;

            if (largeFile) {
                // write tika output to disk
                outputFilePath = Paths.get("/home/ubuntu/testdirs/testdir_html/tika_parse_output.txt");
                outputFileWriter = new PrintWriter(Files.newOutputStream(outputFilePath));
                handler = new BodyContentHandler(outputFileWriter);
            } else {
                // stream it in memory
                handler = new BodyContentHandler(-1);
            }

            Metadata metadata = new Metadata();
            FileInputStream inputData = new FileInputStream(inputFile.toFile());
            TikaConfig config = new TikaConfig("/home/ubuntu/testdirs/testdir_html/tika-config.xml");
            Parser autoDetectParser = new AutoDetectParser(config);
            ParseContext context = new ParseContext();
            context.set(TikaConfig.class, config);
            autoDetectParser.parse(inputData, handler, metadata, context);

            String content;
            if (largeFile) {
                content = FileUtils.readFileToString(outputFilePath.toFile());
            } else {
                content = handler.toString();
            }
            System.out.println("content = " + content);
        } catch (Exception ex) {
            ex.printStackTrace();
        } finally {
            if (outputFileWriter != null) {
                outputFileWriter.close();
            }
        }
    }
}
{code}
was: We use org.apache.tika.parser.AutoDetectParser to get the content of html files. And we found out that it does not get the content of the sample file properly. Following is the sample code and attached is the tika-config.xml and the sample html file. The content extracted with Tika reads "㱨瑭氠硭汮猺景㴢桴瑰㨯⽷睷㌮潲术ㄹ㤹⽘卌⽆潲浡琢㸍ਉ़桥慤㸼䵅呁瑴瀭敱畩瘽≃潮瑥湴ⵔ祰攢潮瑥湴㴢瑥硴…". That is different from the native file. The operating system is Ubuntu 20.04. Java version is 21. Tika version is 2.9.2. 
[jira] [Comment Edited] (TIKA-4166) dependency updates for Tika 3.0
[ https://issues.apache.org/jira/browse/TIKA-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839745#comment-17839745 ] Tilman Hausherr edited comment on TIKA-4166 at 4/22/24 3:27 PM: It turned out to be something different than the missing package. After googling for the error message I found an SO answer that I had upvoted in the past https://stackoverflow.com/a/54467008/535646 was (Author: tilman): It turned out to be something different than the missing package. After googling for the error message I found an SO that I had upvoted in the past https://stackoverflow.com/a/54467008/535646 > dependency updates for Tika 3.0 > --- > > Key: TIKA-4166 > URL: https://issues.apache.org/jira/browse/TIKA-4166 > Project: Tika > Issue Type: Task > Components: build >Reporter: Tilman Hausherr >Priority: Minor > Fix For: 3.0.0-BETA > > > Separate ticket for updates for 3.0, especially those not found by dependabot. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4166) dependency updates for Tika 3.0
[ https://issues.apache.org/jira/browse/TIKA-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839745#comment-17839745 ] Tilman Hausherr commented on TIKA-4166: --- It turned out to be something different than the missing package. After googling for the error message I found an SO answer that I had upvoted in the past: https://stackoverflow.com/a/54467008/535646 > dependency updates for Tika 3.0 > --- > > Key: TIKA-4166 > URL: https://issues.apache.org/jira/browse/TIKA-4166 > Project: Tika > Issue Type: Task > Components: build >Reporter: Tilman Hausherr >Priority: Minor > Fix For: 3.0.0-BETA > > > Separate ticket for updates for 3.0, especially those not found by dependabot. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4166) dependency updates for Tika 3.0
[ https://issues.apache.org/jira/browse/TIKA-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839652#comment-17839652 ]
Tilman Hausherr commented on TIKA-4166:
---
The latest Apache parent update means a javadoc update and it results in a failure on the CI:
{noformat}
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-javadoc-plugin:3.6.3:aggregate (default-cli) on project tika: An error has occurred in Javadoc report generation:
[ERROR] Exit code: 2
[ERROR] javadoc: error - No source files for package org.apache.tika.extractor
[ERROR] Command line was: /usr/local/asfpackages/java/adoptium-jdk-11.0.16.1+1/bin/javadoc @options @packages
{noformat}
A possible cause for this could be that in tika-batch there is a test package that doesn't exist as a source package. It didn't happen locally for me because I didn't use "javadoc:aggregate". I'll do some more tests to see whether renaming the test package fixes this.
> dependency updates for Tika 3.0
> ---
>
> Key: TIKA-4166
> URL: https://issues.apache.org/jira/browse/TIKA-4166
> Project: Tika
> Issue Type: Task
> Components: build
> Reporter: Tilman Hausherr
> Priority: Minor
> Fix For: 3.0.0-BETA
>
>
> Separate ticket for updates for 3.0, especially those not found by dependabot.
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4240) Change dependabot to weekly
[ https://issues.apache.org/jira/browse/TIKA-4240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17836236#comment-17836236 ] Tilman Hausherr commented on TIKA-4240: --- I prefer daily but if more people feel pressured or annoyed by these mails (I never felt that way) then I accept weekly. > Change dependabot to weekly > --- > > Key: TIKA-4240 > URL: https://issues.apache.org/jira/browse/TIKA-4240 > Project: Tika > Issue Type: Task > Components: build >Reporter: Tim Allison >Priority: Trivial > > On the list, I proposed this change. Some were in favor of dropping it back > to monthly. [~tilman] made the argument for the benefit of seeing problems > quickly and also acknowledged that it is a burden to merge the daily PRs. > I propose bumping dependabot back to weekly for a bit, and we'll see how it > works as a middle ground. > If anyone feels strongly about moving back to daily, we can do that. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4240) Change dependabot to weekly
[ https://issues.apache.org/jira/browse/TIKA-4240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4240: -- Component/s: build > Change dependabot to weekly > --- > > Key: TIKA-4240 > URL: https://issues.apache.org/jira/browse/TIKA-4240 > Project: Tika > Issue Type: Task > Components: build >Reporter: Tim Allison >Priority: Trivial > > On the list, I proposed this change. Some were in favor of dropping it back > to monthly. [~tilman] made the argument for the benefit of seeing problems > quickly and also acknowledged that it is a burden to merge the daily PRs. > I propose bumping dependabot back to weekly for a bit, and we'll see how it > works as a middle ground. > If anyone feels strongly about moving back to daily, we can do that. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4240) Change dependabot to weekly
[ https://issues.apache.org/jira/browse/TIKA-4240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17836224#comment-17836224 ] Tilman Hausherr commented on TIKA-4240: --- Not a burden (that was Eric, sort-of), I just don't have the time right now to fix the current build failure. I like the alerts, it's a low hanging fruit and also helps me to learn more about the code. > Change dependabot to weekly > --- > > Key: TIKA-4240 > URL: https://issues.apache.org/jira/browse/TIKA-4240 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Trivial > > On the list, I proposed this change. Some were in favor of dropping it back > to monthly. [~tilman] made the argument for the benefit of seeing problems > quickly and also acknowledged that it is a burden to merge the daily PRs. > I propose bumping dependabot back to weekly for a bit, and we'll see how it > works as a middle ground. > If anyone feels strongly about moving back to daily, we can do that. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4238) replace some deprecated code
[ https://issues.apache.org/jira/browse/TIKA-4238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17834529#comment-17834529 ] Tilman Hausherr commented on TIKA-4238: --- This was a low-hanging fruit. I could also have done UnsynchronizedByteArrayInputStream, but replacing that one would not only make the code much bigger, it would also require catching an exception that isn't thrown now, so let's just wait and see what they do. https://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/input/UnsynchronizedByteArrayInputStream.Builder.html#get() > replace some deprecated code > > > Key: TIKA-4238 > URL: https://issues.apache.org/jira/browse/TIKA-4238 > Project: Tika > Issue Type: Task >Affects Versions: 2.9.2 >Reporter: Tilman Hausherr >Assignee: Tilman Hausherr >Priority: Minor > Fix For: 3.0.0, 2.9.3 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (TIKA-4238) replace some deprecated code
[ https://issues.apache.org/jira/browse/TIKA-4238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17834529#comment-17834529 ] Tilman Hausherr edited comment on TIKA-4238 at 4/6/24 2:12 PM: --- This was a low-hanging fruit. I could also have done UnsynchronizedByteArrayInputStream, but replacing that one would not only make the code much bigger, it would also require catching an exception that isn't thrown now, so let's just wait and see what they do in the future. https://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/input/UnsynchronizedByteArrayInputStream.Builder.html#get() was (Author: tilman): This was a low-hanging fruit. I could also have done UnsynchronizedByteArrayInputStream, but replacing that one would not only would make the code much bigger, it would also require to catch an exception that isn't thrown now, so lets just wait what they do. https://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/input/UnsynchronizedByteArrayInputStream.Builder.html#get() > replace some deprecated code > > > Key: TIKA-4238 > URL: https://issues.apache.org/jira/browse/TIKA-4238 > Project: Tika > Issue Type: Task >Affects Versions: 2.9.2 >Reporter: Tilman Hausherr >Assignee: Tilman Hausherr >Priority: Minor > Fix For: 3.0.0, 2.9.3 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4218) Run regression tests to support 2.9.2 release
[ https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4218: -- Affects Version/s: 2.9.1 > Run regression tests to support 2.9.2 release > - > > Key: TIKA-4218 > URL: https://issues.apache.org/jira/browse/TIKA-4218 > Project: Tika > Issue Type: Task >Affects Versions: 2.9.1 >Reporter: Tim Allison >Assignee: Tim Allison >Priority: Major > Fix For: 2.9.2 > > Attachments: 2.9.1-876503.pdf.json, 2.9.2-876503.pdf.json > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (TIKA-4218) Run regression tests to support 2.9.2 release
[ https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr resolved TIKA-4218. --- Assignee: Tim Allison Resolution: Fixed > Run regression tests to support 2.9.2 release > - > > Key: TIKA-4218 > URL: https://issues.apache.org/jira/browse/TIKA-4218 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Assignee: Tim Allison >Priority: Major > Attachments: 2.9.1-876503.pdf.json, 2.9.2-876503.pdf.json > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (TIKA-4171) Tika server only returns last value for PDFs that have multiple of the same key
[ https://issues.apache.org/jira/browse/TIKA-4171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr reassigned TIKA-4171: - Assignee: Tim Allison > Tika server only returns last value for PDFs that have multiple of the same > key > --- > > Key: TIKA-4171 > URL: https://issues.apache.org/jira/browse/TIKA-4171 > Project: Tika > Issue Type: Bug > Components: tika-server >Reporter: Cassandra Xia >Assignee: Tim Allison >Priority: Major > Fix For: 3.0.0-BETA, 2.9.2 > > Attachments: 20230801-5207_QF20-270 East River Solar Form 556 recert > FINAL.pdf, 876503.pdf, example-output.txt, screenshot.png, > testPDF_XFA_govdocs1_258578.pdf.html > > > Thanks for the great work on Tika server, it is the only OSS that can handle > Adobe's protected form format that FERC uses. > One problem that I'm hitting is that the FERC form that I am parsing has > multiple values for the same key name, e.g. in the screenshot below line 1-7 > all have the same key name. When Tika Server parses this PDF, it only returns > the value in row 7 (losing the previous 6 values). > My hunch is that somewhere in Tika Server, the values are getting stored in > some dictionary object, so the final value is the only survivor. Would it be > possible to return the extra values as a list from Tika Server? > Example PDF attached - thank you for taking a look! > !https://mail.google.com/mail/u/0?ui=2=ee87dc4bd1=0.0.7=msg-f:1782641700487887488=18bd372e8760fa80=fimg=ip=s0-l75-ft=ANGjdJ9qEkw6kZ9yBDfMBOUuvFB1Tk8Pti0rRvReEq-eWUoJQxLA6rZ0TQvWCsKUySaDPjjrSi-IiyKseDYpFGzF44A3iSaFw9sOanoBdFMNEZciDnaGhsUFvLSIH_0=emb=ii_lmdun7ff6! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4218) Run regression tests to support 2.9.2 release
[ https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4218: -- Fix Version/s: 2.9.2 > Run regression tests to support 2.9.2 release > - > > Key: TIKA-4218 > URL: https://issues.apache.org/jira/browse/TIKA-4218 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Assignee: Tim Allison >Priority: Major > Fix For: 2.9.2 > > Attachments: 2.9.1-876503.pdf.json, 2.9.2-876503.pdf.json > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (TIKA-4171) Tika server only returns last value for PDFs that have multiple of the same key
[ https://issues.apache.org/jira/browse/TIKA-4171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr resolved TIKA-4171. --- Resolution: Fixed > Tika server only returns last value for PDFs that have multiple of the same > key > --- > > Key: TIKA-4171 > URL: https://issues.apache.org/jira/browse/TIKA-4171 > Project: Tika > Issue Type: Bug > Components: tika-server >Reporter: Cassandra Xia >Priority: Major > Fix For: 2.9.2, 3.0.0-BETA > > Attachments: 20230801-5207_QF20-270 East River Solar Form 556 recert > FINAL.pdf, 876503.pdf, example-output.txt, screenshot.png, > testPDF_XFA_govdocs1_258578.pdf.html > > > Thanks for the great work on Tika server, it is the only OSS that can handle > Adobe's protected form format that FERC uses. > One problem that I'm hitting is that the FERC form that I am parsing has > multiple values for the same key name, e.g. in the screenshot below line 1-7 > all have the same key name. When Tika Server parses this PDF, it only returns > the value in row 7 (losing the previous 6 values). > My hunch is that somewhere in Tika Server, the values are getting stored in > some dictionary object, so the final value is the only survivor. Would it be > possible to return the extra values as a list from Tika Server? > Example PDF attached - thank you for taking a look! > !https://mail.google.com/mail/u/0?ui=2=ee87dc4bd1=0.0.7=msg-f:1782641700487887488=18bd372e8760fa80=fimg=ip=s0-l75-ft=ANGjdJ9qEkw6kZ9yBDfMBOUuvFB1Tk8Pti0rRvReEq-eWUoJQxLA6rZ0TQvWCsKUySaDPjjrSi-IiyKseDYpFGzF44A3iSaFw9sOanoBdFMNEZciDnaGhsUFvLSIH_0=emb=ii_lmdun7ff6! -- This message was sent by Atlassian Jira (v8.20.10#820010)
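The reporter's hunch above (values stored in a dictionary so that only the last one survives) can be illustrated with a minimal sketch in plain Java. This is not Tika's code: the class name MultiValueDemo, both method names, and the sample rows are hypothetical and exist only to show the difference between a single-valued and a multi-valued map.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; none of these names come from Tika.
class MultiValueDemo {

    // Single-valued map: each put() on an existing key replaces the old value.
    static Map<String, String> lastValueWins(String[][] pairs) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String[] pair : pairs) {
            result.put(pair[0], pair[1]);
        }
        return result;
    }

    // Multi-valued map: values for a repeated key are appended to a list.
    static Map<String, List<String>> keepAllValues(String[][] pairs) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (String[] pair : pairs) {
            result.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up rows standing in for form fields that share one key name.
        String[][] rows = {{"field", "row 1"}, {"field", "row 2"}, {"field", "row 7"}};
        System.out.println(lastValueWins(rows)); // {field=row 7}
        System.out.println(keepAllValues(rows)); // {field=[row 1, row 2, row 7]}
    }
}
```

Whether the actual fix took this shape is not stated in the ticket; the sketch only shows why a plain key-to-value map would drop the earlier rows.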
[jira] [Resolved] (TIKA-4238) replace some deprecated code
[ https://issues.apache.org/jira/browse/TIKA-4238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr resolved TIKA-4238. --- Resolution: Fixed > replace some deprecated code > > > Key: TIKA-4238 > URL: https://issues.apache.org/jira/browse/TIKA-4238 > Project: Tika > Issue Type: Task >Affects Versions: 2.9.2 >Reporter: Tilman Hausherr >Assignee: Tilman Hausherr >Priority: Minor > Fix For: 3.0.0, 2.9.3 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-4239) Update to 2.9.3
Tilman Hausherr created TIKA-4239: - Summary: Update to 2.9.3 Key: TIKA-4239 URL: https://issues.apache.org/jira/browse/TIKA-4239 Project: Tika Issue Type: Task Components: build Reporter: Tilman Hausherr -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4239) Update to 2.9.3
[ https://issues.apache.org/jira/browse/TIKA-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4239: -- Affects Version/s: 2.9.2 > Update to 2.9.3 > --- > > Key: TIKA-4239 > URL: https://issues.apache.org/jira/browse/TIKA-4239 > Project: Tika > Issue Type: Task > Components: build >Affects Versions: 2.9.2 >Reporter: Tilman Hausherr >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (TIKA-4162) Update to 2.9.2
[ https://issues.apache.org/jira/browse/TIKA-4162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr resolved TIKA-4162. --- Assignee: Tilman Hausherr Resolution: Fixed > Update to 2.9.2 > --- > > Key: TIKA-4162 > URL: https://issues.apache.org/jira/browse/TIKA-4162 > Project: Tika > Issue Type: Task > Components: build >Affects Versions: 2.9.1 >Reporter: Tilman Hausherr >Assignee: Tilman Hausherr >Priority: Minor > Fix For: 2.9.2 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-4238) replace some deprecated code
Tilman Hausherr created TIKA-4238: - Summary: replace some deprecated code Key: TIKA-4238 URL: https://issues.apache.org/jira/browse/TIKA-4238 Project: Tika Issue Type: Task Affects Versions: 2.9.2 Reporter: Tilman Hausherr Assignee: Tilman Hausherr Fix For: 3.0.0, 2.9.3 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4236) tika-parser-nlp-module has an unnecessary Guava dependency
[ https://issues.apache.org/jira/browse/TIKA-4236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4236: -- Fix Version/s: 2.9.3 > tika-parser-nlp-module has an unnecessary Guava dependency > -- > > Key: TIKA-4236 > URL: https://issues.apache.org/jira/browse/TIKA-4236 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.28.5, 3.0.0-BETA, 2.9.2 >Reporter: Manfred Baedke >Priority: Major > Fix For: 3.0.0, 2.9.3 > > > This should be avoided, because it's prone to maintenance and security > problems. > It's easy to get rid of it: the class > {{o.a.t.parser.geo.topic.gazetteer.GeoGazetteerClient}} uses > {{{}com.google.common.reflect.TypeToken{}}}. Since the project uses gson > anyway, it could just be replaced with > {{{}com.google.gson.reflect.TypeToken{}}}. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4236) tika-parser-nlp-module has an unnecessary Guava dependency
[ https://issues.apache.org/jira/browse/TIKA-4236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4236: -- Fix Version/s: (was: 2.9.2) > tika-parser-nlp-module has an unnecessary Guava dependency > -- > > Key: TIKA-4236 > URL: https://issues.apache.org/jira/browse/TIKA-4236 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.28.5, 3.0.0-BETA, 2.9.2 >Reporter: Manfred Baedke >Priority: Major > Fix For: 3.0.0 > > > This should be avoided, because it's prone to maintenance and security > problems. > It's easy to get rid of it: the class > {{o.a.t.parser.geo.topic.gazetteer.GeoGazetteerClient}} uses > {{{}com.google.common.reflect.TypeToken{}}}. Since the project uses gson > anyway, it could just be replaced with > {{{}com.google.gson.reflect.TypeToken{}}}. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (TIKA-4236) tika-parser-nlp-module has an unnecessary Guava dependency
[ https://issues.apache.org/jira/browse/TIKA-4236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr resolved TIKA-4236. --- Assignee: Tilman Hausherr Resolution: Fixed > tika-parser-nlp-module has an unnecessary Guava dependency > -- > > Key: TIKA-4236 > URL: https://issues.apache.org/jira/browse/TIKA-4236 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.28.5, 3.0.0-BETA, 2.9.2 >Reporter: Manfred Baedke >Assignee: Tilman Hausherr >Priority: Major > Fix For: 3.0.0, 2.9.3 > > > This should be avoided, because it's prone to maintenance and security > problems. > It's easy to get rid of it: the class > {{o.a.t.parser.geo.topic.gazetteer.GeoGazetteerClient}} uses > {{{}com.google.common.reflect.TypeToken{}}}. Since the project uses gson > anyway, it could just be replaced with > {{{}com.google.gson.reflect.TypeToken{}}}. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4236) tika-parser-nlp-module has an unnecessary Guava dependency
[ https://issues.apache.org/jira/browse/TIKA-4236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4236: -- Fix Version/s: 2.9.2 3.0.0 > tika-parser-nlp-module has an unnecessary Guava dependency > -- > > Key: TIKA-4236 > URL: https://issues.apache.org/jira/browse/TIKA-4236 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.28.5, 3.0.0-BETA, 2.9.2 >Reporter: Manfred Baedke >Priority: Major > Fix For: 2.9.2, 3.0.0 > > > This should be avoided, because it's prone to maintenance and security > problems. > It's easy to get rid of it: the class > {{o.a.t.parser.geo.topic.gazetteer.GeoGazetteerClient}} uses > {{{}com.google.common.reflect.TypeToken{}}}. Since the project uses gson > anyway, it could just be replaced with > {{{}com.google.gson.reflect.TypeToken{}}}. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4236) tika-parser-nlp-module has an unnecessary Guava dependency
[ https://issues.apache.org/jira/browse/TIKA-4236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17834385#comment-17834385 ] Tilman Hausherr commented on TIKA-4236: --- I found only a test dependency declared directly. It's still possible that Guava is pulled in transitively through some other dependency. It would be best if you tried with a snapshot. > tika-parser-nlp-module has an unnecessary Guava dependency > -- > > Key: TIKA-4236 > URL: https://issues.apache.org/jira/browse/TIKA-4236 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.28.5, 3.0.0-BETA, 2.9.2 >Reporter: Manfred Baedke >Priority: Major > > This should be avoided, because it's prone to maintenance and security > problems. > It's easy to get rid of it: the class > {{o.a.t.parser.geo.topic.gazetteer.GeoGazetteerClient}} uses > {{{}com.google.common.reflect.TypeToken{}}}. Since the project uses gson > anyway, it could just be replaced with > {{{}com.google.gson.reflect.TypeToken{}}}. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4236) tika-parser-nlp-module has an unnecessary Guava dependency
[ https://issues.apache.org/jira/browse/TIKA-4236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17834282#comment-17834282 ] Tilman Hausherr commented on TIKA-4236: --- https://tika.apache.org/ "The Apache Tika PMC has set September 30, 2022 as the End Of Life for the Tika 1.x branch. The PMC will make security fixes for the 1.x branch until that date." > tika-parser-nlp-module has an unnecessary Guava dependency > -- > > Key: TIKA-4236 > URL: https://issues.apache.org/jira/browse/TIKA-4236 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.28.5, 3.0.0-BETA, 2.9.2 >Reporter: Manfred Baedke >Priority: Major > > This should be avoided, because it's prone to maintenance and security > problems. > It's easy to get rid of it: the class > {{o.a.t.parser.geo.topic.gazetteer.GeoGazetteerClient}} uses > {{{}com.google.common.reflect.TypeToken{}}}. Since the project uses gson > anyway, it could just be replaced with > {{{}com.google.gson.reflect.TypeToken{}}}. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (TIKA-4236) tika-parser-nlp-module has an unnecessary Guava dependency
[ https://issues.apache.org/jira/browse/TIKA-4236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17834277#comment-17834277 ] Tilman Hausherr edited comment on TIKA-4236 at 4/5/24 12:21 PM: Is this what you had in mind? https://github.com/apache/tika/commit/1586e7281837850cfe36d6201d905edb2d77241f https://github.com/apache/tika/commit/920976d9f26695892ab503b8c93b213d9622d18e this is for the trunk, I can also do it for 2.9.3 but 1.* is no longer supported. Btw guava might still reappear in other tika components. was (Author: tilman): Is this what you had in mind? https://github.com/apache/tika/commit/1586e7281837850cfe36d6201d905edb2d77241f https://github.com/apache/tika/commit/920976d9f26695892ab503b8c93b213d9622d18e this is for the trunk, I can also do it for 2.9.3 but 1.* is no longer supported. > tika-parser-nlp-module has an unnecessary Guava dependency > -- > > Key: TIKA-4236 > URL: https://issues.apache.org/jira/browse/TIKA-4236 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.28.5, 3.0.0-BETA, 2.9.2 >Reporter: Manfred Baedke >Priority: Major > > This should be avoided, because it's prone to maintenance and security > problems. > It's easy to get rid of it: the class > {{o.a.t.parser.geo.topic.gazetteer.GeoGazetteerClient}} uses > {{{}com.google.common.reflect.TypeToken{}}}. Since the project uses gson > anyway, it could just be replaced with > {{{}com.google.gson.reflect.TypeToken{}}}. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4236) tika-parser-nlp-module has an unnecessary Guava dependency
[ https://issues.apache.org/jira/browse/TIKA-4236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17834277#comment-17834277 ] Tilman Hausherr commented on TIKA-4236: --- Is this what you had in mind? https://github.com/apache/tika/commit/1586e7281837850cfe36d6201d905edb2d77241f https://github.com/apache/tika/commit/920976d9f26695892ab503b8c93b213d9622d18e this is for the trunk, I can also do it for 2.9.3 but 1.* is no longer supported. > tika-parser-nlp-module has an unnecessary Guava dependency > -- > > Key: TIKA-4236 > URL: https://issues.apache.org/jira/browse/TIKA-4236 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.28.5, 3.0.0-BETA, 2.9.2 >Reporter: Manfred Baedke >Priority: Major > > This should be avoided, because it's prone to maintenance and security > problems. > It's easy to get rid of it: the class > {{o.a.t.parser.geo.topic.gazetteer.GeoGazetteerClient}} uses > {{{}com.google.common.reflect.TypeToken{}}}. Since the project uses gson > anyway, it could just be replaced with > {{{}com.google.gson.reflect.TypeToken{}}}. -- This message was sent by Atlassian Jira (v8.20.10#820010)
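For readers wondering why one TypeToken class can stand in for the other: both the Guava and the Gson class rely on the same "super type token" idiom, in which an anonymous subclass records its generic type argument so it can be read back at runtime. The following is a minimal sketch of that idiom using only the JDK. SimpleTypeToken and TypeTokenDemo are hypothetical names, and this is neither library's implementation nor the code changed in the linked commits.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.List;
import java.util.Map;

class TypeTokenDemo {
    // Captures List<Map<String, String>> despite type erasure.
    static Type listOfStringMaps() {
        return new SimpleTypeToken<List<Map<String, String>>>() { }.getType();
    }

    public static void main(String[] args) {
        System.out.println(listOfStringMaps().getTypeName());
    }
}

// Hypothetical stand-in for a TypeToken class; not the Guava or Gson implementation.
abstract class SimpleTypeToken<T> {
    private final Type type;

    protected SimpleTypeToken() {
        // An anonymous subclass keeps its generic superclass, including T, in its class metadata.
        Type superclass = getClass().getGenericSuperclass();
        if (!(superclass instanceof ParameterizedType)) {
            throw new IllegalStateException("Create an anonymous subclass with a type argument");
        }
        this.type = ((ParameterizedType) superclass).getActualTypeArguments()[0];
    }

    Type getType() {
        return type;
    }
}
```

Because the captured value is an ordinary java.lang.reflect.Type, code that only needs the type for JSON deserialization should not care which library produced it, which is presumably why the swap proposed in the ticket is a small change.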
[jira] [Commented] (TIKA-4231) Parsing Arabic PDF is returning bad data
[ https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833807#comment-17833807 ] Tilman Hausherr commented on TIKA-4231: --- Yes, it is text, but the PDF uses a feature that we don't support. Instead of each glyph having its own Unicode mapping, the text for extraction is stored on a separate level. > Parsing Arabic PDF is returning bad data > > > Key: TIKA-4231 > URL: https://issues.apache.org/jira/browse/TIKA-4231 > Project: Tika > Issue Type: Bug >Affects Versions: 2.6.0, 2.9.1 > Environment: I am using Java 18. And using maven dependency > tika-parsers-standard-package > ([https://mvnrepository.com/artifact/org.apache.tika/tika-parsers/2.6.0)] > >Reporter: Aamir >Priority: Major > Attachments: arabic-pdfbox.txt, arabic.pdf, arabic.txt > > > Attached is a PDF with arabic text in it. > When parsed using tika version 2.6.0 or 2.9.1, it produces gibberish > characters. > The generated text doc is also attached which contains the parsed text. > Most of the other Arabic PDFs parse fine, but this one is giving this output. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4231) Parsing Arabic PDF is returning bad data
[ https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833385#comment-17833385 ] Tilman Hausherr commented on TIKA-4231: --- No this is not being worked on. You'll have to use OCR. > Parsing Arabic PDF is returning bad data > > > Key: TIKA-4231 > URL: https://issues.apache.org/jira/browse/TIKA-4231 > Project: Tika > Issue Type: Bug >Affects Versions: 2.6.0, 2.9.1 > Environment: I am using Java 18. And using maven dependency > tika-parsers-standard-package > ([https://mvnrepository.com/artifact/org.apache.tika/tika-parsers/2.6.0)] > >Reporter: Aamir >Priority: Major > Attachments: arabic-pdfbox.txt, arabic.pdf, arabic.txt > > > Attached is a PDF with arabic text in it. > When parsed using tika version 2.6.0 or 2.9.1, it produces gibberish > characters. > The generated text doc is also attached which contains the parsed text. > Most of the other Arabic PDFs parse fine, but this one is giving this output. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4231) Parsing Arabic PDF is returning bad data
[ https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832291#comment-17832291 ] Tilman Hausherr commented on TIKA-4231: --- I have attached an extraction with pdfbox 2.0.31: [^arabic-pdfbox.txt] is this better, or not? I've added a BOM and removed the 00 bytes. In the tika extraction there are many "ef bf bd" bytes instead which is the utf8 replacement character �. A possible explanation why Adobe Reader works better is that this file uses the "ActualText"-feature which PDFBox doesn't support (PDFBOX-3248). > Parsing Arabic PDF is returning bad data > > > Key: TIKA-4231 > URL: https://issues.apache.org/jira/browse/TIKA-4231 > Project: Tika > Issue Type: Bug >Affects Versions: 2.6.0, 2.9.1 > Environment: I am using Java 18. And using maven dependency > tika-parsers-standard-package > ([https://mvnrepository.com/artifact/org.apache.tika/tika-parsers/2.6.0)] > >Reporter: Aamir >Priority: Major > Attachments: arabic-pdfbox.txt, arabic.pdf, arabic.txt > > > Attached is a PDF with arabic text in it. > When parsed using tika version 2.6.0 or 2.9.1, it produces gibberish > characters. > The generated text doc is also attached which contains the parsed text. > Most of the other Arabic PDFs parse fine, but this one is giving this output. -- This message was sent by Atlassian Jira (v8.20.10#820010)
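The byte values mentioned in the comment above can be checked with a few lines of plain Java. This is a small sketch for verification only: the class name ReplacementCharDemo and the helper toHex are hypothetical and not part of Tika or PDFBox.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for inspecting UTF-8 byte sequences.
class ReplacementCharDemo {

    // Formats bytes as lowercase hex pairs separated by spaces.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+FFFD REPLACEMENT CHARACTER encoded as UTF-8
        System.out.println(toHex("\uFFFD".getBytes(StandardCharsets.UTF_8))); // ef bf bd
        // U+FEFF, used as the byte order mark, encoded as UTF-8
        System.out.println(toHex("\uFEFF".getBytes(StandardCharsets.UTF_8))); // ef bb bf
        // Decoding a byte that is not valid UTF-8 yields U+FFFD
        String decoded = new String(new byte[] {(byte) 0xff}, StandardCharsets.UTF_8);
        System.out.println(decoded.equals("\uFFFD")); // true
    }
}
```

Runs of "ef bf bd" in an extraction therefore suggest that the original characters were already lost before the text was written out, so the file cannot be repaired by re-decoding it.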
[jira] [Updated] (TIKA-4231) Parsing Arabic PDF is returning bad data
[ https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4231: -- Attachment: arabic-pdfbox.txt > Parsing Arabic PDF is returning bad data > > > Key: TIKA-4231 > URL: https://issues.apache.org/jira/browse/TIKA-4231 > Project: Tika > Issue Type: Bug >Affects Versions: 2.6.0, 2.9.1 > Environment: I am using Java 18. And using maven dependency > tika-parsers-standard-package > ([https://mvnrepository.com/artifact/org.apache.tika/tika-parsers/2.6.0)] > >Reporter: Aamir >Priority: Major > Attachments: arabic-pdfbox.txt, arabic.pdf, arabic.txt > > > Attached is a PDF with arabic text in it. > When parsed using tika version 2.6.0 or 2.9.1, it produces gibberish > characters. > The generated text doc is also attached which contains the parsed text. > Most of the other Arabic PDFs parse fine, but this one is giving this output. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4231) Parsing Arabic PDF is returning bad data
[ https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832284#comment-17832284 ] Tilman Hausherr commented on TIKA-4231: --- This doesn't change my argument. The latest version is 2.9.1, please try with that one. > Parsing Arabic PDF is returning bad data > > > Key: TIKA-4231 > URL: https://issues.apache.org/jira/browse/TIKA-4231 > Project: Tika > Issue Type: Bug >Affects Versions: 2.6.0 > Environment: I am using Java 18. And using maven dependency > tika-parsers-standard-package > ([https://mvnrepository.com/artifact/org.apache.tika/tika-parsers/2.6.0)] > >Reporter: Aamir >Priority: Major > Attachments: arabic.pdf, arabic.txt > > > Attached is a PDF with arabic text in it. > When parsed using tika version 2.6.0, it produces gibberish characters. > The generated text doc is also attached which contains the parsed text. > Most of the other Arabic PDFs parse fine, but this one is giving this output. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4231) Parsing Arabic PDF is returning bad data
[ https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832258#comment-17832258 ] Tilman Hausherr commented on TIKA-4231: --- The current tika version is 2.9.1, soon to be 2.9.2. There is no "PDFBox version 2.6.0". > Parsing Arabic PDF is returning bad data > > > Key: TIKA-4231 > URL: https://issues.apache.org/jira/browse/TIKA-4231 > Project: Tika > Issue Type: Bug >Affects Versions: 2.6.0 > Environment: I am using Java 18. And using maven dependency > tika-parsers-standard-package > ([https://mvnrepository.com/artifact/org.apache.tika/tika-parsers/2.6.0)] > >Reporter: Aamir >Priority: Major > Attachments: arabic.pdf, arabic.txt > > > Attached is a PDF with arabic text in it. > When parsed using PDFBox version 2.6.0, it produces gibberish characters. > The generated text doc is also attached which contains the parsed text. > Most of the other Arabic PDFs parse fine, but this one is giving this output. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4228) Tika parser crashes JVM when it gets metadata and embedded objects from pdf
[ https://issues.apache.org/jira/browse/TIKA-4228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4228: -- Affects Version/s: 2.9.0 > Tika parser crashes JVM when it gets metadata and embedded objects from pdf > --- > > Key: TIKA-4228 > URL: https://issues.apache.org/jira/browse/TIKA-4228 > Project: Tika > Issue Type: Bug >Affects Versions: 2.9.0 >Reporter: Xiaohong Yang >Priority: Major > Attachments: tika-config-and-sample-file.zip > > > [^tika-config-and-sample-file.zip] > > We use org.apache.tika.parser.AutoDetectParser to get metadata and embedded > objects from pdf documents. And we found out that it crashes the program (or > the JVM) when it gets metadata and embedded files from the sample pdf file. > > Following is the sample code and attached is the tika-config.xml and the > sample pdf file. Note that the sample file crashes the JVM in 1 out of 10 > runs in our production environment. Sometimes it happens when it gets > metadata and sometimes it happens when it extracts embedded files (the > chances are about 50/50). > > The operating system is Ubuntu 20.04. Java version is 21. Tika version is > 2.9.0 and POI version is 5.2.3. 
> > > import org.apache.pdfbox.io.IOUtils; > import org.apache.poi.poifs.filesystem.DirectoryEntry; > import org.apache.poi.poifs.filesystem.DocumentEntry; > import org.apache.poi.poifs.filesystem.DocumentInputStream; > import org.apache.poi.poifs.filesystem.POIFSFileSystem; > import org.apache.tika.config.TikaConfig; > import org.apache.tika.detect.Detector; > import org.apache.tika.extractor.EmbeddedDocumentExtractor; > import org.apache.tika.io.FilenameUtils; > import org.apache.tika.io.TikaInputStream; > import org.apache.tika.metadata.Metadata; > import org.apache.tika.metadata.TikaCoreProperties; > import org.apache.tika.mime.MediaType; > import org.apache.tika.parser.AutoDetectParser; > import org.apache.tika.parser.ParseContext; > import org.apache.tika.parser.Parser; > import org.apache.tika.sax.BodyContentHandler; > import org.xml.sax.ContentHandler; > import org.xml.sax.SAXException; > import org.xml.sax.helpers.DefaultHandler; > > import java.io.*; > import java.net.URL; > import java.nio.file.Files; > import java.nio.file.Path; > import java.nio.file.Paths; > > public class ProcessPdf { > private final Path inputFile = new > File("/home/ubuntu/testdirs/testdir_pdf/sample.pdf").toPath(); > private final Path outputDir = new > File("/home/ubuntu/testdirs/testdir_pdf/tika_output/").toPath(); > > private Parser parser; > private ParseContext context; > > > public static void main(String args[]) { > try > { System.out.println("Start"); ProcessPdf processPdf > = new ProcessPdf(); System.out.println("Get metadata"); > processPdf.getMataData(); System.out.println("Extract embedded > files"); processPdf.extract(); > System.out.println("End"); } > catch(Exception ex) > { ex.printStackTrace(); } > } > > public ProcessPdf() > { } > > public void getMataData() throws Exception { > BodyContentHandler handler = new BodyContentHandler(-1); > > Metadata metadata = new Metadata(); > try (FileInputStream inputData = new > FileInputStream(inputFile.toString())) > { 
TikaConfig config = new > TikaConfig("/home/ubuntu/testdirs/testdir_pdf/tika-config.xml"); > Parser autoDetectParser = new AutoDetectParser(config); > ParseContext context = new ParseContext(); > context.set(TikaConfig.class, config); > autoDetectParser.parse(inputData, handler, metadata, context); } > > String content = handler.toString(); > } > > public void extract() throws Exception { > TikaConfig config = new > TikaConfig("/home/ubuntu/testdirs/testdir_pdf/tika-config.xml"); > ProcessPdf.FileEmbeddedDocumentExtractor > fileEmbeddedDocumentExtractor = new > ProcessPdf.FileEmbeddedDocumentExtractor(); > > parser = new AutoDetectParser(config); > context = new ParseContext(); > context.set(Parser.class, parser); > context.set(TikaConfig.class, config); > context.set(EmbeddedDocumentExtractor.class, > fileEmbeddedDocumentExtractor); > > URL url = inputFile.toUri().toURL(); > Metadata metadata = new Metadata(); > try (InputStream input = TikaInputStream.get(url, metadata)) > { ContentHandler handler = new DefaultHandler(); > parser.parse(input, handler, metadata, context); } > } > > private class
[jira] [Commented] (TIKA-4218) Run regression tests to support 2.9.2 release
[ https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830954#comment-17830954 ] Tilman Hausherr commented on TIKA-4218: --- 6FOMNUPGPA6IG66Z4NIUEQIVOR5ON46Q (an MP4 file) has a loss of metadata (bierenbach: 2 | earlier: 2 | https://www.facebook.com/speedlinecablecam: 2 | https://www.speedline-cablecam.com: 2 | in: 2 | of: 2 | the: 2 | this: 2 | woods: 2 | year: 2) EEXR753OKDGYAIXL36PZ2EGYPN477SZU and a few other files have one word in TOP_10_MORE_IN_A which reappears in TOP_10_MORE_IN_B but with "oebps". Here, "secretary" becomes "secretaryoebps". I don't know if this is a bug or not. > Run regression tests to support 2.9.2 release > - > > Key: TIKA-4218 > URL: https://issues.apache.org/jira/browse/TIKA-4218 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > Attachments: 2.9.1-876503.pdf.json, 2.9.2-876503.pdf.json > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4218) Run regression tests to support 2.9.2 release
[ https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830604#comment-17830604 ] Tilman Hausherr commented on TIKA-4218: --- To be honest I didn't look further, because these problems affected too many files. Yes please rerun the test so that whatever remains would stick out. > Run regression tests to support 2.9.2 release > - > > Key: TIKA-4218 > URL: https://issues.apache.org/jira/browse/TIKA-4218 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > Attachments: 2.9.1-876503.pdf.json, 2.9.2-876503.pdf.json > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (TIKA-4171) Tika server only returns last value for PDFs that have multiple of the same key
[ https://issues.apache.org/jira/browse/TIKA-4171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830110#comment-17830110 ] Tilman Hausherr edited comment on TIKA-4171 at 3/23/24 5:50 PM: We have a regression with the file [^876503.pdf] in the XFAExtractor class. What happens is that {{displayFieldName}} is now lost if {{fieldValues}} is empty. Because of that, the text "Enter the full name of the conveying party or parties" is missing for the field "conname1". I'm not saying that this is wrong, I just wonder if this is intended. was (Author: tilman): We have a regression with the file [^876503.pdf] in the XFAExtractor class. What happens is that {{displayFieldName}} is now lost if {{fieldValues}} is empty. Because of that, the text "Enter the full name of the conveying party or parties" is missing for field the "conname1". I'm not saying that this is wrong, I just wonder if this is intended. > Tika server only returns last value for PDFs that have multiple of the same > key > --- > > Key: TIKA-4171 > URL: https://issues.apache.org/jira/browse/TIKA-4171 > Project: Tika > Issue Type: Bug > Components: tika-server >Reporter: Cassandra Xia >Priority: Major > Fix For: 3.0.0-BETA, 2.9.2 > > Attachments: 20230801-5207_QF20-270 East River Solar Form 556 recert > FINAL.pdf, 876503.pdf, example-output.txt, screenshot.png, > testPDF_XFA_govdocs1_258578.pdf.html > > > Thanks for the great work on Tika server, it is the only OSS that can handle > Adobe's protected form format that FERC uses. > One problem that I'm hitting is that the FERC form that I am parsing has > multiple values for the same key name, e.g. in the screenshot below line 1-7 > all have the same key name. When Tika Server parses this PDF, it only returns > the value in row 7 (losing the previous 6 values). > My hunch is that somewhere in Tika Server, the values are getting stored in > some dictionary object, so the final value is the only survivor. 
Would it be > possible to return the extra values as a list from Tika Server? > Example PDF attached - thank you for taking a look! > !https://mail.google.com/mail/u/0?ui=2=ee87dc4bd1=0.0.7=msg-f:1782641700487887488=18bd372e8760fa80=fimg=ip=s0-l75-ft=ANGjdJ9qEkw6kZ9yBDfMBOUuvFB1Tk8Pti0rRvReEq-eWUoJQxLA6rZ0TQvWCsKUySaDPjjrSi-IiyKseDYpFGzF44A3iSaFw9sOanoBdFMNEZciDnaGhsUFvLSIH_0=emb=ii_lmdun7ff6! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4171) Tika server only returns last value for PDFs that have multiple of the same key
[ https://issues.apache.org/jira/browse/TIKA-4171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4171: -- Attachment: testPDF_XFA_govdocs1_258578.pdf.html > Tika server only returns last value for PDFs that have multiple of the same > key > --- > > Key: TIKA-4171 > URL: https://issues.apache.org/jira/browse/TIKA-4171 > Project: Tika > Issue Type: Bug > Components: tika-server >Reporter: Cassandra Xia >Priority: Major > Fix For: 3.0.0-BETA, 2.9.2 > > Attachments: 20230801-5207_QF20-270 East River Solar Form 556 recert > FINAL.pdf, 876503.pdf, example-output.txt, screenshot.png, > testPDF_XFA_govdocs1_258578.pdf.html > > > Thanks for the great work on Tika server, it is the only OSS that can handle > Adobe's protected form format that FERC uses. > One problem that I'm hitting is that the FERC form that I am parsing has > multiple values for the same key name, e.g. in the screenshot below line 1-7 > all have the same key name. When Tika Server parses this PDF, it only returns > the value in row 7 (losing the previous 6 values). > My hunch is that somewhere in Tika Server, the values are getting stored in > some dictionary object, so the final value is the only survivor. Would it be > possible to return the extra values as a list from Tika Server? > Example PDF attached - thank you for taking a look! > !https://mail.google.com/mail/u/0?ui=2=ee87dc4bd1=0.0.7=msg-f:1782641700487887488=18bd372e8760fa80=fimg=ip=s0-l75-ft=ANGjdJ9qEkw6kZ9yBDfMBOUuvFB1Tk8Pti0rRvReEq-eWUoJQxLA6rZ0TQvWCsKUySaDPjjrSi-IiyKseDYpFGzF44A3iSaFw9sOanoBdFMNEZciDnaGhsUFvLSIH_0=emb=ii_lmdun7ff6! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4171) Tika server only returns last value for PDFs that have multiple of the same key
[ https://issues.apache.org/jira/browse/TIKA-4171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830113#comment-17830113 ] Tilman Hausherr commented on TIKA-4171: --- Proposed change: add these 3 lines before the last one in this code segment {code:java} if (fieldValues.length == 0) { fieldValues = new String[]{""}; } for (String fieldValue : fieldValues) { {code} This is the result of the file from testXFAExtractionBasic (testPDF_XFA_govdocs1_258578.pdf) which now has 27 fields instead of 24: [^testPDF_XFA_govdocs1_258578.pdf.html] > Tika server only returns last value for PDFs that have multiple of the same > key > --- > > Key: TIKA-4171 > URL: https://issues.apache.org/jira/browse/TIKA-4171 > Project: Tika > Issue Type: Bug > Components: tika-server >Reporter: Cassandra Xia >Priority: Major > Fix For: 3.0.0-BETA, 2.9.2 > > Attachments: 20230801-5207_QF20-270 East River Solar Form 556 recert > FINAL.pdf, 876503.pdf, example-output.txt, screenshot.png, > testPDF_XFA_govdocs1_258578.pdf.html > > > Thanks for the great work on Tika server, it is the only OSS that can handle > Adobe's protected form format that FERC uses. > One problem that I'm hitting is that the FERC form that I am parsing has > multiple values for the same key name, e.g. in the screenshot below line 1-7 > all have the same key name. When Tika Server parses this PDF, it only returns > the value in row 7 (losing the previous 6 values). > My hunch is that somewhere in Tika Server, the values are getting stored in > some dictionary object, so the final value is the only survivor. Would it be > possible to return the extra values as a list from Tika Server? > Example PDF attached - thank you for taking a look! > !https://mail.google.com/mail/u/0?ui=2=ee87dc4bd1=0.0.7=msg-f:1782641700487887488=18bd372e8760fa80=fimg=ip=s0-l75-ft=ANGjdJ9qEkw6kZ9yBDfMBOUuvFB1Tk8Pti0rRvReEq-eWUoJQxLA6rZ0TQvWCsKUySaDPjjrSi-IiyKseDYpFGzF44A3iSaFw9sOanoBdFMNEZciDnaGhsUFvLSIH_0=emb=ii_lmdun7ff6! 
-- This message was sent by Atlassian Jira (v8.20.10#820010)
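A minimal sketch of why the three proposed lines restore the missing text, assuming the surrounding loop emits one entry per field value (the class name, the {{render}} helper and the output format are hypothetical and are not taken from XFAExtractor). With an empty array the loop body never runs, so the display name is never written; substituting a single empty string makes the loop run exactly once.

```java
import java.util.ArrayList;
import java.util.List;

public class EmptyFieldValuesSketch {

    // Hypothetical stand-in for the loop in the proposal: one output line per field value.
    static List<String> render(String displayFieldName, String[] fieldValues) {
        // The three proposed lines: fall back to a single empty value
        // so that the display name is still emitted once.
        if (fieldValues.length == 0) {
            fieldValues = new String[]{""};
        }
        List<String> lines = new ArrayList<>();
        for (String fieldValue : fieldValues) {
            lines.add(displayFieldName + ": " + fieldValue);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Empty values: the label is kept, with an empty value after it.
        System.out.println(render("Enter the full name of the conveying party or parties", new String[0]));
        // Several values: one line per value, as before.
        System.out.println(render("conname1", new String[]{"a", "b"}));
    }
}
```

Fields that do have values are unaffected by the fallback; only the empty case changes, which matches the three additional fields reported for the test file.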
[jira] [Commented] (TIKA-4171) Tika server only returns last value for PDFs that have multiple of the same key
[ https://issues.apache.org/jira/browse/TIKA-4171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830110#comment-17830110 ] Tilman Hausherr commented on TIKA-4171: --- We have a regression with the file [^876503.pdf] in the XFAExtractor class. What happens is that {{displayFieldName}} is now lost if {{fieldValues}} is empty. Because of that, the text "Enter the full name of the conveying party or parties" is missing for the field "conname1". I'm not saying that this is wrong, I just wonder if this is intended. > Tika server only returns last value for PDFs that have multiple of the same > key > --- > > Key: TIKA-4171 > URL: https://issues.apache.org/jira/browse/TIKA-4171 > Project: Tika > Issue Type: Bug > Components: tika-server >Reporter: Cassandra Xia >Priority: Major > Fix For: 3.0.0-BETA, 2.9.2 > > Attachments: 20230801-5207_QF20-270 East River Solar Form 556 recert > FINAL.pdf, 876503.pdf, example-output.txt, screenshot.png > > > Thanks for the great work on Tika server, it is the only OSS that can handle > Adobe's protected form format that FERC uses. > One problem that I'm hitting is that the FERC form that I am parsing has > multiple values for the same key name, e.g. in the screenshot below line 1-7 > all have the same key name. When Tika Server parses this PDF, it only returns > the value in row 7 (losing the previous 6 values). > My hunch is that somewhere in Tika Server, the values are getting stored in > some dictionary object, so the final value is the only survivor. Would it be > possible to return the extra values as a list from Tika Server? > Example PDF attached - thank you for taking a look! > !https://mail.google.com/mail/u/0?ui=2=ee87dc4bd1=0.0.7=msg-f:1782641700487887488=18bd372e8760fa80=fimg=ip=s0-l75-ft=ANGjdJ9qEkw6kZ9yBDfMBOUuvFB1Tk8Pti0rRvReEq-eWUoJQxLA6rZ0TQvWCsKUySaDPjjrSi-IiyKseDYpFGzF44A3iSaFw9sOanoBdFMNEZciDnaGhsUFvLSIH_0=emb=ii_lmdun7ff6! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4171) Tika server only returns last value for PDFs that have multiple of the same key
[ https://issues.apache.org/jira/browse/TIKA-4171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4171: -- Attachment: 876503.pdf > Tika server only returns last value for PDFs that have multiple of the same > key > --- > > Key: TIKA-4171 > URL: https://issues.apache.org/jira/browse/TIKA-4171 > Project: Tika > Issue Type: Bug > Components: tika-server >Reporter: Cassandra Xia >Priority: Major > Fix For: 3.0.0-BETA, 2.9.2 > > Attachments: 20230801-5207_QF20-270 East River Solar Form 556 recert > FINAL.pdf, 876503.pdf, example-output.txt, screenshot.png > > > Thanks for the great work on Tika server, it is the only OSS that can handle > Adobe's protected form format that FERC uses. > One problem that I'm hitting is that the FERC form that I am parsing has > multiple values for the same key name, e.g. in the screenshot below line 1-7 > all have the same key name. When Tika Server parses this PDF, it only returns > the value in row 7 (losing the previous 6 values). > My hunch is that somewhere in Tika Server, the values are getting stored in > some dictionary object, so the final value is the only survivor. Would it be > possible to return the extra values as a list from Tika Server? > Example PDF attached - thank you for taking a look! > !https://mail.google.com/mail/u/0?ui=2=ee87dc4bd1=0.0.7=msg-f:1782641700487887488=18bd372e8760fa80=fimg=ip=s0-l75-ft=ANGjdJ9qEkw6kZ9yBDfMBOUuvFB1Tk8Pti0rRvReEq-eWUoJQxLA6rZ0TQvWCsKUySaDPjjrSi-IiyKseDYpFGzF44A3iSaFw9sOanoBdFMNEZciDnaGhsUFvLSIH_0=emb=ii_lmdun7ff6! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4218) Run regression tests to support 2.9.2 release
[ https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830105#comment-17830105 ] Tilman Hausherr commented on TIKA-4218: --- Follow up in TIKA-4171 > Run regression tests to support 2.9.2 release > - > > Key: TIKA-4218 > URL: https://issues.apache.org/jira/browse/TIKA-4218 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > Attachments: 2.9.1-876503.pdf.json, 2.9.2-876503.pdf.json > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Reopened] (TIKA-4171) Tika server only returns last value for PDFs that have multiple of the same key
[ https://issues.apache.org/jira/browse/TIKA-4171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr reopened TIKA-4171: --- > Tika server only returns last value for PDFs that have multiple of the same > key > --- > > Key: TIKA-4171 > URL: https://issues.apache.org/jira/browse/TIKA-4171 > Project: Tika > Issue Type: Bug > Components: tika-server >Reporter: Cassandra Xia >Priority: Major > Fix For: 3.0.0-BETA, 2.9.2 > > Attachments: 20230801-5207_QF20-270 East River Solar Form 556 recert > FINAL.pdf, example-output.txt, screenshot.png > > > Thanks for the great work on Tika server, it is the only OSS that can handle > Adobe's protected form format that FERC uses. > One problem that I'm hitting is that the FERC form that I am parsing has > multiple values for the same key name, e.g. in the screenshot below line 1-7 > all have the same key name. When Tika Server parses this PDF, it only returns > the value in row 7 (losing the previous 6 values). > My hunch is that somewhere in Tika Server, the values are getting stored in > some dictionary object, so the final value is the only survivor. Would it be > possible to return the extra values as a list from Tika Server? > Example PDF attached - thank you for taking a look! > !https://mail.google.com/mail/u/0?ui=2=ee87dc4bd1=0.0.7=msg-f:1782641700487887488=18bd372e8760fa80=fimg=ip=s0-l75-ft=ANGjdJ9qEkw6kZ9yBDfMBOUuvFB1Tk8Pti0rRvReEq-eWUoJQxLA6rZ0TQvWCsKUySaDPjjrSi-IiyKseDYpFGzF44A3iSaFw9sOanoBdFMNEZciDnaGhsUFvLSIH_0=emb=ii_lmdun7ff6! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4218) Run regression tests to support 2.9.2 release
[ https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830097#comment-17830097 ] Tilman Hausherr commented on TIKA-4218: --- Confirmed: after I reverted just that change, the text view is longer and ends with "Enter the total number of pages being submitted, including cover sheets, attachments, and documents:", whereas in the current 2.9.2 it ends with "disclosure to GSA shall not be used to make determinations about individuals." > Run regression tests to support 2.9.2 release > - > > Key: TIKA-4218 > URL: https://issues.apache.org/jira/browse/TIKA-4218 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > Attachments: 2.9.1-876503.pdf.json, 2.9.2-876503.pdf.json > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (TIKA-4218) Run regression tests to support 2.9.2 release
[ https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830094#comment-17830094 ] Tilman Hausherr edited comment on TIKA-4218 at 3/23/24 3:59 PM: Oops, or it's part of XFA, I just found it too. Maybe related to the changes in TIKA-4171 in XFAExtractor? was (Author: tilman): Oops, or it's part of XFA, I just found it too. > Run regression tests to support 2.9.2 release > - > > Key: TIKA-4218 > URL: https://issues.apache.org/jira/browse/TIKA-4218 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > Attachments: 2.9.1-876503.pdf.json, 2.9.2-876503.pdf.json > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4218) Run regression tests to support 2.9.2 release
[ https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830094#comment-17830094 ] Tilman Hausherr commented on TIKA-4218: --- Oops, or it's part of XFA, I just found it too. > Run regression tests to support 2.9.2 release > - > > Key: TIKA-4218 > URL: https://issues.apache.org/jira/browse/TIKA-4218 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > Attachments: 2.9.1-876503.pdf.json, 2.9.2-876503.pdf.json > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4218) Run regression tests to support 2.9.2 release
[ https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830093#comment-17830093 ] Tilman Hausherr commented on TIKA-4218: --- I found one difference: "Enter the full name of the conveying party or parties" is in 2.9.1 but not in 2.9.2, and in 2.9.1 it appears directly after the main text. This text is in the first field (below "Name of conveying party(ies):") as the /TU entry, which one can get with {{getAlternateFieldName()}}. PDFBox and the PDF specification consider this to be an "alternate field name", and Adobe Reader displays it as a popup when the mouse hovers there. > Run regression tests to support 2.9.2 release > - > > Key: TIKA-4218 > URL: https://issues.apache.org/jira/browse/TIKA-4218 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > Attachments: 2.9.1-876503.pdf.json, 2.9.2-876503.pdf.json > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4218) Run regression tests to support 2.9.2 release
[ https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830079#comment-17830079 ] Tilman Hausherr commented on TIKA-4218: --- The word "party" appears 36 times in the JSON file, 18 times in my text extraction, but 62 times in the CSV file in the TOP_N_TOKENS_A row. The doubling in the JSON file is because of "xfa_content", but I don't understand the "62". Thanks for mentioning the new list (I probably missed it), I'll adjust my scripts and use them the next time. > Run regression tests to support 2.9.2 release > - > > Key: TIKA-4218 > URL: https://issues.apache.org/jira/browse/TIKA-4218 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
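A rough sketch of how such a count could be cross-checked by hand, assuming a simple case-insensitive whole-word match (the class and method names are hypothetical, and the tokenization behind TOP_N_TOKENS_A may well differ, which could be one source of the differing numbers):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenCountSketch {

    // Counts case-insensitive whole-word occurrences of token in text.
    // Assumption: word boundaries as defined by java.util.regex (\b).
    static int countToken(String text, String token) {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(token) + "\\b",
                Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String sample = "Name of conveying party(ies): the Party and other parties";
        System.out.println(countToken(sample, "party"));
    }
}
```

Running the same counter over the extracted text and over each JSON field separately would show which field contributes the extra occurrences.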
[jira] (TIKA-4218) Run regression tests to support 2.9.2 release
[ https://issues.apache.org/jira/browse/TIKA-4218 ] Tilman Hausherr deleted comment on TIKA-4218: --- was (Author: tilman): There are also improvements not in my own test results, e.g. the "FOP" pdf file. Either something went wrong with my test, or with yours. > Run regression tests to support 2.9.2 release > - > > Key: TIKA-4218 > URL: https://issues.apache.org/jira/browse/TIKA-4218 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4218) Run regression tests to support 2.9.2 release
[ https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830071#comment-17830071 ] Tilman Hausherr commented on TIKA-4218: --- There are also improvements not in my own test results, e.g. the "FOP" pdf file. Either something went wrong with my test, or with yours. > Run regression tests to support 2.9.2 release > - > > Key: TIKA-4218 > URL: https://issues.apache.org/jira/browse/TIKA-4218 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4218) Run regression tests to support 2.9.2 release
[ https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830069#comment-17830069 ] Tilman Hausherr commented on TIKA-4218: --- Weird indeed, 876503.pdf didn't appear in the PDFBox regression tests: https://home.snafu.de/tilman/tmp/reports_pdfbox_2.0.30_vs_2.0.31.tar.xz I think we did make one (harmless) last-minute change after the regression tests, so I just ran ExtractText with both versions and saw no difference. > Run regression tests to support 2.9.2 release > - > > Key: TIKA-4218 > URL: https://issues.apache.org/jira/browse/TIKA-4218 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4206) Variation on Zip Bomb
[ https://issues.apache.org/jira/browse/TIKA-4206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4206: -- Description: I see TIKA-216 which aims to prevent Zip bombs, but I'm seeing what looks like a bomb on 3.0.0 Beta. The zip bomb is a mime encoded attachment to an email, which may be why it isn't throwing an error. On my machine attempting to extract text (-J) the process continues infinitely (or at least 10 hours, which is when I stopped it). The actual file is embedded in a .gz file inside of an ARC file. However, extracting the attached .txt file produces the same error. The original ARC file is at: [https://eotarchive.s3.amazonaws.com/crawl-data/EOT-2004/segments/NARA-004/warc/NARA-PEOT-2004-2004065521-04317-crawling-fast-c_NARA-PEOT-2004-2004101148-00173-crawling008.archive.org.arc.gz] was: I see Tika-216 which aims to prevent Zip bombs, but I'm seeing what looks like a bomb on 3.0.0 Beta. The zip bomb is a mime encoded attachment to an email, which may be why it isn't throwing an error. On my machine attempting to extract text (-J) the process continues infinitely (or at least 10 hours, which is when I stopped it). The actual file is embedded in a .gz file inside of an ARC file. However, extracting the attached .txt file produces the same error. The original ARC file is at: https://eotarchive.s3.amazonaws.com/crawl-data/EOT-2004/segments/NARA-004/warc/NARA-PEOT-2004-2004065521-04317-crawling-fast-c_NARA-PEOT-2004-2004101148-00173-crawling008.archive.org.arc.gz > Variation on Zip Bomb > - > > Key: TIKA-4206 > URL: https://issues.apache.org/jira/browse/TIKA-4206 > Project: Tika > Issue Type: Bug >Affects Versions: 3.0.0-BETA >Reporter: Gregory Lepore >Priority: Major > Attachments: sample-42-mail-bomb.txt > > > I see TIKA-216 which aims to prevent Zip bombs, but I'm seeing what looks > like a bomb on 3.0.0 Beta. 
The zip bomb is a mime encoded attachment to an > email, which may be why it isn't throwing an error. > On my machine attempting to extract text (-J) the process continues > infinitely (or at least 10 hours, which is when I stopped it). > The actual file is embedded in a .gz file inside of an ARC file. However, > extracting the attached .txt file produces the same error. > > The original ARC file is at: > [https://eotarchive.s3.amazonaws.com/crawl-data/EOT-2004/segments/NARA-004/warc/NARA-PEOT-2004-2004065521-04317-crawling-fast-c_NARA-PEOT-2004-2004101148-00173-crawling008.archive.org.arc.gz] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (TIKA-4214) Update apache compress in tika to 1.26+ for CVE-2024-26308.
[ https://issues.apache.org/jira/browse/TIKA-4214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr closed TIKA-4214. - Resolution: Duplicate Duplicate of TIKA-4199. > Update apache compress in tika to 1.26+ for CVE-2024-26308. > --- > > Key: TIKA-4214 > URL: https://issues.apache.org/jira/browse/TIKA-4214 > Project: Tika > Issue Type: Bug >Reporter: Dhoka Pramod >Priority: Major > > Need fix for [CVE-2024-26308|https://nvd.nist.gov/vuln/detail/CVE-2024-26308] > in tika latest version. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4199) commons-compress 1.26.0 breaks Apache Tika 2.9.1
[ https://issues.apache.org/jira/browse/TIKA-4199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17826996#comment-17826996 ] Tilman Hausherr commented on TIKA-4199: --- The original error you reported wasn't really a bug in Commons Compress, but rather a change that caused more bytes to be read than Tika expected, see my first comment in COMPRESS-661. It resulted in several fixes in Tika. > commons-compress 1.26.0 breaks Apache Tika 2.9.1 > > > Key: TIKA-4199 > URL: https://issues.apache.org/jira/browse/TIKA-4199 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 2.9.1 >Reporter: Alexander Veit >Assignee: Tilman Hausherr >Priority: Major > Fix For: 2.9.2, 3.0.0 > > > An update to commons-compress 1.26.0 to fix CVE-2024-25710 and CVE-2024-26308 > breaks Tika. > > For more information see https://issues.apache.org/jira/browse/COMPRESS-661. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] (TIKA-4166) dependency updates for Tika 3.0
[ https://issues.apache.org/jira/browse/TIKA-4166 ] Tilman Hausherr deleted comment on TIKA-4166: --- was (Author: tilman): I've reverted it and will investigate / fix this later. Seems to be a problem with angus-activation. > dependency updates for Tika 3.0 > --- > > Key: TIKA-4166 > URL: https://issues.apache.org/jira/browse/TIKA-4166 > Project: Tika > Issue Type: Task > Components: build >Reporter: Tilman Hausherr >Priority: Minor > Fix For: 3.0.0-BETA > > > Separate ticket for updates for 3.0, especially those not found by dependabot. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4166) dependency updates for Tika 3.0
[ https://issues.apache.org/jira/browse/TIKA-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824953#comment-17824953 ] Tilman Hausherr commented on TIKA-4166: --- I've reverted it and will investigate / fix this later. Seems to be a problem with angus-activation. > dependency updates for Tika 3.0 > --- > > Key: TIKA-4166 > URL: https://issues.apache.org/jira/browse/TIKA-4166 > Project: Tika > Issue Type: Task > Components: build >Reporter: Tilman Hausherr >Priority: Minor > Fix For: 3.0.0-BETA > > > Separate ticket for updates for 3.0, especially those not found by dependabot. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (TIKA-4199) commons-compress 1.26.0 breaks Apache Tika 2.9.1
[ https://issues.apache.org/jira/browse/TIKA-4199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr resolved TIKA-4199. --- Resolution: Fixed Commons-Compress has been updated to 1.26.1, I have reverted the workaround and a change that wasn't helpful. > commons-compress 1.26.0 breaks Apache Tika 2.9.1 > > > Key: TIKA-4199 > URL: https://issues.apache.org/jira/browse/TIKA-4199 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 2.9.1 >Reporter: Alexander Veit >Assignee: Tilman Hausherr >Priority: Major > Fix For: 2.9.2, 3.0.0 > > > An update to commons-compress 1.26.0 to fix CVE-2024-25710 and CVE-2024-26308 > breaks Tika. > > For more information see https://issues.apache.org/jira/browse/COMPRESS-661. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (TIKA-4199) commons-compress 1.26.0 breaks Apache Tika 2.9.1
[ https://issues.apache.org/jira/browse/TIKA-4199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr reassigned TIKA-4199: - Assignee: Tilman Hausherr > commons-compress 1.26.0 breaks Apache Tika 2.9.1 > > > Key: TIKA-4199 > URL: https://issues.apache.org/jira/browse/TIKA-4199 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 2.9.1 >Reporter: Alexander Veit >Assignee: Tilman Hausherr >Priority: Major > Fix For: 2.9.2, 3.0.0 > > > An update to commons-compress 1.26.0 to fix CVE-2024-25710 and CVE-2024-26308 > breaks Tika. > > For more information see https://issues.apache.org/jira/browse/COMPRESS-661. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4203) Add @deprecated annotation where needed
[ https://issues.apache.org/jira/browse/TIKA-4203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4203: -- Fix Version/s: 3.0.0 > Add @deprecated annotation where needed > --- > > Key: TIKA-4203 > URL: https://issues.apache.org/jira/browse/TIKA-4203 > Project: Tika > Issue Type: Task >Affects Versions: 3.0.0 >Reporter: Tilman Hausherr >Assignee: Tilman Hausherr >Priority: Trivial > Fix For: 3.0.0 > > > This is just to prevent my IDE from stopping from scrolling during the build. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4203) Add @deprecated annotation where needed
[ https://issues.apache.org/jira/browse/TIKA-4203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4203: -- Affects Version/s: 3.0.0 > Add @deprecated annotation where needed > --- > > Key: TIKA-4203 > URL: https://issues.apache.org/jira/browse/TIKA-4203 > Project: Tika > Issue Type: Task >Affects Versions: 3.0.0 >Reporter: Tilman Hausherr >Assignee: Tilman Hausherr >Priority: Trivial > > This is just to prevent my IDE from stopping from scrolling during the build. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (TIKA-4203) Add @deprecated annotation where needed
[ https://issues.apache.org/jira/browse/TIKA-4203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr resolved TIKA-4203. --- Resolution: Fixed > Add @deprecated annotation where needed > --- > > Key: TIKA-4203 > URL: https://issues.apache.org/jira/browse/TIKA-4203 > Project: Tika > Issue Type: Task >Affects Versions: 3.0.0 >Reporter: Tilman Hausherr >Assignee: Tilman Hausherr >Priority: Trivial > Fix For: 3.0.0 > > > This is just to prevent my IDE from stopping from scrolling during the build. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-4203) Add @deprecated annotation where needed
Tilman Hausherr created TIKA-4203: - Summary: Add @deprecated annotation where needed Key: TIKA-4203 URL: https://issues.apache.org/jira/browse/TIKA-4203 Project: Tika Issue Type: Task Reporter: Tilman Hausherr Assignee: Tilman Hausherr This is just to prevent my IDE from stopping from scrolling during the build. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4199) commons-compress 1.26.0 breaks Apache Tika 2.9.1
[ https://issues.apache.org/jira/browse/TIKA-4199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4199: -- Fix Version/s: 2.9.2 3.0.0 > commons-compress 1.26.0 breaks Apache Tika 2.9.1 > > > Key: TIKA-4199 > URL: https://issues.apache.org/jira/browse/TIKA-4199 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 2.9.1 >Reporter: Alexander Veit >Priority: Major > Fix For: 2.9.2, 3.0.0 > > > An update to commons-compress 1.26.0 to fix CVE-2024-25710 and CVE-2024-26308 > breaks Tika. > > For more information see https://issues.apache.org/jira/browse/COMPRESS-661. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4199) commons-compress 1.26.0 breaks Apache Tika 2.9.1
[ https://issues.apache.org/jira/browse/TIKA-4199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17818937#comment-17818937 ] Tilman Hausherr commented on TIKA-4199: --- I tried another solution {code:java} if (archive.markSupported()) { archive = new ArchiveInputStreamWrapper(archive); } {code} which also works. The wrapper delegates everything except markSupported. I'll wait a few days to see whether the commons-compress people fix this. If not, I'll commit that solution. > commons-compress 1.26.0 breaks Apache Tika 2.9.1 > > > Key: TIKA-4199 > URL: https://issues.apache.org/jira/browse/TIKA-4199 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 2.9.1 >Reporter: Alexander Veit >Priority: Major > > An update to commons-compress 1.26.0 to fix CVE-2024-25710 and CVE-2024-26308 > breaks Tika. > > For more information see https://issues.apache.org/jira/browse/COMPRESS-661. -- This message was sent by Atlassian Jira (v8.20.10#820010)
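For readers unfamiliar with the pattern, a wrapper that delegates everything except markSupported could look roughly like the sketch below. This is only an illustration, not the code from the Tika pull request: the class name NoMarkInputStream is hypothetical, and it assumes that extending java.io.FilterInputStream is an acceptable way to get the delegation.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/**
 * Hypothetical wrapper (name invented for this sketch): FilterInputStream
 * forwards every call to the wrapped stream, and only markSupported()
 * is overridden so that callers believe mark/reset is unavailable.
 */
public class NoMarkInputStream extends FilterInputStream {

    public NoMarkInputStream(InputStream in) {
        super(in);
    }

    @Override
    public boolean markSupported() {
        // The single non-delegated method; mark() and reset() still forward.
        return false;
    }

    public static void main(String[] args) throws IOException {
        InputStream raw = new ByteArrayInputStream(new byte[] {1, 2, 3});
        // Same shape as the condition quoted in the comment above.
        InputStream archive = raw.markSupported() ? new NoMarkInputStream(raw) : raw;
        System.out.println("markSupported: " + archive.markSupported());
        System.out.println("first byte: " + archive.read());
    }
}
```

Code that checks markSupported() before calling mark/reset will then take its non-mark path, which is presumably what makes the workaround effective.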
[jira] [Commented] (TIKA-4201) Add hard limit to stream reading in IWorksParser#detectType
[ https://issues.apache.org/jira/browse/TIKA-4201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17818873#comment-17818873 ] Tilman Hausherr commented on TIKA-4201: --- Yeah, makes sense. > Add hard limit to stream reading in IWorksParser#detectType > --- > > Key: TIKA-4201 > URL: https://issues.apache.org/jira/browse/TIKA-4201 > Project: Tika > Issue Type: Bug >Reporter: Tim Allison >Priority: Minor > > TIKA-4199 showed us that we had been relying on hope that the detector would > only read a limited number of bytes when detecting in > IWorksParser#detectType. We should cache the first 1096 bytes, do the > detection, and then reset the wrapped archive stream. > We should not rely on hope. -- This message was sent by Atlassian Jira (v8.20.10#820010)
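The "cache the first bytes, detect, then reset" idea from the TIKA-4201 description can be sketched with plain java.io as follows. This is a minimal illustration under assumptions, not the actual IWorksParser change: the class and method names (BoundedPeek, peek) are made up, and the 1096 limit is simply the number quoted in the ticket.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

public class BoundedPeek {

    /**
     * Reads at most {@code limit} bytes from a mark-supporting stream and
     * rewinds it afterwards, so detection can never consume more than the limit.
     */
    public static byte[] peek(InputStream in, int limit) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(limit);
        byte[] buf = new byte[limit];
        int total = 0;
        try {
            while (total < limit) {
                int n = in.read(buf, total, limit - total);
                if (n < 0) {
                    break; // end of stream before the limit was reached
                }
                total += n;
            }
        } finally {
            in.reset(); // hand the stream back at its original position
        }
        return Arrays.copyOf(buf, total);
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(new byte[5000]));
        byte[] head = peek(in, 1096); // 1096 is the figure from the ticket
        System.out.println("cached bytes: " + head.length);
    }
}
```

A detector would then run on the returned byte array (for example through a ByteArrayInputStream) instead of on the archive stream itself, so the hard limit holds regardless of how much the detector tries to read.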
[jira] [Comment Edited] (TIKA-4199) commons-compress 1.26.0 breaks Apache Tika 2.9.1
[ https://issues.apache.org/jira/browse/TIKA-4199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17818867#comment-17818867 ] Tilman Hausherr edited comment on TIKA-4199 at 2/20/24 3:37 PM: {quote}I'm not declaring this a problem with commons-compress! {quote} My bet was 51% it's with Tika but from the latest test code you inspired me to write in COMPRESS-661, it might be them or BufferedInputStream itself, or a misunderstanding of how BufferedInputStream works. I also found another incomplete delegate class (BoundedInputStream), I'll complete that one too. was (Author: tilman): {quote}I'm not declaring this a problem with commons-compress! {quote} My bet was 51% it's with Tika but from the latest test code you inspired me to write in COMPRESS-661, it might be them or BufferedInputStream itself. I also found another incomplete delegate class (BoundedInputStream), I'll complete that one too. > commons-compress 1.26.0 breaks Apache Tika 2.9.1 > > > Key: TIKA-4199 > URL: https://issues.apache.org/jira/browse/TIKA-4199 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 2.9.1 >Reporter: Alexander Veit >Priority: Major > > An update to commons-compress 1.26.0 to fix CVE-2024-25710 and CVE-2024-26308 > breaks Tika. > > For more information see https://issues.apache.org/jira/browse/COMPRESS-661. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4199) commons-compress 1.26.0 breaks Apache Tika 2.9.1
[ https://issues.apache.org/jira/browse/TIKA-4199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17818867#comment-17818867 ] Tilman Hausherr commented on TIKA-4199: --- {quote}I'm not declaring this a problem with commons-compress! {quote} My bet was 51% it's with Tika but from the latest test code you inspired me to write in COMPRESS-661, it might be them or BufferedInputStream itself. I also found another incomplete delegate class (BoundedInputStream), I'll complete that one too. > commons-compress 1.26.0 breaks Apache Tika 2.9.1 > > > Key: TIKA-4199 > URL: https://issues.apache.org/jira/browse/TIKA-4199 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 2.9.1 >Reporter: Alexander Veit >Priority: Major > > An update to commons-compress 1.26.0 to fix CVE-2024-25710 and CVE-2024-26308 > breaks Tika. > > For more information see https://issues.apache.org/jira/browse/COMPRESS-661. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4199) commons-compress 1.26.0 breaks Apache Tika 2.9.1
[ https://issues.apache.org/jira/browse/TIKA-4199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17818823#comment-17818823 ] Tilman Hausherr commented on TIKA-4199: --- After merging I discovered that the SevenZWrapper class is incomplete (markSupported / mark / reset were missing, and many more). I tested reverting my one-line change, and some of the previously failing tests (e.g. the 7z tests) were now succeeding. So this kind of suggests that the cause is related to markSupported / mark / reset. If we ever find that cause, then the one-line change in {{PackageParser}} can be removed because it makes things slower. > commons-compress 1.26.0 breaks Apache Tika 2.9.1 > > > Key: TIKA-4199 > URL: https://issues.apache.org/jira/browse/TIKA-4199 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 2.9.1 >Reporter: Alexander Veit >Priority: Major > > An update to commons-compress 1.26.0 to fix CVE-2024-25710 and CVE-2024-26308 > breaks Tika. > > For more information see https://issues.apache.org/jira/browse/COMPRESS-661. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (TIKA-4200) Fix broken build after upgrade to commons-compress
[ https://issues.apache.org/jira/browse/TIKA-4200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr closed TIKA-4200. - Resolution: Duplicate Our CI is failing because of the CVE :-( Duplicate of TIKA-4199. I'm still working on it and I'm somewhat more optimistic than yesterday. > Fix broken build after upgrade to commons-compress > -- > > Key: TIKA-4200 > URL: https://issues.apache.org/jira/browse/TIKA-4200 > Project: Tika > Issue Type: Bug >Reporter: Tim Allison >Priority: Major > > https://github.com/apache/tika/actions/runs/7955214068 > Looks like builds are failing after upgrade to commons compress. Apple (zip) > files are causing problems?! We should look into this and either fix our code > or, if there's a problem with commons-compress, identify it, report it and > downgrade to the last working version. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (TIKA-4199) commons-compress 1.26.0 breaks Apache Tika 2.9.1
[ https://issues.apache.org/jira/browse/TIKA-4199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17818774#comment-17818774 ] Tilman Hausherr edited comment on TIKA-4199 at 2/20/24 11:57 AM: - I'm working on it https://github.com/apache/tika/pull/1605 was (Author: tilman): I'm working on it https://github.com/apache/pdfbox/pull/180 > commons-compress 1.26.0 breaks Apache Tika 2.9.1 > > > Key: TIKA-4199 > URL: https://issues.apache.org/jira/browse/TIKA-4199 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 2.9.1 >Reporter: Alexander Veit >Priority: Major > > An update to commons-compress 1.26.0 to fix CVE-2024-25710 and CVE-2024-26308 > breaks Tika. > > For more information see https://issues.apache.org/jira/browse/COMPRESS-661. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4199) commons-compress 1.26.0 breaks Apache Tika 2.9.1
[ https://issues.apache.org/jira/browse/TIKA-4199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17818774#comment-17818774 ] Tilman Hausherr commented on TIKA-4199: --- I'm working on it https://github.com/apache/pdfbox/pull/180 > commons-compress 1.26.0 breaks Apache Tika 2.9.1 > > > Key: TIKA-4199 > URL: https://issues.apache.org/jira/browse/TIKA-4199 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 2.9.1 >Reporter: Alexander Veit >Priority: Major > > An update to commons-compress 1.26.0 to fix CVE-2024-25710 and CVE-2024-26308 > breaks Tika. > > For more information see https://issues.apache.org/jira/browse/COMPRESS-661. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-3841) An exception occurred when parsing some word documents using tika, tika_exception
[ https://issues.apache.org/jira/browse/TIKA-3841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-3841: -- Summary: An exception occurred when parsing some word documents using tika, tika_exception (was: An exception occurred when parsing some word documents using tikatika_exception) > An exception occurred when parsing some word documents using tika, > tika_exception > - > > Key: TIKA-3841 > URL: https://issues.apache.org/jira/browse/TIKA-3841 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.24, 2.4.1, 1.28.4 > Environment: h3. Java Version > java version "1.8.0_291" > h3. OS Version > Linux localhost.localdomain 3.10.0-957.el7.x86_64 > [#1|https://github.com/elastic/elasticsearch/issues/1] SMP Thu Nov 8 23:39:32 > UTC 2018 x86_64 x86_64 x86_64 GNU/Linux >Reporter: lxz >Priority: Blocker > > { > "error": { > "root_cause": [ > { "type": "parse_exception", "reason": "Error parsing > document in field [content]" } > ], > "type": "parse_exception", > "reason": "Error parsing document in field [content]", > "caused_by": { > "type": "tika_exception", > "reason": "Unexpected RuntimeException from > org.apache.tika.parser.microsoft.OfficeParser@3b5e180a", > "caused_by": > { "type": "array_index_out_of_bounds_exception", > "reason": "351" } > } > }, > "status": 400 > } -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-3841) An exception occurred when parsing some word documents using tikatika_exception
[ https://issues.apache.org/jira/browse/TIKA-3841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-3841: -- Summary: An exception occurred when parsing some word documents using tikatika_exception (was: 使用tika解析部分word文档出现异常,tika_exception) > An exception occurred when parsing some word documents using > tikatika_exception > --- > > Key: TIKA-3841 > URL: https://issues.apache.org/jira/browse/TIKA-3841 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.24, 2.4.1, 1.28.4 > Environment: h3. Java Version > java version "1.8.0_291" > h3. OS Version > Linux localhost.localdomain 3.10.0-957.el7.x86_64 > [#1|https://github.com/elastic/elasticsearch/issues/1] SMP Thu Nov 8 23:39:32 > UTC 2018 x86_64 x86_64 x86_64 GNU/Linux >Reporter: lxz >Priority: Blocker > > { > "error": { > "root_cause": [ > { "type": "parse_exception", "reason": "Error parsing > document in field [content]" } > ], > "type": "parse_exception", > "reason": "Error parsing document in field [content]", > "caused_by": { > "type": "tika_exception", > "reason": "Unexpected RuntimeException from > org.apache.tika.parser.microsoft.OfficeParser@3b5e180a", > "caused_by": > { "type": "array_index_out_of_bounds_exception", > "reason": "351" } > } > }, > "status": 400 > } -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (TIKA-4183) Update jackson-databind jar to 2.16.0 or higher (CVE-2023-35116)
[ https://issues.apache.org/jira/browse/TIKA-4183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr closed TIKA-4183. - Resolution: Duplicate duplicate of TIKA-4162, it was done there on 17.11.2023 in a8c3879a59a4a8ccb8522518a286ebef442c0f4c, the commits are all missing for some reason. It will be in 2.9.2. No, I don't know when this will be released. > Update jackson-databind jar to 2.16.0 or higher (CVE-2023-35116) > > > Key: TIKA-4183 > URL: https://issues.apache.org/jira/browse/TIKA-4183 > Project: Tika > Issue Type: Bug >Affects Versions: 2.9.1 >Reporter: Dhoka Pramod >Priority: Major > > Latest stable tika version 2.9.1 (in tika eval app) still has > jackson-databind-2.15.2. > It needs to be updated to 2.16.0 or higher to address > [https://nvd.nist.gov/vuln/detail/CVE-2023-35116] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4162) Update to 2.9.2
[ https://issues.apache.org/jira/browse/TIKA-4162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4162: -- Fix Version/s: 2.9.2 > Update to 2.9.2 > --- > > Key: TIKA-4162 > URL: https://issues.apache.org/jira/browse/TIKA-4162 > Project: Tika > Issue Type: Task > Components: build >Affects Versions: 2.9.1 >Reporter: Tilman Hausherr >Priority: Minor > Fix For: 2.9.2 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4162) Update to 2.9.2
[ https://issues.apache.org/jira/browse/TIKA-4162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4162: -- Affects Version/s: 2.9.1 > Update to 2.9.2 > --- > > Key: TIKA-4162 > URL: https://issues.apache.org/jira/browse/TIKA-4162 > Project: Tika > Issue Type: Task > Components: build >Affects Versions: 2.9.1 >Reporter: Tilman Hausherr >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (TIKA-4172) Apple binary file incorrectly identified as text/x-sql due to filename
[ https://issues.apache.org/jira/browse/TIKA-4172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr closed TIKA-4172. - Resolution: Not A Bug > Apple binary file incorrectly identified as text/x-sql due to filename > -- > > Key: TIKA-4172 > URL: https://issues.apache.org/jira/browse/TIKA-4172 > Project: Tika > Issue Type: Bug > Components: general >Affects Versions: 2.9.1 >Reporter: martin k. >Priority: Minor > > This is related to [https://github.com/eikek/docspell/issues/2376] and > [https://github.com/eikek/docspell/issues/2403.] > Take the following Base64 encoding of a binary Apple-generated file. No idea > what it does. You can get the file by piping the following to e.g. {{base64 > -d > something.sql}} > {code:java} > ABRkMDEwMWM2Nl9teVNRTDQwLnNxbAAA > bUJJTgAA > AACCgf+/AAA= > {code} > If this file is name {{{}something.sql{}}}, then Tika will classify it as > {{{}text/x-sql{}}}, which it is not. It seems like more weight is given to > the filename (extension) than the fact that the file is binary anyway. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4173) Fix dev version in main branch
[ https://issues.apache.org/jira/browse/TIKA-4173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796450#comment-17796450 ] Tilman Hausherr commented on TIKA-4173: --- It wasn't really a problem locally, I only had to change one line in one file and was able to test all updates. > Fix dev version in main branch > -- > > Key: TIKA-4173 > URL: https://issues.apache.org/jira/browse/TIKA-4173 > Project: Tika > Issue Type: Bug >Reporter: Tim Allison >Priority: Major > Fix For: 3.0.0 > > > For some reason, and I should have caught this earlier, it looks like the > 3.0.0-SNAPSHOT version was not re-applied to the main branch after the > release. We need to fix this. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4173) Fix dev version in main branch
[ https://issues.apache.org/jira/browse/TIKA-4173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796431#comment-17796431 ] Tilman Hausherr commented on TIKA-4173: --- I noticed that it didn't have the correct version, but I thought that was something just before the release. > Fix dev version in main branch > -- > > Key: TIKA-4173 > URL: https://issues.apache.org/jira/browse/TIKA-4173 > Project: Tika > Issue Type: Bug >Reporter: Tim Allison >Priority: Major > > For some reason, and I should have caught this earlier, it looks like the > 3.0.0-SNAPSHOT version was not re-applied to the main branch after the > release. We need to fix this. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4172) Apple binary file incorrectly identified as text/x-sql due to filename
[ https://issues.apache.org/jira/browse/TIKA-4172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17789647#comment-17789647 ] Tilman Hausherr commented on TIKA-4172: --- Your file starts with 00 14 64 30. See also https://www.iana.org/assignments/media-types/application/applefile No I don't agree, because: what is a "binary" file after all? There is no fixed definition for this, it's just a file that hasn't been classified. > Apple binary file incorrectly identified as text/x-sql due to filename > -- > > Key: TIKA-4172 > URL: https://issues.apache.org/jira/browse/TIKA-4172 > Project: Tika > Issue Type: Bug > Components: general >Affects Versions: 2.9.1 >Reporter: martin k. >Priority: Minor > > This is related to [https://github.com/eikek/docspell/issues/2376] and > [https://github.com/eikek/docspell/issues/2403.] > Take the following Base64 encoding of a binary Apple-generated file. No idea > what it does. You can get the file by piping the following to e.g. {{base64 > -d > something.sql}} > {code:java} > ABRkMDEwMWM2Nl9teVNRTDQwLnNxbAAA > bUJJTgAA > AACCgf+/AAA= > {code} > If this file is name {{{}something.sql{}}}, then Tika will classify it as > {{{}text/x-sql{}}}, which it is not. It seems like more weight is given to > the filename (extension) than the fact that the file is binary anyway. -- This message was sent by Atlassian Jira (v8.20.10#820010)
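The "starts with 00 14 64 30" observation can be reproduced with nothing but the JDK. The sketch below decodes only the first Base64 line quoted in the issue description and prints its leading bytes as hex; the class and helper names (HeaderBytes, hexPrefix) are made up for this illustration.

```java
import java.util.Base64;

public class HeaderBytes {

    /** Formats the first {@code count} bytes as space-separated two-digit hex. */
    public static String hexPrefix(byte[] data, int count) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < Math.min(count, data.length); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", data[i] & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // First line of the Base64 sample from the issue description.
        byte[] data = Base64.getDecoder().decode("ABRkMDEwMWM2Nl9teVNRTDQwLnNxbAAA");
        System.out.println(hexPrefix(data, 4)); // prints "00 14 64 30"
    }
}
```

These leading bytes are what a magic-based detector would have to match; no file name is involved at this point.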
[jira] [Commented] (TIKA-4172) Apple binary file incorrectly identified as text/x-sql due to filename
[ https://issues.apache.org/jira/browse/TIKA-4172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17789542#comment-17789542 ] Tilman Hausherr commented on TIKA-4172: --- application/octet-stream is defined by the detection interface as the default when the type is unknown. tika-mimetypes.xml doesn't seem to have any magic that matches your file content. > Apple binary file incorrectly identified as text/x-sql due to filename > -- > > Key: TIKA-4172 > URL: https://issues.apache.org/jira/browse/TIKA-4172 > Project: Tika > Issue Type: Bug > Components: general >Affects Versions: 2.9.1 >Reporter: martin k. >Priority: Minor > > This is related to [https://github.com/eikek/docspell/issues/2376] and > [https://github.com/eikek/docspell/issues/2403.] > Take the following Base64 encoding of a binary Apple-generated file. No idea > what it does. You can get the file by piping the following to e.g. {{base64 > -d > something.sql}} > {code:java} > ABRkMDEwMWM2Nl9teVNRTDQwLnNxbAAA > bUJJTgAA > AACCgf+/AAA= > {code} > If this file is name {{{}something.sql{}}}, then Tika will classify it as > {{{}text/x-sql{}}}, which it is not. It seems like more weight is given to > the filename (extension) than the fact that the file is binary anyway. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4172) Apple binary file incorrectly identified as text/x-sql due to filename
[ https://issues.apache.org/jira/browse/TIKA-4172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17789318#comment-17789318 ] Tilman Hausherr commented on TIKA-4172: --- https://tika.apache.org/2.1.0/detection.html "Where the name of the file is known, it is sometimes possible to guess the file type from the name or extension. Within the tika-mimetypes.xml file is a list of patterns which are used to identify the type from the filename. However, because files may be renamed, this method of detection is quick but not always as accurate." > Apple binary file incorrectly identified as text/x-sql due to filename > -- > > Key: TIKA-4172 > URL: https://issues.apache.org/jira/browse/TIKA-4172 > Project: Tika > Issue Type: Bug > Components: general >Affects Versions: 2.9.1 >Reporter: martin k. >Priority: Minor > > This is related to [https://github.com/eikek/docspell/issues/2376] and > [https://github.com/eikek/docspell/issues/2403.] > Take the following Base64 encoding of a binary Apple-generated file. No idea > what it does. You can get the file by piping the following to e.g. {{base64 > -d > something.sql}} > {code:java} > ABRkMDEwMWM2Nl9teVNRTDQwLnNxbAAA > bUJJTgAA > AACCgf+/AAA= > {code} > If this file is name {{{}something.sql{}}}, then Tika will classify it as > {{{}text/x-sql{}}}, which it is not. It seems like more weight is given to > the filename (extension) than the fact that the file is binary anyway. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (TIKA-4172) Apple binary file incorrectly identified as text/x-sql due to filename
[ https://issues.apache.org/jira/browse/TIKA-4172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17788982#comment-17788982 ] Tilman Hausherr edited comment on TIKA-4172 at 11/23/23 5:05 AM: - Which tika call are you using? Have you tried detecting based on content only? was (Author: tilman): Which tika call are you using? Have you tried detecting purely on content? > Apple binary file incorrectly identified as text/x-sql due to filename > -- > > Key: TIKA-4172 > URL: https://issues.apache.org/jira/browse/TIKA-4172 > Project: Tika > Issue Type: Bug > Components: general >Affects Versions: 2.9.1 >Reporter: martin k. >Priority: Minor > > This is related to [https://github.com/eikek/docspell/issues/2376] and > [https://github.com/eikek/docspell/issues/2403.] > Take the following Base64 encoding of a binary Apple-generated file. No idea > what it does. You can get the file by piping the following to e.g. {{base64 > -d > something.sql}} > {code:java} > ABRkMDEwMWM2Nl9teVNRTDQwLnNxbAAA > bUJJTgAA > AACCgf+/AAA= > {code} > If this file is name {{{}something.sql{}}}, then Tika will classify it as > {{{}text/x-sql{}}}, which it is not. It seems like more weight is given to > the filename (extension) than the fact that the file is binary anyway. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4172) Apple binary file incorrectly identified as text/x-sql due to filename
[ https://issues.apache.org/jira/browse/TIKA-4172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17788982#comment-17788982 ] Tilman Hausherr commented on TIKA-4172: --- Which tika call are you using? Have you tried detecting purely on content? > Apple binary file incorrectly identified as text/x-sql due to filename > -- > > Key: TIKA-4172 > URL: https://issues.apache.org/jira/browse/TIKA-4172 > Project: Tika > Issue Type: Bug > Components: general >Affects Versions: 2.9.1 >Reporter: martin k. >Priority: Minor > > This is related to [https://github.com/eikek/docspell/issues/2376] and > [https://github.com/eikek/docspell/issues/2403.] > Take the following Base64 encoding of a binary Apple-generated file. No idea > what it does. You can get the file by piping the following to e.g. {{base64 > -d > something.sql}} > {code:java} > ABRkMDEwMWM2Nl9teVNRTDQwLnNxbAAA > bUJJTgAA > AACCgf+/AAA= > {code} > If this file is name {{{}something.sql{}}}, then Tika will classify it as > {{{}text/x-sql{}}}, which it is not. It seems like more weight is given to > the filename (extension) than the fact that the file is binary anyway. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4166) dependency updates for Tika 3.0
[ https://issues.apache.org/jira/browse/TIKA-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17782915#comment-17782915 ] Tilman Hausherr commented on TIKA-4166: --- The zookeeper update worked locally, but not on the CI :-( > dependency updates for Tika 3.0 > --- > > Key: TIKA-4166 > URL: https://issues.apache.org/jira/browse/TIKA-4166 > Project: Tika > Issue Type: Task > Components: build >Reporter: Tilman Hausherr >Priority: Minor > Fix For: 3.0.0-BETA > > > Separate ticket for updates for 3.0, especially those not found by dependabot. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-4166) dependency updates for Tika 3.0
Tilman Hausherr created TIKA-4166: - Summary: dependency updates for Tika 3.0 Key: TIKA-4166 URL: https://issues.apache.org/jira/browse/TIKA-4166 Project: Tika Issue Type: Task Components: build Reporter: Tilman Hausherr Fix For: 3.0.0-BETA Separate ticket for updates for 3.0, especially those not found by dependabot. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-4162) Update to 2.9.2
Tilman Hausherr created TIKA-4162: - Summary: Update to 2.9.2 Key: TIKA-4162 URL: https://issues.apache.org/jira/browse/TIKA-4162 Project: Tika Issue Type: Task Components: build Reporter: Tilman Hausherr -- This message was sent by Atlassian Jira (v8.20.10#820010)