[jira] [Commented] (TIKA-4251) [DISCUSS] move to cosium's git-code-format-maven-plugin with google-java-format

2024-06-24 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17859718#comment-17859718
 ] 

Tilman Hausherr commented on TIKA-4251:
---

I'm wondering if this means lots of changes to check at the beginning. This is 
the kind of plugin that would be ideal for a supply chain attack.

> [DISCUSS] move to cosium's git-code-format-maven-plugin with 
> google-java-format
> ---
>
> Key: TIKA-4251
> URL: https://issues.apache.org/jira/browse/TIKA-4251
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> I was recently working a bit on incubator-stormcrawler, and I noticed that 
> they are using cosium's git-code-format-maven-plugin: 
> https://github.com/Cosium/git-code-format-maven-plugin
> I was initially annoyed that I couldn't quickly figure out what I had to fix 
> to make the linter happy, but then I realized there was a magic command: 
> {{mvn git-code-format:format-code}} which just fixed the code so that the 
> linter passed. 
> The one drawback I found is that it does not fix nor does it alert on 
> wildcard imports.  We could still use checkstyle for that but only have one 
> rule for checkstyle.
> The other drawback is that there is not a lot of room for variation from 
> google's style. This may actually be a benefit, too, of course.
> I just ran this on {{tika-core}} here: 
> https://github.com/apache/tika/tree/google-java-format
> What would you think about making this change for 3.x?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (TIKA-4270) wrong skew angle in tika-parser-ocr-module

2024-06-20 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4270:
--
Description: 
We use Tika to extract text from different sources, including images with text 
that is rotated at a certain angle. To extract text from an image with OCR, Tika 
first deskews the image. The skew angle is not calculated correctly. In the 
example [^for_issue] (PNG file), the text is rotated at an angle of ~40 degrees, 
but the skew angle function 
(org.apache.tika.parser.ocr.tess4j.ImageDeskew#getSkewAngle) returns an angle 
of about 15. The slope angle calculation flag is enabled.

The documentation 
(https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#:~:text=To%20identify%20rotation)
 does not have sufficient information for this version of Tika: there is a TODO 
box and some relevant information for Tika 1 (which requires Python and its 
libraries, whereas in the version of Tika we use, angle calculations are 
implemented using only Java).

  was:
We use Tika to extract text from different sources, including images with text 
that is rotated at a certain angle. To extract text from an image with OCR, Tika 
first deskews the image. The skew angle is not calculated correctly. In the 
example [^for_issue], the text is rotated at an angle of ~40 degrees, but the 
skew angle function (org.apache.tika.parser.ocr.tess4j.ImageDeskew#getSkewAngle) 
returns an angle of about 15. The slope angle calculation flag is enabled.

The documentation 
(https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#:~:text=To%20identify%20rotation)
 does not have sufficient information for this version of Tika: there is a TODO 
box and some relevant information for Tika 1 (which requires Python and its 
libraries, whereas in the version of Tika we use, angle calculations are 
implemented using only Java).


> wrong skew angle in tika-parser-ocr-module
> --
>
> Key: TIKA-4270
> URL: https://issues.apache.org/jira/browse/TIKA-4270
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 2.9.1
>Reporter: Roman
>Priority: Major
> Attachments: for_issue
>
>
> We use Tika to extract text from different sources, including images with 
> text that is rotated at a certain angle. To extract text from an image with 
> OCR, Tika first deskews the image. The skew angle is not calculated correctly. 
> In the example [^for_issue] (PNG file), the text is rotated at an angle of ~40 
> degrees, but the skew angle function 
> (org.apache.tika.parser.ocr.tess4j.ImageDeskew#getSkewAngle) returns an angle 
> of about 15. The slope angle calculation flag is enabled.
> The documentation 
> (https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#:~:text=To%20identify%20rotation)
>  does not have sufficient information for this version of Tika: there is a 
> TODO box and some relevant information for Tika 1 (which requires Python and 
> its libraries, whereas in the version of Tika we use, angle calculations are 
> implemented using only Java).





[jira] [Closed] (TIKA-4267) Not getting correct mime type for a few file extensions. example: csv

2024-06-10 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr closed TIKA-4267.
-
Resolution: Invalid

Closing for now, please comment and/or reopen if needed.

> Not getting correct mime type for a few file extensions. example: csv
> -
>
> Key: TIKA-4267
> URL: https://issues.apache.org/jira/browse/TIKA-4267
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.28.4
>Reporter: niv
>Priority: Major
>
> The MIME type for CSV files is always incorrectly detected as text/plain.
> Using the method {{detector.detect(stream, metadata);}}
> jar file used - Tika 1.28.4
> How can I get the correct MIME type in a Java application?
> Please direct me to the list of extensions, with their MIME types, currently 
> supported by the latest jar, if there is one.





[jira] [Updated] (TIKA-4267) Not getting correct mime type for a few file extensions. example: csv

2024-06-03 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4267:
--
Summary: Not getting correct mime type for a few file extensions. example: 
csv  (was: Not getting correct mimet type for few file extensions. example :csv)

> Not getting correct mime type for a few file extensions. example: csv
> -
>
> Key: TIKA-4267
> URL: https://issues.apache.org/jira/browse/TIKA-4267
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.28.4
>Reporter: niv
>Priority: Major
>
> The MIME type for CSV files is always incorrectly detected as text/plain.
> Using the method {{detector.detect(stream, metadata);}}
> jar file used - Tika 1.28.4
> How can I get the correct MIME type in a Java application?
> Please direct me to the list of extensions, with their MIME types, currently 
> supported by the latest jar, if there is one.





[jira] [Updated] (TIKA-4267) Not getting correct mimet type for few file extensions. example :csv

2024-06-03 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4267:
--
Affects Version/s: 1.28.4

> Not getting correct mimet type for few file extensions. example :csv
> 
>
> Key: TIKA-4267
> URL: https://issues.apache.org/jira/browse/TIKA-4267
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.28.4
>Reporter: niv
>Priority: Major
>
> The MIME type for CSV files is always incorrectly detected as text/plain.
> Using the method {{detector.detect(stream, metadata);}}
> jar file used - Tika 1.28.4
> How can I get the correct MIME type in a Java application?
> Please direct me to the list of extensions, with their MIME types, currently 
> supported by the latest jar, if there is one.





[jira] [Comment Edited] (TIKA-4267) Not getting correct mimet type for few file extensions. example :csv

2024-06-03 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17851598#comment-17851598
 ] 

Tilman Hausherr edited comment on TIKA-4267 at 6/3/24 12:06 PM:


The current version is 2.9.2, please retry with that one.

Get the list of parsers with this code:
{code:java}
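import java.util.Map;
import org.apache.tika.Tika;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;

AutoDetectParser parser = new AutoDetectParser();
ParseContext context = new ParseContext();
// Maps each supported media type to the parser that handles it
Map<MediaType, Parser> parsers = parser.getParsers(context);
Tika tika = new Tika();
System.out.println(tika.toString());
System.out.println("List of parsers: ");
int idx = 0;
for (Map.Entry<MediaType, Parser> p : parsers.entrySet())
{
    MediaType t = p.getKey();
    System.out.println((idx + 1) + ".- " + t.getType() + "/" + t.getSubtype());
    ++idx;
}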
{code}


was (Author: tilman):
The current version is 2.9.2, please retry with that one.

> Not getting correct mimet type for few file extensions. example :csv
> 
>
> Key: TIKA-4267
> URL: https://issues.apache.org/jira/browse/TIKA-4267
> Project: Tika
>  Issue Type: Bug
>Reporter: niv
>Priority: Major
>
> The MIME type for CSV files is always incorrectly detected as text/plain.
> Using the method {{detector.detect(stream, metadata);}}
> jar file used - Tika 1.28.4
> How can I get the correct MIME type in a Java application?
> Please direct me to the list of extensions, with their MIME types, currently 
> supported by the latest jar, if there is one.





[jira] [Comment Edited] (TIKA-4267) Not getting correct mime type for a few file extensions. example: csv

2024-06-03 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17851598#comment-17851598
 ] 

Tilman Hausherr edited comment on TIKA-4267 at 6/3/24 12:07 PM:


The current version is 2.9.2, please retry with that one; if it still doesn't 
work, please attach your csv file.

Get the list of parsers with this code:
{code:java}
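import java.util.Map;
import org.apache.tika.Tika;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;

AutoDetectParser parser = new AutoDetectParser();
ParseContext context = new ParseContext();
// Maps each supported media type to the parser that handles it
Map<MediaType, Parser> parsers = parser.getParsers(context);
Tika tika = new Tika();
System.out.println(tika.toString());
System.out.println("List of parsers: ");
int idx = 0;
for (Map.Entry<MediaType, Parser> p : parsers.entrySet())
{
    MediaType t = p.getKey();
    System.out.println((idx + 1) + ".- " + t.getType() + "/" + t.getSubtype());
    ++idx;
}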
{code}


was (Author: tilman):
The current version is 2.9.2, please retry with that one.

Get the list of parsers with this code:
{code:java}
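import java.util.Map;
import org.apache.tika.Tika;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;

AutoDetectParser parser = new AutoDetectParser();
ParseContext context = new ParseContext();
// Maps each supported media type to the parser that handles it
Map<MediaType, Parser> parsers = parser.getParsers(context);
Tika tika = new Tika();
System.out.println(tika.toString());
System.out.println("List of parsers: ");
int idx = 0;
for (Map.Entry<MediaType, Parser> p : parsers.entrySet())
{
    MediaType t = p.getKey();
    System.out.println((idx + 1) + ".- " + t.getType() + "/" + t.getSubtype());
    ++idx;
}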
{code}

> Not getting correct mime type for a few file extensions. example: csv
> -
>
> Key: TIKA-4267
> URL: https://issues.apache.org/jira/browse/TIKA-4267
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.28.4
>Reporter: niv
>Priority: Major
>
> The MIME type for CSV files is always incorrectly detected as text/plain.
> Using the method {{detector.detect(stream, metadata);}}
> jar file used - Tika 1.28.4
> How can I get the correct MIME type in a Java application?
> Please direct me to the list of extensions, with their MIME types, currently 
> supported by the latest jar, if there is one.





[jira] [Commented] (TIKA-4267) Not getting correct mimet type for few file extensions. example :csv

2024-06-03 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17851598#comment-17851598
 ] 

Tilman Hausherr commented on TIKA-4267:
---

The current version is 2.9.2, please retry with that one.

> Not getting correct mimet type for few file extensions. example :csv
> 
>
> Key: TIKA-4267
> URL: https://issues.apache.org/jira/browse/TIKA-4267
> Project: Tika
>  Issue Type: Bug
>Reporter: niv
>Priority: Major
>
> The MIME type for CSV files is always incorrectly detected as text/plain.
> Using the method {{detector.detect(stream, metadata);}}
> jar file used - Tika 1.28.4
> How can I get the correct MIME type in a Java application?
> Please direct me to the list of extensions, with their MIME types, currently 
> supported by the latest jar, if there is one.





[jira] [Updated] (TIKA-1907) Big Pdf parsing to text - Out of memory

2024-05-15 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-1907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-1907:
--
Fix Version/s: 3.0.0

> Big Pdf parsing to text - Out of memory
> ---
>
> Key: TIKA-1907
> URL: https://issues.apache.org/jira/browse/TIKA-1907
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.12
>Reporter: Nicolas Daniels
>Priority: Major
> Fix For: 3.0.0
>
>
> Linked to PDFBox issue: [https://issues.apache.org/jira/browse/PDFBOX-3284]
> I'm duplicating it here to make sure it will be fixed in Tika as well. Maybe 
> PDFBox is not the appropriate lib to use in such case.
> Trying to read the same PDF using Tika leads to the same problem:
> {code:title=Test.java|borderStyle=solid}
> @Test
> public void testParsePdf_Content_Memory() throws Exception {
>     // try-with-resources closes both the input stream and the writer,
>     // even if parsing fails
>     try (InputStream inputStream = new FileInputStream(
>                  "c:/tmp/sr2015_mx_clearing_3dot0_mdr2_solution.pdf");
>          FileWriter fileWriter = new FileWriter(new File("c:/tmp/test.txt"))) {
>         // Write the extracted text to a file rather than keeping it in memory
>         BodyContentHandler handler = new BodyContentHandler(fileWriter);
>         Metadata metadata = new Metadata();
>         new PDFParser().parse(inputStream, handler, metadata, new ParseContext());
>     }
> }
> {code}





[jira] [Comment Edited] (TIKA-4254) The test `TestMimeTypes#testJavaRegex` is not idempotent, as it passes in the first run and fails in repeated runs in the same environment.

2024-05-12 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845590#comment-17845590
 ] 

Tilman Hausherr edited comment on TIKA-4254 at 5/12/24 9:40 AM:


THausherr commented on PR #1754:
URL: https://github.com/apache/tika/pull/1754#issuecomment-2105679546

   Maybe I get it: {{repo = config.getMimeRepository();}} isn't creating 
anything new, it's retrieving something that is changed later by the test? If 
my understanding is correct then it's a deeper problem.





was (Author: githubbot):
THausherr commented on PR #1754:
URL: https://github.com/apache/tika/pull/1754#issuecomment-2105679546

   Maybe I get it: `repo = config.getMimeRepository();` isn't creating anything 
new, it's retrieving something that is changed later by the test? If my 
understanding is correct then it's a deeper problem.
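
The shared-state effect discussed in this issue can be illustrated with a 
minimal, self-contained sketch. It does not use Tika: FakeType, FakeRegistry 
and the two test methods are hypothetical stand-ins, and the conflict rule (an 
already registered pattern may only be re-added by the very same instance) is 
an assumption modelled on the behaviour described in the issue.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

public class SharedRegistrySketch {

    /** Hypothetical stand-in for a MIME type; compared by identity only. */
    public static final class FakeType {
        final String name;
        FakeType(String name) { this.name = name; }
    }

    /** Hypothetical stand-in for a repository that outlives a single test run. */
    public static final class FakeRegistry {
        private final Map<String, FakeType> patterns = new HashMap<>();

        void addPattern(FakeType type, String pattern) {
            FakeType previous = patterns.putIfAbsent(pattern, type);
            // Assumed rule: re-adding with the same instance is a no-op,
            // any other instance is a conflict.
            if (previous != null && previous != type) {
                throw new IllegalStateException("Conflicting pattern: " + pattern);
            }
        }
    }

    private static final FakeType STATIC_TYPE = new FakeType("foo/static");

    /** Creates a new type on every run, like the original test. */
    public static void testWithFreshType(FakeRegistry repo) {
        repo.addPattern(new FakeType("foo/fresh"), "some-pattern");
    }

    /** Reuses one static instance, like the proposed fix. */
    public static void testWithStaticType(FakeRegistry repo) {
        repo.addPattern(STATIC_TYPE, "some-pattern");
    }

    /** Runs the test twice against one registry; true if the second run throws. */
    public static boolean failsOnRerun(Consumer<FakeRegistry> test) {
        FakeRegistry shared = new FakeRegistry();
        test.accept(shared);
        try {
            test.accept(shared);
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(failsOnRerun(SharedRegistrySketch::testWithFreshType));  // true
        System.out.println(failsOnRerun(SharedRegistrySketch::testWithStaticType)); // false
    }
}
```

In this model the static instance only avoids the exception; the registry 
still keeps state from the first run, which is the "deeper problem" raised in 
the comment.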




> The test `TestMimeTypes#testJavaRegex` is not idempotent, as it passes in the 
> first run and fails in repeated runs in the same environment. 
> 
>
> Key: TIKA-4254
> URL: https://issues.apache.org/jira/browse/TIKA-4254
> Project: Tika
>  Issue Type: Bug
>Reporter: Kaiyao Ke
>Priority: Major
>
> ### Brief Description of the Bug
> The test `TestMimeTypes#testJavaRegex` is non-idempotent, as it passes in the 
> first run but fails in the second run in the same environment. The source of 
> the problem is that each test execution initializes a new media type 
> (`MimeType`) instance `testType` (same problem for `testType2`), and all 
> media types across different test executions attempt to use the same name 
> pattern `"rtg_sst_grb_0\\.5\\.\\d{8}"`. Therefore, in the second execution of 
> the test, the line `this.repo.addPattern(testType, pattern, true);` will 
> throw an error, since the name pattern is already used by the `testType` 
> instance created in the first test execution. Specifically, in the second 
> run, the `addGlob()` method of the `Patterns` class detects the conflicting 
> pattern and throws a `MimeTypeException` (line 123 in `Patterns.java`).
> ### Failure Message in the 2nd Test Run:
> ```
> org.apache.tika.mime.MimeTypeException: Conflicting glob pattern: 
> rtg_sst_grb_0\.5\.\d{8}
>   at org.apache.tika.mime.Patterns.addGlob(Patterns.java:123)
>   at org.apache.tika.mime.Patterns.add(Patterns.java:71)
>   at org.apache.tika.mime.MimeTypes.addPattern(MimeTypes.java:450)
>   at org.apache.tika.mime.TestMimeTypes.testJavaRegex(TestMimeTypes.java:851)
>   at java.base/java.lang.reflect.Method.invoke(Method.java:568)
>   at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
>   at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
> ```
> ### Reproduce
> Use the `NIOInspector` plugin that supports rerunning individual tests in the 
> same environment:
> ```
> cd tika-parsers/tika-parsers-standard/tika-parsers-standard-package
> mvn edu.illinois:NIOInspector:rerun -Dtest=org.apache.tika.mime.TestMimeTypes#testJavaRegex
> ```
> ### Proposed Fix
> Declare `testType` and `testType2` as static variables and initialize them at 
> class loading time. Repeated runs of `testJavaRegex()` will then no longer 
> conflict with each other. All tests pass and are idempotent after the fix.
> ### Necessity of Fix
> A fix is recommended as unit tests shall be idempotent, and state pollution 
> shall be mitigated so that newly introduced tests do not fail in the future 
> due to polluted shared states.





[jira] [Commented] (TIKA-4254) The test `TestMimeTypes#testJavaRegex` is not idempotent, as it passes in the first run and fails in repeated runs in the same environment.

2024-05-11 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845566#comment-17845566
 ] 

Tilman Hausherr commented on TIKA-4254:
---

Why would we ever run the test twice in the same environment?

> The test `TestMimeTypes#testJavaRegex` is not idempotent, as it passes in the 
> first run and fails in repeated runs in the same environment. 
> 
>
> Key: TIKA-4254
> URL: https://issues.apache.org/jira/browse/TIKA-4254
> Project: Tika
>  Issue Type: Bug
>Reporter: Kaiyao Ke
>Priority: Major
>
> ### Brief Description of the Bug
> The test `TestMimeTypes#testJavaRegex` is non-idempotent, as it passes in the 
> first run but fails in the second run in the same environment. The source of 
> the problem is that each test execution initializes a new media type 
> (`MimeType`) instance `testType` (same problem for `testType2`), and all 
> media types across different test executions attempt to use the same name 
> pattern `"rtg_sst_grb_0\\.5\\.\\d{8}"`. Therefore, in the second execution of 
> the test, the line `this.repo.addPattern(testType, pattern, true);` will 
> throw an error, since the name pattern is already used by the `testType` 
> instance created in the first test execution. Specifically, in the second 
> run, the `addGlob()` method of the `Patterns` class detects the conflicting 
> pattern and throws a `MimeTypeException` (line 123 in `Patterns.java`).
> ### Failure Message in the 2nd Test Run:
> ```
> org.apache.tika.mime.MimeTypeException: Conflicting glob pattern: 
> rtg_sst_grb_0\.5\.\d{8}
>   at org.apache.tika.mime.Patterns.addGlob(Patterns.java:123)
>   at org.apache.tika.mime.Patterns.add(Patterns.java:71)
>   at org.apache.tika.mime.MimeTypes.addPattern(MimeTypes.java:450)
>   at org.apache.tika.mime.TestMimeTypes.testJavaRegex(TestMimeTypes.java:851)
>   at java.base/java.lang.reflect.Method.invoke(Method.java:568)
>   at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
>   at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
> ```
> ### Reproduce
> Use the `NIOInspector` plugin that supports rerunning individual tests in the 
> same environment:
> ```
> cd tika-parsers/tika-parsers-standard/tika-parsers-standard-package
> mvn edu.illinois:NIOInspector:rerun -Dtest=org.apache.tika.mime.TestMimeTypes#testJavaRegex
> ```
> ### Proposed Fix
> Declare `testType` and `testType2` as static variables and initialize them at 
> class loading time. Repeated runs of `testJavaRegex()` will then no longer 
> conflict with each other. All tests pass and are idempotent after the fix.
> ### Necessity of Fix
> A fix is recommended as unit tests shall be idempotent, and state pollution 
> shall be mitigated so that newly introduced tests do not fail in the future 
> due to polluted shared states.





[jira] [Commented] (TIKA-4245) Tika does not get html content properly

2024-04-25 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840922#comment-17840922
 ] 

Tilman Hausherr commented on TIKA-4245:
---

The file claims to be utf-16 but it isn't. If I change it to utf-8 in the 
editor then I get an NPE in the GUI.
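
A small sketch of what a wrong charset declaration does to extracted text, 
assuming the sample file starts with plain ASCII such as "<html " and is then 
decoded as UTF-16 (big-endian here). The class and method names are made up 
for illustration; only java.nio.charset is used, not Tika.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class WrongCharsetSketch {

    /** Encodes the text as UTF-8 bytes, then decodes them with the claimed charset. */
    public static String decodeAs(String text, Charset claimed) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, claimed);
    }

    public static void main(String[] args) {
        String html = "<html ";

        // Each pair of ASCII bytes is fused into one 16-bit code unit:
        // '<' (0x3C) + 'h' (0x68) becomes U+3C68, and so on.
        String misread = decodeAs(html, StandardCharsets.UTF_16BE);
        for (char c : misread.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c); // U+3C68, U+746D, U+6C20
        }

        // Decoding with the charset that was actually used round-trips cleanly.
        System.out.println(decodeAs(html, StandardCharsets.UTF_8).equals(html)); // true
    }
}
```

The three code points printed first appear to match the first three 
characters of the garbled output quoted in the issue, which is consistent with 
the file being single-byte text that is labelled as UTF-16.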

> Tika does not get html content properly 
> 
>
> Key: TIKA-4245
> URL: https://issues.apache.org/jira/browse/TIKA-4245
> Project: Tika
>  Issue Type: Bug
>Reporter: Xiaohong Yang
>Priority: Major
> Attachments: Sample html file and tika config xml.zip
>
>
> We use org.apache.tika.parser.AutoDetectParser to get the content of HTML 
> files, and we found that it does not extract the content of the sample file 
> properly.
> Following is the sample code and attached is the tika-config.xml and the 
> sample html file.  The content extracted with Tika reads 
> "㱨瑭氠硭汮猺景㴢桴瑰㨯⽷睷⹷㌮潲术ㄹ㤹⽘卌⽆潲浡琢㸍ਉ़桥慤㸼䵅呁⁨瑴瀭敱畩瘽≃潮瑥湴ⵔ祰攢⁣潮瑥湴㴢瑥硴…". That is different 
> from the native file.
>  
>  
> The operating system is Ubuntu 20.04. Java version is 21.  Tika version is 
> 2.9.2.   
>  {code:java}
> import org.apache.commons.io.FileUtils;
> import org.apache.tika.config.TikaConfig;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.apache.tika.sax.BodyContentHandler;
>
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.PrintWriter;
> import java.nio.file.Files;
> import java.nio.file.Path;
> import java.nio.file.Paths;
>
> public class ExtractTxtFromHtml {
>     private static final Path inputFile =
>             new File("/home/ubuntu/testdirs/testdir_html/451434.html").toPath();
>
>     public static void main(String args[]) {
>         extactText(false);
>         extactText(true);
>     }
>
>     static void extactText(boolean largeFile) {
>         PrintWriter outputFileWriter = null;
>         try {
>             BodyContentHandler handler;
>             Path outputFilePath = null;
>
>             if (largeFile) {
>                 // write tika output to disk
>                 outputFilePath =
>                         Paths.get("/home/ubuntu/testdirs/testdir_html/tika_parse_output.txt");
>                 outputFileWriter = new PrintWriter(Files.newOutputStream(outputFilePath));
>                 handler = new BodyContentHandler(outputFileWriter);
>             } else {
>                 // stream it in memory
>                 handler = new BodyContentHandler(-1);
>             }
>
>             Metadata metadata = new Metadata();
>             FileInputStream inputData = new FileInputStream(inputFile.toFile());
>             TikaConfig config =
>                     new TikaConfig("/home/ubuntu/testdirs/testdir_html/tika-config.xml");
>             Parser autoDetectParser = new AutoDetectParser(config);
>             ParseContext context = new ParseContext();
>             context.set(TikaConfig.class, config);
>             autoDetectParser.parse(inputData, handler, metadata, context);
>
>             String content;
>             if (largeFile) {
>                 content = FileUtils.readFileToString(outputFilePath.toFile());
>             } else {
>                 content = handler.toString();
>             }
>             System.out.println("content = " + content);
>         } catch (Exception ex) {
>             ex.printStackTrace();
>         } finally {
>             if (outputFileWriter != null) {
>                 outputFileWriter.close();
>             }
>         }
>     }
> }
> {code}





[jira] [Commented] (TIKA-4245) Tika does not get html content properly

2024-04-25 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840908#comment-17840908
 ] 

Tilman Hausherr commented on TIKA-4245:
---

Happens also with the tika app GUI.

> Tika does not get html content properly 
> 
>
> Key: TIKA-4245
> URL: https://issues.apache.org/jira/browse/TIKA-4245
> Project: Tika
>  Issue Type: Bug
>Reporter: Xiaohong Yang
>Priority: Major
> Attachments: Sample html file and tika config xml.zip
>
>
> We use org.apache.tika.parser.AutoDetectParser to get the content of HTML 
> files, and we found that it does not extract the content of the sample file 
> properly.
> Following is the sample code and attached is the tika-config.xml and the 
> sample html file.  The content extracted with Tika reads 
> "㱨瑭氠硭汮猺景㴢桴瑰㨯⽷睷⹷㌮潲术ㄹ㤹⽘卌⽆潲浡琢㸍ਉ़桥慤㸼䵅呁⁨瑴瀭敱畩瘽≃潮瑥湴ⵔ祰攢⁣潮瑥湴㴢瑥硴…". That is different 
> from the native file.
>  
>  
> The operating system is Ubuntu 20.04. Java version is 21.  Tika version is 
> 2.9.2.   
>  {code:java}
> import org.apache.commons.io.FileUtils;
> import org.apache.tika.config.TikaConfig;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.apache.tika.sax.BodyContentHandler;
>
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.PrintWriter;
> import java.nio.file.Files;
> import java.nio.file.Path;
> import java.nio.file.Paths;
>
> public class ExtractTxtFromHtml {
>     private static final Path inputFile =
>             new File("/home/ubuntu/testdirs/testdir_html/451434.html").toPath();
>
>     public static void main(String args[]) {
>         extactText(false);
>         extactText(true);
>     }
>
>     static void extactText(boolean largeFile) {
>         PrintWriter outputFileWriter = null;
>         try {
>             BodyContentHandler handler;
>             Path outputFilePath = null;
>
>             if (largeFile) {
>                 // write tika output to disk
>                 outputFilePath =
>                         Paths.get("/home/ubuntu/testdirs/testdir_html/tika_parse_output.txt");
>                 outputFileWriter = new PrintWriter(Files.newOutputStream(outputFilePath));
>                 handler = new BodyContentHandler(outputFileWriter);
>             } else {
>                 // stream it in memory
>                 handler = new BodyContentHandler(-1);
>             }
>
>             Metadata metadata = new Metadata();
>             FileInputStream inputData = new FileInputStream(inputFile.toFile());
>             TikaConfig config =
>                     new TikaConfig("/home/ubuntu/testdirs/testdir_html/tika-config.xml");
>             Parser autoDetectParser = new AutoDetectParser(config);
>             ParseContext context = new ParseContext();
>             context.set(TikaConfig.class, config);
>             autoDetectParser.parse(inputData, handler, metadata, context);
>
>             String content;
>             if (largeFile) {
>                 content = FileUtils.readFileToString(outputFilePath.toFile());
>             } else {
>                 content = handler.toString();
>             }
>             System.out.println("content = " + content);
>         } catch (Exception ex) {
>             ex.printStackTrace();
>         } finally {
>             if (outputFileWriter != null) {
>                 outputFileWriter.close();
>             }
>         }
>     }
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (TIKA-4245) Tika does not get html content properly

2024-04-25 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4245:
--
Description: 
We use org.apache.tika.parser.AutoDetectParser to get the content of HTML 
files, and we found that it does not extract the content of the sample file 
properly.

Following is the sample code and attached is the tika-config.xml and the sample 
html file.  The content extracted with Tika reads 
"㱨瑭氠硭汮猺景㴢桴瑰㨯⽷睷⹷㌮潲术ㄹ㤹⽘卌⽆潲浡琢㸍ਉ़桥慤㸼䵅呁⁨瑴瀭敱畩瘽≃潮瑥湴ⵔ祰攢⁣潮瑥湴㴢瑥硴…". That is different 
from the native file.

 

 

The operating system is Ubuntu 20.04. Java version is 21.  Tika version is 
2.9.2.   

 {code:java}
import org.apache.commons.io.FileUtils;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

import java.io.File;
import java.io.FileInputStream;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ExtractTxtFromHtml {
    private static final Path inputFile =
            new File("/home/ubuntu/testdirs/testdir_html/451434.html").toPath();

    public static void main(String args[]) {
        extactText(false);
        extactText(true);
    }

    static void extactText(boolean largeFile) {
        PrintWriter outputFileWriter = null;
        try {
            BodyContentHandler handler;
            Path outputFilePath = null;

            if (largeFile) {
                // write tika output to disk
                outputFilePath =
                        Paths.get("/home/ubuntu/testdirs/testdir_html/tika_parse_output.txt");
                outputFileWriter = new PrintWriter(Files.newOutputStream(outputFilePath));
                handler = new BodyContentHandler(outputFileWriter);
            } else {
                // stream it in memory
                handler = new BodyContentHandler(-1);
            }

            Metadata metadata = new Metadata();
            FileInputStream inputData = new FileInputStream(inputFile.toFile());
            TikaConfig config =
                    new TikaConfig("/home/ubuntu/testdirs/testdir_html/tika-config.xml");
            Parser autoDetectParser = new AutoDetectParser(config);
            ParseContext context = new ParseContext();
            context.set(TikaConfig.class, config);
            autoDetectParser.parse(inputData, handler, metadata, context);

            String content;
            if (largeFile) {
                content = FileUtils.readFileToString(outputFilePath.toFile());
            } else {
                content = handler.toString();
            }
            System.out.println("content = " + content);
        } catch (Exception ex) {
            ex.printStackTrace();
        } finally {
            if (outputFileWriter != null) {
                outputFileWriter.close();
            }
        }
    }
}
{code}


  was:
We use org.apache.tika.parser.AutoDetectParser to get the content of HTML 
files, and we found that it does not extract the content of the sample file 
properly.

Following is the sample code and attached is the tika-config.xml and the sample 
html file.  The content extracted with Tika reads 
"㱨瑭氠硭汮猺景㴢桴瑰㨯⽷睷⹷㌮潲术ㄹ㤹⽘卌⽆潲浡琢㸍ਉ़桥慤㸼䵅呁⁨瑴瀭敱畩瘽≃潮瑥湴ⵔ祰攢⁣潮瑥湴㴢瑥硴…". That is different 
from the native file.

 

 

The operating system is Ubuntu 20.04. Java version is 21.  Tika version is 
2.9.2.   

 

import org.apache.commons.io.FileUtils;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

import java.io.File;
import java.io.FileInputStream;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ExtractTxtFromHtml {
    private static final Path inputFile = new File("/home/ubuntu/testdirs/testdir_html/451434.html").toPath();

    public static void main(String args[]) {
        extactText(false);
        extactText(true);
    }

    static void extactText(boolean largeFile) {
        PrintWriter outputFileWriter = null;
        try {
            BodyContentHandler handler;
            Path outputFilePath = null;

            if (largeFile) {
                // write tika output to disk
                outputFilePath = Paths.get("/home/ubuntu/testdirs/testdir_html/tika_parse_output.txt");
                outputFileWriter = new PrintWriter(Files.newOutputStream(outputFilePath));
                handler = new BodyContentHandler(outputFileWriter);
            } else {
                // stream it in memory
                handler = new BodyContentHandler(-1);

[jira] [Comment Edited] (TIKA-4166) dependency updates for Tika 3.0

2024-04-22 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839745#comment-17839745
 ] 

Tilman Hausherr edited comment on TIKA-4166 at 4/22/24 3:27 PM:


It turned out to be something different than the missing package. After 
googling for the error message I found an SO answer that I had upvoted in the 
past 
https://stackoverflow.com/a/54467008/535646


was (Author: tilman):
It turned out to be something different than the missing package. After 
googling for the error message I found an SO that I had upvoted in the past 
https://stackoverflow.com/a/54467008/535646

> dependency updates for Tika 3.0
> ---
>
> Key: TIKA-4166
> URL: https://issues.apache.org/jira/browse/TIKA-4166
> Project: Tika
>  Issue Type: Task
>  Components: build
>Reporter: Tilman Hausherr
>Priority: Minor
> Fix For: 3.0.0-BETA
>
>
> Separate ticket for updates for 3.0, especially those not found by dependabot.





[jira] [Commented] (TIKA-4166) dependency updates for Tika 3.0

2024-04-22 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839745#comment-17839745
 ] 

Tilman Hausherr commented on TIKA-4166:
---

It turned out to be something different than the missing package. After 
googling for the error message I found an SO that I had upvoted in the past 
https://stackoverflow.com/a/54467008/535646

> dependency updates for Tika 3.0
> ---
>
> Key: TIKA-4166
> URL: https://issues.apache.org/jira/browse/TIKA-4166
> Project: Tika
>  Issue Type: Task
>  Components: build
>Reporter: Tilman Hausherr
>Priority: Minor
> Fix For: 3.0.0-BETA
>
>
> Separate ticket for updates for 3.0, especially those not found by dependabot.





[jira] [Commented] (TIKA-4166) dependency updates for Tika 3.0

2024-04-22 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839652#comment-17839652
 ] 

Tilman Hausherr commented on TIKA-4166:
---

The latest Apache parent update means a javadoc update, and it results in a 
failure on the CI:
{noformat}
[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-javadoc-plugin:3.6.3:aggregate (default-cli) on 
project tika: An error has occurred in Javadoc report generation:
[ERROR] Exit code: 2
[ERROR] javadoc: error - No source files for package org.apache.tika.extractor
[ERROR] Command line was: 
/usr/local/asfpackages/java/adoptium-jdk-11.0.16.1+1/bin/javadoc @options 
@packages
{noformat}
A possible cause for this could be that in tika-batch there is a test package 
that doesn't exist as a source package. It didn't happen locally for me because 
I didn't use "javadoc:aggregate". I'll do some more tests to see whether 
renaming the test package fixes this.

> dependency updates for Tika 3.0
> ---
>
> Key: TIKA-4166
> URL: https://issues.apache.org/jira/browse/TIKA-4166
> Project: Tika
>  Issue Type: Task
>  Components: build
>Reporter: Tilman Hausherr
>Priority: Minor
> Fix For: 3.0.0-BETA
>
>
> Separate ticket for updates for 3.0, especially those not found by dependabot.





[jira] [Commented] (TIKA-4240) Change dependabot to weekly

2024-04-11 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17836236#comment-17836236
 ] 

Tilman Hausherr commented on TIKA-4240:
---

I prefer daily but if more people feel pressured or annoyed by these mails (I 
never felt that way) then I accept weekly.

> Change dependabot to weekly
> ---
>
> Key: TIKA-4240
> URL: https://issues.apache.org/jira/browse/TIKA-4240
> Project: Tika
>  Issue Type: Task
>  Components: build
>Reporter: Tim Allison
>Priority: Trivial
>
> On the list, I proposed this change. Some were in favor of dropping it back 
> to monthly. [~tilman] made the argument for the benefit of seeing problems 
> quickly and also acknowledged that it is a burden to merge the daily PRs.
> I propose bumping dependabot back to weekly for a bit, and we'll see how it 
> works as a middle ground.
> If anyone feels strongly about moving back to daily, we can do that.





[jira] [Updated] (TIKA-4240) Change dependabot to weekly

2024-04-11 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4240:
--
Component/s: build

> Change dependabot to weekly
> ---
>
> Key: TIKA-4240
> URL: https://issues.apache.org/jira/browse/TIKA-4240
> Project: Tika
>  Issue Type: Task
>  Components: build
>Reporter: Tim Allison
>Priority: Trivial
>
> On the list, I proposed this change. Some were in favor of dropping it back 
> to monthly. [~tilman] made the argument for the benefit of seeing problems 
> quickly and also acknowledged that it is a burden to merge the daily PRs.
> I propose bumping dependabot back to weekly for a bit, and we'll see how it 
> works as a middle ground.
> If anyone feels strongly about moving back to daily, we can do that.





[jira] [Commented] (TIKA-4240) Change dependabot to weekly

2024-04-11 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17836224#comment-17836224
 ] 

Tilman Hausherr commented on TIKA-4240:
---

Not a burden (that was Eric, sort of); I just don't have the time right now to 
fix the current build failure. I like the alerts: they are low-hanging fruit and 
also help me learn more about the code.

> Change dependabot to weekly
> ---
>
> Key: TIKA-4240
> URL: https://issues.apache.org/jira/browse/TIKA-4240
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Trivial
>
> On the list, I proposed this change. Some were in favor of dropping it back 
> to monthly. [~tilman] made the argument for the benefit of seeing problems 
> quickly and also acknowledged that it is a burden to merge the daily PRs.
> I propose bumping dependabot back to weekly for a bit, and we'll see how it 
> works as a middle ground.
> If anyone feels strongly about moving back to daily, we can do that.





[jira] [Commented] (TIKA-4238) replace some deprecated code

2024-04-06 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17834529#comment-17834529
 ] 

Tilman Hausherr commented on TIKA-4238:
---

This was low-hanging fruit. I could also have done 
UnsynchronizedByteArrayInputStream, but replacing that one would not only 
make the code much bigger, it would also require catching an exception that 
isn't thrown now, so let's just wait and see what they do.
https://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/input/UnsynchronizedByteArrayInputStream.Builder.html#get()

> replace some deprecated code
> 
>
> Key: TIKA-4238
> URL: https://issues.apache.org/jira/browse/TIKA-4238
> Project: Tika
>  Issue Type: Task
>Affects Versions: 2.9.2
>Reporter: Tilman Hausherr
>Assignee: Tilman Hausherr
>Priority: Minor
> Fix For: 3.0.0, 2.9.3
>
>






[jira] [Comment Edited] (TIKA-4238) replace some deprecated code

2024-04-06 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17834529#comment-17834529
 ] 

Tilman Hausherr edited comment on TIKA-4238 at 4/6/24 2:12 PM:
---

This was low-hanging fruit. I could also have done 
UnsynchronizedByteArrayInputStream, but replacing that one would not only make 
the code much bigger, it would also require catching an exception that isn't 
thrown now, so let's just wait and see what they do in the future.
https://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/input/UnsynchronizedByteArrayInputStream.Builder.html#get()


was (Author: tilman):
This was a low-hanging fruit. I could also have done 
UnsynchronizedByteArrayInputStream, but replacing that one would not only would 
make the code much bigger, it would also require to catch an exception that 
isn't thrown now, so lets just wait what they do.
https://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/input/UnsynchronizedByteArrayInputStream.Builder.html#get()

> replace some deprecated code
> 
>
> Key: TIKA-4238
> URL: https://issues.apache.org/jira/browse/TIKA-4238
> Project: Tika
>  Issue Type: Task
>Affects Versions: 2.9.2
>Reporter: Tilman Hausherr
>Assignee: Tilman Hausherr
>Priority: Minor
> Fix For: 3.0.0, 2.9.3
>
>






[jira] [Updated] (TIKA-4218) Run regression tests to support 2.9.2 release

2024-04-06 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4218:
--
Affects Version/s: 2.9.1

> Run regression tests to support 2.9.2 release
> -
>
> Key: TIKA-4218
> URL: https://issues.apache.org/jira/browse/TIKA-4218
> Project: Tika
>  Issue Type: Task
>Affects Versions: 2.9.1
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Major
> Fix For: 2.9.2
>
> Attachments: 2.9.1-876503.pdf.json, 2.9.2-876503.pdf.json
>
>






[jira] [Resolved] (TIKA-4218) Run regression tests to support 2.9.2 release

2024-04-06 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved TIKA-4218.
---
  Assignee: Tim Allison
Resolution: Fixed

> Run regression tests to support 2.9.2 release
> -
>
> Key: TIKA-4218
> URL: https://issues.apache.org/jira/browse/TIKA-4218
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Major
> Attachments: 2.9.1-876503.pdf.json, 2.9.2-876503.pdf.json
>
>






[jira] [Assigned] (TIKA-4171) Tika server only returns last value for PDFs that have multiple of the same key

2024-04-06 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr reassigned TIKA-4171:
-

Assignee: Tim Allison

> Tika server only returns last value for PDFs that have multiple of the same 
> key
> ---
>
> Key: TIKA-4171
> URL: https://issues.apache.org/jira/browse/TIKA-4171
> Project: Tika
>  Issue Type: Bug
>  Components: tika-server
>Reporter: Cassandra Xia
>Assignee: Tim Allison
>Priority: Major
> Fix For: 3.0.0-BETA, 2.9.2
>
> Attachments: 20230801-5207_QF20-270 East River Solar Form 556 recert 
> FINAL.pdf, 876503.pdf, example-output.txt, screenshot.png, 
> testPDF_XFA_govdocs1_258578.pdf.html
>
>
> Thanks for the great work on Tika server, it is the only OSS that can handle 
> Adobe's protected form format that FERC uses. 
> One problem that I'm hitting is that the FERC form that I am parsing has 
> multiple values for the same key name, e.g. in the screenshot below line 1-7 
> all have the same key name. When Tika Server parses this PDF, it only returns 
> the value in row 7 (losing the previous 6 values).
> My hunch is that somewhere in Tika Server, the values are getting stored in 
> some dictionary object, so the final value is the only survivor. Would it be 
> possible to return the extra values as a list from Tika Server? 
> Example PDF attached - thank you for taking a look!
> !https://mail.google.com/mail/u/0?ui=2=ee87dc4bd1=0.0.7=msg-f:1782641700487887488=18bd372e8760fa80=fimg=ip=s0-l75-ft=ANGjdJ9qEkw6kZ9yBDfMBOUuvFB1Tk8Pti0rRvReEq-eWUoJQxLA6rZ0TQvWCsKUySaDPjjrSi-IiyKseDYpFGzF44A3iSaFw9sOanoBdFMNEZciDnaGhsUFvLSIH_0=emb=ii_lmdun7ff6!





[jira] [Updated] (TIKA-4218) Run regression tests to support 2.9.2 release

2024-04-06 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4218:
--
Fix Version/s: 2.9.2

> Run regression tests to support 2.9.2 release
> -
>
> Key: TIKA-4218
> URL: https://issues.apache.org/jira/browse/TIKA-4218
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Major
> Fix For: 2.9.2
>
> Attachments: 2.9.1-876503.pdf.json, 2.9.2-876503.pdf.json
>
>






[jira] [Resolved] (TIKA-4171) Tika server only returns last value for PDFs that have multiple of the same key

2024-04-06 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved TIKA-4171.
---
Resolution: Fixed

> Tika server only returns last value for PDFs that have multiple of the same 
> key
> ---
>
> Key: TIKA-4171
> URL: https://issues.apache.org/jira/browse/TIKA-4171
> Project: Tika
>  Issue Type: Bug
>  Components: tika-server
>Reporter: Cassandra Xia
>Priority: Major
> Fix For: 2.9.2, 3.0.0-BETA
>
> Attachments: 20230801-5207_QF20-270 East River Solar Form 556 recert 
> FINAL.pdf, 876503.pdf, example-output.txt, screenshot.png, 
> testPDF_XFA_govdocs1_258578.pdf.html
>
>
> Thanks for the great work on Tika server, it is the only OSS that can handle 
> Adobe's protected form format that FERC uses. 
> One problem that I'm hitting is that the FERC form that I am parsing has 
> multiple values for the same key name, e.g. in the screenshot below line 1-7 
> all have the same key name. When Tika Server parses this PDF, it only returns 
> the value in row 7 (losing the previous 6 values).
> My hunch is that somewhere in Tika Server, the values are getting stored in 
> some dictionary object, so the final value is the only survivor. Would it be 
> possible to return the extra values as a list from Tika Server? 
> Example PDF attached - thank you for taking a look!
> !https://mail.google.com/mail/u/0?ui=2=ee87dc4bd1=0.0.7=msg-f:1782641700487887488=18bd372e8760fa80=fimg=ip=s0-l75-ft=ANGjdJ9qEkw6kZ9yBDfMBOUuvFB1Tk8Pti0rRvReEq-eWUoJQxLA6rZ0TQvWCsKUySaDPjjrSi-IiyKseDYpFGzF44A3iSaFw9sOanoBdFMNEZciDnaGhsUFvLSIH_0=emb=ii_lmdun7ff6!





[jira] [Resolved] (TIKA-4238) replace some deprecated code

2024-04-06 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved TIKA-4238.
---
Resolution: Fixed

> replace some deprecated code
> 
>
> Key: TIKA-4238
> URL: https://issues.apache.org/jira/browse/TIKA-4238
> Project: Tika
>  Issue Type: Task
>Affects Versions: 2.9.2
>Reporter: Tilman Hausherr
>Assignee: Tilman Hausherr
>Priority: Minor
> Fix For: 3.0.0, 2.9.3
>
>






[jira] [Created] (TIKA-4239) Update to 2.9.3

2024-04-06 Thread Tilman Hausherr (Jira)
Tilman Hausherr created TIKA-4239:
-

 Summary: Update to 2.9.3
 Key: TIKA-4239
 URL: https://issues.apache.org/jira/browse/TIKA-4239
 Project: Tika
  Issue Type: Task
  Components: build
Reporter: Tilman Hausherr








[jira] [Updated] (TIKA-4239) Update to 2.9.3

2024-04-06 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4239:
--
Affects Version/s: 2.9.2

> Update to 2.9.3
> ---
>
> Key: TIKA-4239
> URL: https://issues.apache.org/jira/browse/TIKA-4239
> Project: Tika
>  Issue Type: Task
>  Components: build
>Affects Versions: 2.9.2
>Reporter: Tilman Hausherr
>Priority: Minor
>






[jira] [Resolved] (TIKA-4162) Update to 2.9.2

2024-04-06 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved TIKA-4162.
---
  Assignee: Tilman Hausherr
Resolution: Fixed

> Update to 2.9.2
> ---
>
> Key: TIKA-4162
> URL: https://issues.apache.org/jira/browse/TIKA-4162
> Project: Tika
>  Issue Type: Task
>  Components: build
>Affects Versions: 2.9.1
>Reporter: Tilman Hausherr
>Assignee: Tilman Hausherr
>Priority: Minor
> Fix For: 2.9.2
>
>






[jira] [Created] (TIKA-4238) replace some deprecated code

2024-04-06 Thread Tilman Hausherr (Jira)
Tilman Hausherr created TIKA-4238:
-

 Summary: replace some deprecated code
 Key: TIKA-4238
 URL: https://issues.apache.org/jira/browse/TIKA-4238
 Project: Tika
  Issue Type: Task
Affects Versions: 2.9.2
Reporter: Tilman Hausherr
Assignee: Tilman Hausherr
 Fix For: 3.0.0, 2.9.3








[jira] [Updated] (TIKA-4236) tika-parser-nlp-module has an unnecessary Guava dependency

2024-04-05 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4236:
--
Fix Version/s: 2.9.3

> tika-parser-nlp-module has an unnecessary Guava dependency
> --
>
> Key: TIKA-4236
> URL: https://issues.apache.org/jira/browse/TIKA-4236
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.28.5, 3.0.0-BETA, 2.9.2
>Reporter: Manfred Baedke
>Priority: Major
> Fix For: 3.0.0, 2.9.3
>
>
> This should be avoided, because it's prone to maintenance and security 
> problems.
> It's easy to get rid of it: the class 
> {{o.a.t.parser.geo.topic.gazetteer.GeoGazetteerClient}} uses 
> {{{}com.google.common.reflect.TypeToken{}}}. Since the project uses gson 
> anyway, it could just be replaced with 
> {{{}com.google.gson.reflect.TypeToken{}}}.





[jira] [Updated] (TIKA-4236) tika-parser-nlp-module has an unnecessary Guava dependency

2024-04-05 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4236:
--
Fix Version/s: (was: 2.9.2)

> tika-parser-nlp-module has an unnecessary Guava dependency
> --
>
> Key: TIKA-4236
> URL: https://issues.apache.org/jira/browse/TIKA-4236
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.28.5, 3.0.0-BETA, 2.9.2
>Reporter: Manfred Baedke
>Priority: Major
> Fix For: 3.0.0
>
>
> This should be avoided, because it's prone to maintenance and security 
> problems.
> It's easy to get rid of it: the class 
> {{o.a.t.parser.geo.topic.gazetteer.GeoGazetteerClient}} uses 
> {{{}com.google.common.reflect.TypeToken{}}}. Since the project uses gson 
> anyway, it could just be replaced with 
> {{{}com.google.gson.reflect.TypeToken{}}}.





[jira] [Resolved] (TIKA-4236) tika-parser-nlp-module has an unnecessary Guava dependency

2024-04-05 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved TIKA-4236.
---
  Assignee: Tilman Hausherr
Resolution: Fixed

> tika-parser-nlp-module has an unnecessary Guava dependency
> --
>
> Key: TIKA-4236
> URL: https://issues.apache.org/jira/browse/TIKA-4236
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.28.5, 3.0.0-BETA, 2.9.2
>Reporter: Manfred Baedke
>Assignee: Tilman Hausherr
>Priority: Major
> Fix For: 3.0.0, 2.9.3
>
>
> This should be avoided, because it's prone to maintenance and security 
> problems.
> It's easy to get rid of it: the class 
> {{o.a.t.parser.geo.topic.gazetteer.GeoGazetteerClient}} uses 
> {{{}com.google.common.reflect.TypeToken{}}}. Since the project uses gson 
> anyway, it could just be replaced with 
> {{{}com.google.gson.reflect.TypeToken{}}}.





[jira] [Updated] (TIKA-4236) tika-parser-nlp-module has an unnecessary Guava dependency

2024-04-05 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4236:
--
Fix Version/s: 2.9.2
   3.0.0

> tika-parser-nlp-module has an unnecessary Guava dependency
> --
>
> Key: TIKA-4236
> URL: https://issues.apache.org/jira/browse/TIKA-4236
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.28.5, 3.0.0-BETA, 2.9.2
>Reporter: Manfred Baedke
>Priority: Major
> Fix For: 2.9.2, 3.0.0
>
>
> This should be avoided, because it's prone to maintenance and security 
> problems.
> It's easy to get rid of it: the class 
> {{o.a.t.parser.geo.topic.gazetteer.GeoGazetteerClient}} uses 
> {{{}com.google.common.reflect.TypeToken{}}}. Since the project uses gson 
> anyway, it could just be replaced with 
> {{{}com.google.gson.reflect.TypeToken{}}}.





[jira] [Commented] (TIKA-4236) tika-parser-nlp-module has an unnecessary Guava dependency

2024-04-05 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17834385#comment-17834385
 ] 

Tilman Hausherr commented on TIKA-4236:
---

I found only a test dependency mentioned directly. It's still possible that 
Guava is pulled in transitively through some other dependency. It would be best 
if you tried with a snapshot.

> tika-parser-nlp-module has an unnecessary Guava dependency
> --
>
> Key: TIKA-4236
> URL: https://issues.apache.org/jira/browse/TIKA-4236
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.28.5, 3.0.0-BETA, 2.9.2
>Reporter: Manfred Baedke
>Priority: Major
>
> This should be avoided, because it's prone to maintenance and security 
> problems.
> It's easy to get rid of it: the class 
> {{o.a.t.parser.geo.topic.gazetteer.GeoGazetteerClient}} uses 
> {{{}com.google.common.reflect.TypeToken{}}}. Since the project uses gson 
> anyway, it could just be replaced with 
> {{{}com.google.gson.reflect.TypeToken{}}}.





[jira] [Commented] (TIKA-4236) tika-parser-nlp-module has an unnecessary Guava dependency

2024-04-05 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17834282#comment-17834282
 ] 

Tilman Hausherr commented on TIKA-4236:
---

https://tika.apache.org/
"The Apache Tika PMC has set September 30, 2022 as the End Of Life for the Tika 
1.x branch. The PMC will make security fixes for the 1.x branch until that 
date."

> tika-parser-nlp-module has an unnecessary Guava dependency
> --
>
> Key: TIKA-4236
> URL: https://issues.apache.org/jira/browse/TIKA-4236
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.28.5, 3.0.0-BETA, 2.9.2
>Reporter: Manfred Baedke
>Priority: Major
>
> This should be avoided, because it's prone to maintenance and security 
> problems.
> It's easy to get rid of it: the class 
> {{o.a.t.parser.geo.topic.gazetteer.GeoGazetteerClient}} uses 
> {{{}com.google.common.reflect.TypeToken{}}}. Since the project uses gson 
> anyway, it could just be replaced with 
> {{{}com.google.gson.reflect.TypeToken{}}}.





[jira] [Comment Edited] (TIKA-4236) tika-parser-nlp-module has an unnecessary Guava dependency

2024-04-05 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17834277#comment-17834277
 ] 

Tilman Hausherr edited comment on TIKA-4236 at 4/5/24 12:21 PM:


Is this what you had in mind?
https://github.com/apache/tika/commit/1586e7281837850cfe36d6201d905edb2d77241f
https://github.com/apache/tika/commit/920976d9f26695892ab503b8c93b213d9622d18e
This is for trunk; I can also do it for 2.9.3, but 1.* is no longer 
supported.
Btw, Guava might still reappear in other Tika components.


was (Author: tilman):
Is this what you had in mind?
https://github.com/apache/tika/commit/1586e7281837850cfe36d6201d905edb2d77241f
https://github.com/apache/tika/commit/920976d9f26695892ab503b8c93b213d9622d18e
this is for the trunk, I can also do it for 2.9.3 but 1.* is no longer 
supported.

> tika-parser-nlp-module has an unnecessary Guava dependency
> --
>
> Key: TIKA-4236
> URL: https://issues.apache.org/jira/browse/TIKA-4236
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.28.5, 3.0.0-BETA, 2.9.2
>Reporter: Manfred Baedke
>Priority: Major
>
> This should be avoided, because it's prone to maintenance and security 
> problems.
> It's easy to get rid of it: the class 
> {{o.a.t.parser.geo.topic.gazetteer.GeoGazetteerClient}} uses 
> {{{}com.google.common.reflect.TypeToken{}}}. Since the project uses gson 
> anyway, it could just be replaced with 
> {{{}com.google.gson.reflect.TypeToken{}}}.





[jira] [Commented] (TIKA-4236) tika-parser-nlp-module has an unnecessary Guava dependency

2024-04-05 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17834277#comment-17834277
 ] 

Tilman Hausherr commented on TIKA-4236:
---

Is this what you had in mind?
https://github.com/apache/tika/commit/1586e7281837850cfe36d6201d905edb2d77241f
https://github.com/apache/tika/commit/920976d9f26695892ab503b8c93b213d9622d18e
This is for trunk; I can also do it for 2.9.3, but 1.* is no longer 
supported.

> tika-parser-nlp-module has an unnecessary Guava dependency
> --
>
> Key: TIKA-4236
> URL: https://issues.apache.org/jira/browse/TIKA-4236
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.28.5, 3.0.0-BETA, 2.9.2
>Reporter: Manfred Baedke
>Priority: Major
>
> This should be avoided, because it's prone to maintenance and security 
> problems.
> It's easy to get rid of it: the class 
> {{o.a.t.parser.geo.topic.gazetteer.GeoGazetteerClient}} uses 
> {{{}com.google.common.reflect.TypeToken{}}}. Since the project uses gson 
> anyway, it could just be replaced with 
> {{{}com.google.gson.reflect.TypeToken{}}}.





[jira] [Commented] (TIKA-4231) Parsing Arabic PDF is returning bad data

2024-04-03 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833807#comment-17833807
 ] 

Tilman Hausherr commented on TIKA-4231:
---

Yes, it is text, but the PDF uses a feature that we don't support. Instead 
of mapping each glyph to its own Unicode value, it provides the text for 
extraction on a separate level.

> Parsing Arabic PDF is returning bad data
> 
>
> Key: TIKA-4231
> URL: https://issues.apache.org/jira/browse/TIKA-4231
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 2.6.0, 2.9.1
> Environment: I am using Java 18. And using maven dependency 
> tika-parsers-standard-package 
> ([https://mvnrepository.com/artifact/org.apache.tika/tika-parsers/2.6.0)]
>  
>Reporter: Aamir
>Priority: Major
> Attachments: arabic-pdfbox.txt, arabic.pdf, arabic.txt
>
>
> Attached is a PDF with arabic text in it. 
> When parsed using tika version 2.6.0 or 2.9.1, it produces gibberish 
> characters. 
> The generated text doc is also attached which contains the parsed text. 
> Most of the other Arabic PDFs parse fine, but this one is giving this output. 





[jira] [Commented] (TIKA-4231) Parsing Arabic PDF is returning bad data

2024-04-02 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833385#comment-17833385
 ] 

Tilman Hausherr commented on TIKA-4231:
---

No, this is not being worked on. You'll have to use OCR.

> Parsing Arabic PDF is returning bad data
> 
>
> Key: TIKA-4231
> URL: https://issues.apache.org/jira/browse/TIKA-4231
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 2.6.0, 2.9.1
> Environment: I am using Java 18. And using maven dependency 
> tika-parsers-standard-package 
> ([https://mvnrepository.com/artifact/org.apache.tika/tika-parsers/2.6.0)]
>  
>Reporter: Aamir
>Priority: Major
> Attachments: arabic-pdfbox.txt, arabic.pdf, arabic.txt
>
>
> Attached is a PDF with arabic text in it. 
> When parsed using tika version 2.6.0 or 2.9.1, it produces gibberish 
> characters. 
> The generated text doc is also attached which contains the parsed text. 
> Most of the other Arabic PDFs parse fine, but this one is giving this output. 





[jira] [Commented] (TIKA-4231) Parsing Arabic PDF is returning bad data

2024-03-29 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832291#comment-17832291
 ] 

Tilman Hausherr commented on TIKA-4231:
---

I have attached an extraction with PDFBox 2.0.31: [^arabic-pdfbox.txt] 
Is this better or not? I've added a BOM and removed the 00 bytes. In the Tika 
extraction there are many "ef bf bd" bytes instead, which is the UTF-8 encoding 
of the replacement character �.
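
As a small illustration of the byte pattern mentioned above, and assuming nothing about Tika or PDFBox internals, the following JDK-only sketch shows that U+FFFD encodes to "ef bf bd" in UTF-8 and that the JDK substitutes U+FFFD when it decodes a byte that is not valid UTF-8. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not Tika or PDFBox code): inspects the UTF-8 form of
// the Unicode replacement character U+FFFD.
class ReplacementCharBytes {

    // Render bytes as lowercase hex pairs separated by single spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Decode bytes as UTF-8; the String constructor replaces malformed input with U+FFFD.
    static String decodeLossy(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(hex("\uFFFD".getBytes(StandardCharsets.UTF_8)));
        System.out.println(decodeLossy(new byte[] {(byte) 0xff}).equals("\uFFFD"));
    }
}
```

Runs of "ef bf bd" in an output file therefore usually mean that the original characters were lost before the text was written. They cannot be recovered from the output alone.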

A possible explanation why Adobe Reader works better is that this file uses the 
"ActualText"-feature which PDFBox doesn't support (PDFBOX-3248).

> Parsing Arabic PDF is returning bad data
> 
>
> Key: TIKA-4231
> URL: https://issues.apache.org/jira/browse/TIKA-4231
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 2.6.0, 2.9.1
> Environment: I am using Java 18. And using maven dependency 
> tika-parsers-standard-package 
> ([https://mvnrepository.com/artifact/org.apache.tika/tika-parsers/2.6.0)]
>  
>Reporter: Aamir
>Priority: Major
> Attachments: arabic-pdfbox.txt, arabic.pdf, arabic.txt
>
>
> Attached is a PDF with arabic text in it. 
> When parsed using tika version 2.6.0 or 2.9.1, it produces gibberish 
> characters. 
> The generated text doc is also attached which contains the parsed text. 
> Most of the other Arabic PDFs parse fine, but this one is giving this output. 





[jira] [Updated] (TIKA-4231) Parsing Arabic PDF is returning bad data

2024-03-29 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4231:
--
Attachment: arabic-pdfbox.txt

> Parsing Arabic PDF is returning bad data
> 
>
> Key: TIKA-4231
> URL: https://issues.apache.org/jira/browse/TIKA-4231
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 2.6.0, 2.9.1
> Environment: I am using Java 18. And using maven dependency 
> tika-parsers-standard-package 
> ([https://mvnrepository.com/artifact/org.apache.tika/tika-parsers/2.6.0)]
>  
>Reporter: Aamir
>Priority: Major
> Attachments: arabic-pdfbox.txt, arabic.pdf, arabic.txt
>
>
> Attached is a PDF with arabic text in it. 
> When parsed using tika version 2.6.0 or 2.9.1, it produces gibberish 
> characters. 
> The generated text doc is also attached which contains the parsed text. 
> Most of the other Arabic PDFs parse fine, but this one is giving this output. 





[jira] [Commented] (TIKA-4231) Parsing Arabic PDF is returning bad data

2024-03-29 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832284#comment-17832284
 ] 

Tilman Hausherr commented on TIKA-4231:
---

This doesn't change my argument. The latest version is 2.9.1, please try with 
that one.

> Parsing Arabic PDF is returning bad data
> 
>
> Key: TIKA-4231
> URL: https://issues.apache.org/jira/browse/TIKA-4231
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 2.6.0
> Environment: I am using Java 18. And using maven dependency 
> tika-parsers-standard-package 
> ([https://mvnrepository.com/artifact/org.apache.tika/tika-parsers/2.6.0)]
>  
>Reporter: Aamir
>Priority: Major
> Attachments: arabic.pdf, arabic.txt
>
>
> Attached is a PDF with arabic text in it. 
> When parsed using tika version 2.6.0, it produces gibberish characters. 
> The generated text doc is also attached which contains the parsed text. 
> Most of the other Arabic PDFs parse fine, but this one is giving this output. 





[jira] [Commented] (TIKA-4231) Parsing Arabic PDF is returning bad data

2024-03-29 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832258#comment-17832258
 ] 

Tilman Hausherr commented on TIKA-4231:
---

The current tika version is 2.9.1, soon to be 2.9.2. There is no "PDFBox 
version 2.6.0".

> Parsing Arabic PDF is returning bad data
> 
>
> Key: TIKA-4231
> URL: https://issues.apache.org/jira/browse/TIKA-4231
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 2.6.0
> Environment: I am using Java 18. And using maven dependency 
> tika-parsers-standard-package 
> ([https://mvnrepository.com/artifact/org.apache.tika/tika-parsers/2.6.0)]
>  
>Reporter: Aamir
>Priority: Major
> Attachments: arabic.pdf, arabic.txt
>
>
> Attached is a PDF with arabic text in it. 
> When parsed using PDFBox version 2.6.0, it produces gibberish characters. 
> The generated text doc is also attached which contains the parsed text. 
> Most of the other Arabic PDFs parse fine, but this one is giving this output. 





[jira] [Updated] (TIKA-4228) Tika parser crashes JVM when it gets metadata and embedded objects from pdf

2024-03-27 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4228:
--
Affects Version/s: 2.9.0

> Tika parser crashes JVM when it gets metadata and embedded objects from pdf
> ---
>
> Key: TIKA-4228
> URL: https://issues.apache.org/jira/browse/TIKA-4228
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 2.9.0
>Reporter: Xiaohong Yang
>Priority: Major
> Attachments: tika-config-and-sample-file.zip
>
>
> [^tika-config-and-sample-file.zip]
>  
> We use org.apache.tika.parser.AutoDetectParser to get metadata and embedded 
> objects from pdf documents.  And we found out that it crashes the program (or 
> the JVM) when it gets metadata and embedded files from the sample pdf file.
>  
> Following is the sample code and attached is the tika-config.xml and the 
> sample pdf file. Note that the sample file crashes the JVM in 1 out of 10 
> runs in our production environment.  Sometimes it happens when it gets 
> metadata and sometimes it happens when it extracts embedded files (the 
> chances are about 50/50).
>  
> The operating system is Ubuntu 20.04. Java version is 21.  Tika version is 
> 2.9.0 and POI version is 5.2.3.   
>  
>  
> import org.apache.pdfbox.io.IOUtils;
> import org.apache.poi.poifs.filesystem.DirectoryEntry;
> import org.apache.poi.poifs.filesystem.DocumentEntry;
> import org.apache.poi.poifs.filesystem.DocumentInputStream;
> import org.apache.poi.poifs.filesystem.POIFSFileSystem;
> import org.apache.tika.config.TikaConfig;
> import org.apache.tika.detect.Detector;
> import org.apache.tika.extractor.EmbeddedDocumentExtractor;
> import org.apache.tika.io.FilenameUtils;
> import org.apache.tika.io.TikaInputStream;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.metadata.TikaCoreProperties;
> import org.apache.tika.mime.MediaType;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.apache.tika.sax.BodyContentHandler;
> import org.xml.sax.ContentHandler;
> import org.xml.sax.SAXException;
> import org.xml.sax.helpers.DefaultHandler;
>  
> import java.io.*;
> import java.net.URL;
> import java.nio.file.Files;
> import java.nio.file.Path;
> import java.nio.file.Paths;
>  
> public class ProcessPdf {
>     private final Path inputFile = new 
> File("/home/ubuntu/testdirs/testdir_pdf/sample.pdf").toPath();
>     private final Path outputDir = new 
> File("/home/ubuntu/testdirs/testdir_pdf/tika_output/").toPath();
>  
>     private Parser parser;
>     private ParseContext context;
>  
>  
>     public static void main(String args[]) {
>     try
>     {
>         System.out.println("Start");
>         ProcessPdf processPdf = new ProcessPdf();
>         System.out.println("Get metadata");
>         processPdf.getMataData();
>         System.out.println("Extract embedded files");
>         processPdf.extract();
>         System.out.println("End");
>     }
>     catch(Exception ex)
>     {
>         ex.printStackTrace();
>     }
>     }
>  
>     public ProcessPdf()
> {     }
>  
>     public void getMataData() throws Exception {
>     BodyContentHandler handler = new BodyContentHandler(-1);
>  
>     Metadata metadata = new Metadata();
>     try (FileInputStream inputData = new 
> FileInputStream(inputFile.toString()))
>     {
>         TikaConfig config = new TikaConfig("/home/ubuntu/testdirs/testdir_pdf/tika-config.xml");
>         Parser autoDetectParser = new AutoDetectParser(config);
>         ParseContext context = new ParseContext();
>         context.set(TikaConfig.class, config);
>         autoDetectParser.parse(inputData, handler, metadata, context);
>     }
>  
>     String content = handler.toString();
>     }
>  
>     public void extract() throws Exception {
>     TikaConfig config = new 
> TikaConfig("/home/ubuntu/testdirs/testdir_pdf/tika-config.xml");
>     ProcessPdf.FileEmbeddedDocumentExtractor 
> fileEmbeddedDocumentExtractor = new 
> ProcessPdf.FileEmbeddedDocumentExtractor();
>  
>     parser = new AutoDetectParser(config);
>     context = new ParseContext();
>     context.set(Parser.class, parser);
>     context.set(TikaConfig.class, config);
>     context.set(EmbeddedDocumentExtractor.class, 
> fileEmbeddedDocumentExtractor);
>  
>     URL url = inputFile.toUri().toURL();
>     Metadata metadata = new Metadata();
>     try (InputStream input = TikaInputStream.get(url, metadata))
>     {
>         ContentHandler handler = new DefaultHandler();
>         parser.parse(input, handler, metadata, context);
>     }
>     }
>  
>     private class 

[jira] [Commented] (TIKA-4218) Run regression tests to support 2.9.2 release

2024-03-26 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830954#comment-17830954
 ] 

Tilman Hausherr commented on TIKA-4218:
---

6FOMNUPGPA6IG66Z4NIUEQIVOR5ON46Q (an MP4 file) has a loss of metadata 
(bierenbach: 2 | earlier: 2 | https://www.facebook.com/speedlinecablecam: 2 | 
https://www.speedline-cablecam.com: 2 | in: 2 | of: 2 | the: 2 | this: 2 | 
woods: 2 | year: 2)

EEXR753OKDGYAIXL36PZ2EGYPN477SZU and a few other files have one word in 
TOP_10_MORE_IN_A which reappears in TOP_10_MORE_IN_B but with "oebps". Here, 
"secretary" becomes "secretaryoebps". I don't know if this is a bug or not.

> Run regression tests to support 2.9.2 release
> -
>
> Key: TIKA-4218
> URL: https://issues.apache.org/jira/browse/TIKA-4218
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
> Attachments: 2.9.1-876503.pdf.json, 2.9.2-876503.pdf.json
>
>






[jira] [Commented] (TIKA-4218) Run regression tests to support 2.9.2 release

2024-03-25 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830604#comment-17830604
 ] 

Tilman Hausherr commented on TIKA-4218:
---

To be honest, I didn't look further because these problems affected too many 
files. Yes, please rerun the test so that whatever remains stands out.

> Run regression tests to support 2.9.2 release
> -
>
> Key: TIKA-4218
> URL: https://issues.apache.org/jira/browse/TIKA-4218
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
> Attachments: 2.9.1-876503.pdf.json, 2.9.2-876503.pdf.json
>
>






[jira] [Comment Edited] (TIKA-4171) Tika server only returns last value for PDFs that have multiple of the same key

2024-03-23 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830110#comment-17830110
 ] 

Tilman Hausherr edited comment on TIKA-4171 at 3/23/24 5:50 PM:


We have a regression with the file [^876503.pdf] in the XFAExtractor class. 
What happens is that {{displayFieldName}} is now lost if {{fieldValues}} is 
empty. Because of that, the text "Enter the full name of the conveying party or 
parties" is missing for the field "conname1".

I'm not saying that this is wrong, I just wonder if this is intended.


was (Author: tilman):
We have a regression with the file [^876503.pdf] in the XFAExtractor class. 
What happens is that {{displayFieldName}} is now lost if {{fieldValues}} is 
empty. Because of that, the text "Enter the full name of the conveying party or 
parties" is missing for field the "conname1".

I'm not saying that this is wrong, I just wonder if this is intended.

> Tika server only returns last value for PDFs that have multiple of the same 
> key
> ---
>
> Key: TIKA-4171
> URL: https://issues.apache.org/jira/browse/TIKA-4171
> Project: Tika
>  Issue Type: Bug
>  Components: tika-server
>Reporter: Cassandra Xia
>Priority: Major
> Fix For: 3.0.0-BETA, 2.9.2
>
> Attachments: 20230801-5207_QF20-270 East River Solar Form 556 recert 
> FINAL.pdf, 876503.pdf, example-output.txt, screenshot.png, 
> testPDF_XFA_govdocs1_258578.pdf.html
>
>
> Thanks for the great work on Tika server, it is the only OSS that can handle 
> Adobe's protected form format that FERC uses. 
> One problem that I'm hitting is that the FERC form that I am parsing has 
> multiple values for the same key name, e.g. in the screenshot below line 1-7 
> all have the same key name. When Tika Server parses this PDF, it only returns 
> the value in row 7 (losing the previous 6 values).
> My hunch is that somewhere in Tika Server, the values are getting stored in 
> some dictionary object, so the final value is the only survivor. Would it be 
> possible to return the extra values as a list from Tika Server? 
> Example PDF attached - thank you for taking a look!
> !https://mail.google.com/mail/u/0?ui=2=ee87dc4bd1=0.0.7=msg-f:1782641700487887488=18bd372e8760fa80=fimg=ip=s0-l75-ft=ANGjdJ9qEkw6kZ9yBDfMBOUuvFB1Tk8Pti0rRvReEq-eWUoJQxLA6rZ0TQvWCsKUySaDPjjrSi-IiyKseDYpFGzF44A3iSaFw9sOanoBdFMNEZciDnaGhsUFvLSIH_0=emb=ii_lmdun7ff6!





[jira] [Updated] (TIKA-4171) Tika server only returns last value for PDFs that have multiple of the same key

2024-03-23 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4171:
--
Attachment: testPDF_XFA_govdocs1_258578.pdf.html

> Tika server only returns last value for PDFs that have multiple of the same 
> key
> ---
>
> Key: TIKA-4171
> URL: https://issues.apache.org/jira/browse/TIKA-4171
> Project: Tika
>  Issue Type: Bug
>  Components: tika-server
>Reporter: Cassandra Xia
>Priority: Major
> Fix For: 3.0.0-BETA, 2.9.2
>
> Attachments: 20230801-5207_QF20-270 East River Solar Form 556 recert 
> FINAL.pdf, 876503.pdf, example-output.txt, screenshot.png, 
> testPDF_XFA_govdocs1_258578.pdf.html
>
>
> Thanks for the great work on Tika server, it is the only OSS that can handle 
> Adobe's protected form format that FERC uses. 
> One problem that I'm hitting is that the FERC form that I am parsing has 
> multiple values for the same key name, e.g. in the screenshot below line 1-7 
> all have the same key name. When Tika Server parses this PDF, it only returns 
> the value in row 7 (losing the previous 6 values).
> My hunch is that somewhere in Tika Server, the values are getting stored in 
> some dictionary object, so the final value is the only survivor. Would it be 
> possible to return the extra values as a list from Tika Server? 
> Example PDF attached - thank you for taking a look!
> !https://mail.google.com/mail/u/0?ui=2=ee87dc4bd1=0.0.7=msg-f:1782641700487887488=18bd372e8760fa80=fimg=ip=s0-l75-ft=ANGjdJ9qEkw6kZ9yBDfMBOUuvFB1Tk8Pti0rRvReEq-eWUoJQxLA6rZ0TQvWCsKUySaDPjjrSi-IiyKseDYpFGzF44A3iSaFw9sOanoBdFMNEZciDnaGhsUFvLSIH_0=emb=ii_lmdun7ff6!





[jira] [Commented] (TIKA-4171) Tika server only returns last value for PDFs that have multiple of the same key

2024-03-23 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830113#comment-17830113
 ] 

Tilman Hausherr commented on TIKA-4171:
---

Proposed change: add these 3 lines before the last one in this code segment
{code:java}
if (fieldValues.length == 0) {
    fieldValues = new String[]{""};
}
for (String fieldValue : fieldValues) {
{code}
This is the result of the file from testXFAExtractionBasic 
(testPDF_XFA_govdocs1_258578.pdf) which now has 27 fields instead of 24:  
[^testPDF_XFA_govdocs1_258578.pdf.html] 
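The effect of the proposed guard can be sketched in isolation. This is not the actual XFAExtractor code: the class {{EmptyFieldValuesSketch}}, the method {{render}} and the "name: value" output format are assumptions made for illustration. The point is only that substituting a single empty string for an empty array makes the loop body run once, so the display name is still emitted for a field without values.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the loop discussed above; not the Tika implementation.
final class EmptyFieldValuesSketch {

    /** Returns one output line per field value, and at least one line per field. */
    static List<String> render(String displayFieldName, String[] fieldValues) {
        // Proposed guard: treat "no values" as a single empty value
        if (fieldValues.length == 0) {
            fieldValues = new String[]{""};
        }
        List<String> lines = new ArrayList<>();
        for (String fieldValue : fieldValues) {
            // Assumed output format, for illustration only
            lines.add(displayFieldName + ": " + fieldValue);
        }
        return lines;
    }

    public static void main(String[] args) {
        // A field without values still produces one line carrying its display name
        System.out.println(render("Name of conveying party", new String[0]));
        // A field with several values produces one line per value
        System.out.println(render("key", new String[]{"a", "b"}));
    }
}
```

Without the guard, the first call would return an empty list, which matches the lost {{displayFieldName}} text described for 876503.pdf.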

> Tika server only returns last value for PDFs that have multiple of the same 
> key
> ---
>
> Key: TIKA-4171
> URL: https://issues.apache.org/jira/browse/TIKA-4171
> Project: Tika
>  Issue Type: Bug
>  Components: tika-server
>Reporter: Cassandra Xia
>Priority: Major
> Fix For: 3.0.0-BETA, 2.9.2
>
> Attachments: 20230801-5207_QF20-270 East River Solar Form 556 recert 
> FINAL.pdf, 876503.pdf, example-output.txt, screenshot.png, 
> testPDF_XFA_govdocs1_258578.pdf.html
>
>
> Thanks for the great work on Tika server, it is the only OSS that can handle 
> Adobe's protected form format that FERC uses. 
> One problem that I'm hitting is that the FERC form that I am parsing has 
> multiple values for the same key name, e.g. in the screenshot below line 1-7 
> all have the same key name. When Tika Server parses this PDF, it only returns 
> the value in row 7 (losing the previous 6 values).
> My hunch is that somewhere in Tika Server, the values are getting stored in 
> some dictionary object, so the final value is the only survivor. Would it be 
> possible to return the extra values as a list from Tika Server? 
> Example PDF attached - thank you for taking a look!
> !https://mail.google.com/mail/u/0?ui=2=ee87dc4bd1=0.0.7=msg-f:1782641700487887488=18bd372e8760fa80=fimg=ip=s0-l75-ft=ANGjdJ9qEkw6kZ9yBDfMBOUuvFB1Tk8Pti0rRvReEq-eWUoJQxLA6rZ0TQvWCsKUySaDPjjrSi-IiyKseDYpFGzF44A3iSaFw9sOanoBdFMNEZciDnaGhsUFvLSIH_0=emb=ii_lmdun7ff6!





[jira] [Commented] (TIKA-4171) Tika server only returns last value for PDFs that have multiple of the same key

2024-03-23 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830110#comment-17830110
 ] 

Tilman Hausherr commented on TIKA-4171:
---

We have a regression with the file [^876503.pdf] in the XFAExtractor class. 
What happens is that {{displayFieldName}} is now lost if {{fieldValues}} is 
empty. Because of that, the text "Enter the full name of the conveying party or 
parties" is missing for the field "conname1".

I'm not saying that this is wrong, I just wonder if this is intended.

> Tika server only returns last value for PDFs that have multiple of the same 
> key
> ---
>
> Key: TIKA-4171
> URL: https://issues.apache.org/jira/browse/TIKA-4171
> Project: Tika
>  Issue Type: Bug
>  Components: tika-server
>Reporter: Cassandra Xia
>Priority: Major
> Fix For: 3.0.0-BETA, 2.9.2
>
> Attachments: 20230801-5207_QF20-270 East River Solar Form 556 recert 
> FINAL.pdf, 876503.pdf, example-output.txt, screenshot.png
>
>
> Thanks for the great work on Tika server, it is the only OSS that can handle 
> Adobe's protected form format that FERC uses. 
> One problem that I'm hitting is that the FERC form that I am parsing has 
> multiple values for the same key name, e.g. in the screenshot below line 1-7 
> all have the same key name. When Tika Server parses this PDF, it only returns 
> the value in row 7 (losing the previous 6 values).
> My hunch is that somewhere in Tika Server, the values are getting stored in 
> some dictionary object, so the final value is the only survivor. Would it be 
> possible to return the extra values as a list from Tika Server? 
> Example PDF attached - thank you for taking a look!
> !https://mail.google.com/mail/u/0?ui=2=ee87dc4bd1=0.0.7=msg-f:1782641700487887488=18bd372e8760fa80=fimg=ip=s0-l75-ft=ANGjdJ9qEkw6kZ9yBDfMBOUuvFB1Tk8Pti0rRvReEq-eWUoJQxLA6rZ0TQvWCsKUySaDPjjrSi-IiyKseDYpFGzF44A3iSaFw9sOanoBdFMNEZciDnaGhsUFvLSIH_0=emb=ii_lmdun7ff6!





[jira] [Updated] (TIKA-4171) Tika server only returns last value for PDFs that have multiple of the same key

2024-03-23 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4171:
--
Attachment: 876503.pdf

> Tika server only returns last value for PDFs that have multiple of the same 
> key
> ---
>
> Key: TIKA-4171
> URL: https://issues.apache.org/jira/browse/TIKA-4171
> Project: Tika
>  Issue Type: Bug
>  Components: tika-server
>Reporter: Cassandra Xia
>Priority: Major
> Fix For: 3.0.0-BETA, 2.9.2
>
> Attachments: 20230801-5207_QF20-270 East River Solar Form 556 recert 
> FINAL.pdf, 876503.pdf, example-output.txt, screenshot.png
>
>
> Thanks for the great work on Tika server, it is the only OSS that can handle 
> Adobe's protected form format that FERC uses. 
> One problem that I'm hitting is that the FERC form that I am parsing has 
> multiple values for the same key name, e.g. in the screenshot below line 1-7 
> all have the same key name. When Tika Server parses this PDF, it only returns 
> the value in row 7 (losing the previous 6 values).
> My hunch is that somewhere in Tika Server, the values are getting stored in 
> some dictionary object, so the final value is the only survivor. Would it be 
> possible to return the extra values as a list from Tika Server? 
> Example PDF attached - thank you for taking a look!
> !https://mail.google.com/mail/u/0?ui=2=ee87dc4bd1=0.0.7=msg-f:1782641700487887488=18bd372e8760fa80=fimg=ip=s0-l75-ft=ANGjdJ9qEkw6kZ9yBDfMBOUuvFB1Tk8Pti0rRvReEq-eWUoJQxLA6rZ0TQvWCsKUySaDPjjrSi-IiyKseDYpFGzF44A3iSaFw9sOanoBdFMNEZciDnaGhsUFvLSIH_0=emb=ii_lmdun7ff6!





[jira] [Commented] (TIKA-4218) Run regression tests to support 2.9.2 release

2024-03-23 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830105#comment-17830105
 ] 

Tilman Hausherr commented on TIKA-4218:
---

Follow-up in TIKA-4171

> Run regression tests to support 2.9.2 release
> -
>
> Key: TIKA-4218
> URL: https://issues.apache.org/jira/browse/TIKA-4218
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
> Attachments: 2.9.1-876503.pdf.json, 2.9.2-876503.pdf.json
>
>






[jira] [Reopened] (TIKA-4171) Tika server only returns last value for PDFs that have multiple of the same key

2024-03-23 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr reopened TIKA-4171:
---

> Tika server only returns last value for PDFs that have multiple of the same 
> key
> ---
>
> Key: TIKA-4171
> URL: https://issues.apache.org/jira/browse/TIKA-4171
> Project: Tika
>  Issue Type: Bug
>  Components: tika-server
>Reporter: Cassandra Xia
>Priority: Major
> Fix For: 3.0.0-BETA, 2.9.2
>
> Attachments: 20230801-5207_QF20-270 East River Solar Form 556 recert 
> FINAL.pdf, example-output.txt, screenshot.png
>
>
> Thanks for the great work on Tika server, it is the only OSS that can handle 
> Adobe's protected form format that FERC uses. 
> One problem that I'm hitting is that the FERC form that I am parsing has 
> multiple values for the same key name, e.g. in the screenshot below line 1-7 
> all have the same key name. When Tika Server parses this PDF, it only returns 
> the value in row 7 (losing the previous 6 values).
> My hunch is that somewhere in Tika Server, the values are getting stored in 
> some dictionary object, so the final value is the only survivor. Would it be 
> possible to return the extra values as a list from Tika Server? 
> Example PDF attached - thank you for taking a look!
> !https://mail.google.com/mail/u/0?ui=2=ee87dc4bd1=0.0.7=msg-f:1782641700487887488=18bd372e8760fa80=fimg=ip=s0-l75-ft=ANGjdJ9qEkw6kZ9yBDfMBOUuvFB1Tk8Pti0rRvReEq-eWUoJQxLA6rZ0TQvWCsKUySaDPjjrSi-IiyKseDYpFGzF44A3iSaFw9sOanoBdFMNEZciDnaGhsUFvLSIH_0=emb=ii_lmdun7ff6!





[jira] [Commented] (TIKA-4218) Run regression tests to support 2.9.2 release

2024-03-23 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830097#comment-17830097
 ] 

Tilman Hausherr commented on TIKA-4218:
---

Confirmed. I reverted just that change, and then the text view is longer and 
ends with "Enter the total number of pages being submitted, including cover 
sheets, attachments, and documents:", whereas in the current 2.9.2 it ends with 
"disclosure to GSA shall not be used to make determinations about individuals."

> Run regression tests to support 2.9.2 release
> -
>
> Key: TIKA-4218
> URL: https://issues.apache.org/jira/browse/TIKA-4218
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
> Attachments: 2.9.1-876503.pdf.json, 2.9.2-876503.pdf.json
>
>






[jira] [Comment Edited] (TIKA-4218) Run regression tests to support 2.9.2 release

2024-03-23 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830094#comment-17830094
 ] 

Tilman Hausherr edited comment on TIKA-4218 at 3/23/24 3:59 PM:


Oops, or it's part of XFA, I just found it too. Maybe related to the changes in 
TIKA-4171 in XFAExtractor?


was (Author: tilman):
Oops, or it's part of XFA, I just found it too.

> Run regression tests to support 2.9.2 release
> -
>
> Key: TIKA-4218
> URL: https://issues.apache.org/jira/browse/TIKA-4218
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
> Attachments: 2.9.1-876503.pdf.json, 2.9.2-876503.pdf.json
>
>






[jira] [Commented] (TIKA-4218) Run regression tests to support 2.9.2 release

2024-03-23 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830094#comment-17830094
 ] 

Tilman Hausherr commented on TIKA-4218:
---

Oops, or it's part of XFA, I just found it too.

> Run regression tests to support 2.9.2 release
> -
>
> Key: TIKA-4218
> URL: https://issues.apache.org/jira/browse/TIKA-4218
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
> Attachments: 2.9.1-876503.pdf.json, 2.9.2-876503.pdf.json
>
>






[jira] [Commented] (TIKA-4218) Run regression tests to support 2.9.2 release

2024-03-23 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830093#comment-17830093
 ] 

Tilman Hausherr commented on TIKA-4218:
---

I found one difference: "Enter the full name of the conveying party or parties" 
is in 2.9.1 but not in 2.9.2, and in 2.9.1 it appears directly after the main 
text. This text is in the first field (below "Name of conveying party(ies):") 
as the /TU entry, which one can get with {{getAlternateFieldName()}}. PDFBox and 
the PDF specification consider this to be an "alternative field name", and Adobe 
Reader displays it as a popup when the mouse hovers there.

> Run regression tests to support 2.9.2 release
> -
>
> Key: TIKA-4218
> URL: https://issues.apache.org/jira/browse/TIKA-4218
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
> Attachments: 2.9.1-876503.pdf.json, 2.9.2-876503.pdf.json
>
>






[jira] [Commented] (TIKA-4218) Run regression tests to support 2.9.2 release

2024-03-23 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830079#comment-17830079
 ] 

Tilman Hausherr commented on TIKA-4218:
---

The word "party" appears 36 times in the JSON file, 18 times in my text 
extraction, but 62 times in the CSV file in the TOP_N_TOKENS_A row. The doubling 
in the JSON file is because of "xfa_content", but I don't understand the "62".

Thanks for mentioning the new list (I probably missed it), I'll adjust my 
scripts and use them the next time.

> Run regression tests to support 2.9.2 release
> -
>
> Key: TIKA-4218
> URL: https://issues.apache.org/jira/browse/TIKA-4218
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>






[jira] (TIKA-4218) Run regression tests to support 2.9.2 release

2024-03-23 Thread Tilman Hausherr (Jira)


[ https://issues.apache.org/jira/browse/TIKA-4218 ]


Tilman Hausherr deleted comment on TIKA-4218:
---

was (Author: tilman):
There are also improvements not in my own test results, e.g. the "FOP" pdf 
file. Either something went wrong with my test, or with yours.

> Run regression tests to support 2.9.2 release
> -
>
> Key: TIKA-4218
> URL: https://issues.apache.org/jira/browse/TIKA-4218
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>






[jira] [Commented] (TIKA-4218) Run regression tests to support 2.9.2 release

2024-03-23 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830071#comment-17830071
 ] 

Tilman Hausherr commented on TIKA-4218:
---

There are also improvements not in my own test results, e.g. the "FOP" pdf 
file. Either something went wrong with my test, or with yours.

> Run regression tests to support 2.9.2 release
> -
>
> Key: TIKA-4218
> URL: https://issues.apache.org/jira/browse/TIKA-4218
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>






[jira] [Commented] (TIKA-4218) Run regression tests to support 2.9.2 release

2024-03-23 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830069#comment-17830069
 ] 

Tilman Hausherr commented on TIKA-4218:
---

Weird indeed, 876503.pdf didn't appear in the PDFBox regression tests:
https://home.snafu.de/tilman/tmp/reports_pdfbox_2.0.30_vs_2.0.31.tar.xz

I think we did make one (harmless) last-minute change after the regression 
tests, so I just ran ExtractText with both versions and saw no difference.

> Run regression tests to support 2.9.2 release
> -
>
> Key: TIKA-4218
> URL: https://issues.apache.org/jira/browse/TIKA-4218
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>






[jira] [Updated] (TIKA-4206) Variation on Zip Bomb

2024-03-15 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4206:
--
Description: 
I see TIKA-216 which aims to prevent Zip bombs, but I'm seeing what looks like 
a bomb on 3.0.0 Beta. The zip bomb is a mime encoded attachment to an email, 
which may be why it isn't throwing an error.

On my machine attempting to extract text (-J) the process continues infinitely 
(or at least 10 hours, which is when I stopped it).

The actual file is embedded in a .gz file inside of an ARC file. However, 
extracting the attached .txt file produces the same error.

 

The original ARC file is at: 
[https://eotarchive.s3.amazonaws.com/crawl-data/EOT-2004/segments/NARA-004/warc/NARA-PEOT-2004-2004065521-04317-crawling-fast-c_NARA-PEOT-2004-2004101148-00173-crawling008.archive.org.arc.gz]

  was:
I see Tika-216 which aims to prevent Zip bombs, but I'm seeing what looks like 
a bomb on 3.0.0 Beta. The zip bomb is a mime encoded attachment to an email, 
which may be why it isn't throwing an error.

On my machine attempting to extract text (-J) the process continues infinitely 
(or at least 10 hours, which is when I stopped it).

The actual file is embedded in a .gz file inside of an ARC file. However, 
extracting the attached .txt file produces the same error.

 

The original ARC file is at: 
https://eotarchive.s3.amazonaws.com/crawl-data/EOT-2004/segments/NARA-004/warc/NARA-PEOT-2004-2004065521-04317-crawling-fast-c_NARA-PEOT-2004-2004101148-00173-crawling008.archive.org.arc.gz


> Variation on Zip Bomb
> -
>
> Key: TIKA-4206
> URL: https://issues.apache.org/jira/browse/TIKA-4206
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 3.0.0-BETA
>Reporter: Gregory Lepore
>Priority: Major
> Attachments: sample-42-mail-bomb.txt
>
>
> I see TIKA-216 which aims to prevent Zip bombs, but I'm seeing what looks 
> like a bomb on 3.0.0 Beta. The zip bomb is a mime encoded attachment to an 
> email, which may be why it isn't throwing an error.
> On my machine attempting to extract text (-J) the process continues 
> infinitely (or at least 10 hours, which is when I stopped it).
> The actual file is embedded in a .gz file inside of an ARC file. However, 
> extracting the attached .txt file produces the same error.
>  
> The original ARC file is at: 
> [https://eotarchive.s3.amazonaws.com/crawl-data/EOT-2004/segments/NARA-004/warc/NARA-PEOT-2004-2004065521-04317-crawling-fast-c_NARA-PEOT-2004-2004101148-00173-crawling008.archive.org.arc.gz]





[jira] [Closed] (TIKA-4214) Update apache compress in tika to 1.26+ for CVE-2024-26308.

2024-03-15 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr closed TIKA-4214.
-
Resolution: Duplicate

Duplicate of TIKA-4199.

> Update apache compress in tika to 1.26+ for CVE-2024-26308.
> ---
>
> Key: TIKA-4214
> URL: https://issues.apache.org/jira/browse/TIKA-4214
> Project: Tika
>  Issue Type: Bug
>Reporter: Dhoka Pramod
>Priority: Major
>
> Need fix for [CVE-2024-26308|https://nvd.nist.gov/vuln/detail/CVE-2024-26308] 
> in tika latest version.





[jira] [Commented] (TIKA-4199) commons-compress 1.26.0 breaks Apache Tika 2.9.1

2024-03-14 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17826996#comment-17826996
 ] 

Tilman Hausherr commented on TIKA-4199:
---

The original error you reported wasn't really a bug in Commons Compress but 
rather a behavior change: more bytes were read than Tika expected (see my first 
comment in COMPRESS-661). It led to several fixes in Tika.

> commons-compress 1.26.0 breaks Apache Tika 2.9.1
> 
>
> Key: TIKA-4199
> URL: https://issues.apache.org/jira/browse/TIKA-4199
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.9.1
>Reporter: Alexander Veit
>Assignee: Tilman Hausherr
>Priority: Major
> Fix For: 2.9.2, 3.0.0
>
>
> An update to commons-compress 1.26.0 to fix CVE-2024-25710 and CVE-2024-26308 
> breaks Tika.
>  
> For more information see https://issues.apache.org/jira/browse/COMPRESS-661.





[jira] (TIKA-4166) dependency updates for Tika 3.0

2024-03-09 Thread Tilman Hausherr (Jira)


[ https://issues.apache.org/jira/browse/TIKA-4166 ]


Tilman Hausherr deleted comment on TIKA-4166:
---

was (Author: tilman):
I've reverted it and will investigate / fix this later. Seems to be a problem 
with angus-activation.

> dependency updates for Tika 3.0
> ---
>
> Key: TIKA-4166
> URL: https://issues.apache.org/jira/browse/TIKA-4166
> Project: Tika
>  Issue Type: Task
>  Components: build
>Reporter: Tilman Hausherr
>Priority: Minor
> Fix For: 3.0.0-BETA
>
>
> Separate ticket for updates for 3.0, especially those not found by dependabot.





[jira] [Commented] (TIKA-4166) dependency updates for Tika 3.0

2024-03-09 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824953#comment-17824953
 ] 

Tilman Hausherr commented on TIKA-4166:
---

I've reverted it and will investigate / fix this later. Seems to be a problem 
with angus-activation.

> dependency updates for Tika 3.0
> ---
>
> Key: TIKA-4166
> URL: https://issues.apache.org/jira/browse/TIKA-4166
> Project: Tika
>  Issue Type: Task
>  Components: build
>Reporter: Tilman Hausherr
>Priority: Minor
> Fix For: 3.0.0-BETA
>
>
> Separate ticket for updates for 3.0, especially those not found by dependabot.





[jira] [Resolved] (TIKA-4199) commons-compress 1.26.0 breaks Apache Tika 2.9.1

2024-03-09 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved TIKA-4199.
---
Resolution: Fixed

Commons-Compress has been updated to 1.26.1. I have reverted the workaround and 
a change that wasn't helpful.

> commons-compress 1.26.0 breaks Apache Tika 2.9.1
> 
>
> Key: TIKA-4199
> URL: https://issues.apache.org/jira/browse/TIKA-4199
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.9.1
>Reporter: Alexander Veit
>Assignee: Tilman Hausherr
>Priority: Major
> Fix For: 2.9.2, 3.0.0
>
>
> An update to commons-compress 1.26.0 to fix CVE-2024-25710 and CVE-2024-26308 
> breaks Tika.
>  
> For more information see https://issues.apache.org/jira/browse/COMPRESS-661.





[jira] [Assigned] (TIKA-4199) commons-compress 1.26.0 breaks Apache Tika 2.9.1

2024-03-09 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr reassigned TIKA-4199:
-

Assignee: Tilman Hausherr

> commons-compress 1.26.0 breaks Apache Tika 2.9.1
> 
>
> Key: TIKA-4199
> URL: https://issues.apache.org/jira/browse/TIKA-4199
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.9.1
>Reporter: Alexander Veit
>Assignee: Tilman Hausherr
>Priority: Major
> Fix For: 2.9.2, 3.0.0
>
>
> An update to commons-compress 1.26.0 to fix CVE-2024-25710 and CVE-2024-26308 
> breaks Tika.
>  
> For more information see https://issues.apache.org/jira/browse/COMPRESS-661.





[jira] [Updated] (TIKA-4203) Add @deprecated annotation where needed

2024-02-24 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4203:
--
Fix Version/s: 3.0.0

> Add @deprecated annotation where needed
> ---
>
> Key: TIKA-4203
> URL: https://issues.apache.org/jira/browse/TIKA-4203
> Project: Tika
>  Issue Type: Task
>Affects Versions: 3.0.0
>Reporter: Tilman Hausherr
>Assignee: Tilman Hausherr
>Priority: Trivial
> Fix For: 3.0.0
>
>
> This is just to prevent my IDE from stopping from scrolling during the build.





[jira] [Updated] (TIKA-4203) Add @deprecated annotation where needed

2024-02-24 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4203:
--
Affects Version/s: 3.0.0

> Add @deprecated annotation where needed
> ---
>
> Key: TIKA-4203
> URL: https://issues.apache.org/jira/browse/TIKA-4203
> Project: Tika
>  Issue Type: Task
>Affects Versions: 3.0.0
>Reporter: Tilman Hausherr
>Assignee: Tilman Hausherr
>Priority: Trivial
>
> This is just to prevent my IDE from stopping from scrolling during the build.





[jira] [Resolved] (TIKA-4203) Add @deprecated annotation where needed

2024-02-24 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved TIKA-4203.
---
Resolution: Fixed

> Add @deprecated annotation where needed
> ---
>
> Key: TIKA-4203
> URL: https://issues.apache.org/jira/browse/TIKA-4203
> Project: Tika
>  Issue Type: Task
>Affects Versions: 3.0.0
>Reporter: Tilman Hausherr
>Assignee: Tilman Hausherr
>Priority: Trivial
> Fix For: 3.0.0
>
>
> This is just to prevent my IDE from stopping from scrolling during the build.





[jira] [Created] (TIKA-4203) Add @deprecated annotation where needed

2024-02-24 Thread Tilman Hausherr (Jira)
Tilman Hausherr created TIKA-4203:
-

 Summary: Add @deprecated annotation where needed
 Key: TIKA-4203
 URL: https://issues.apache.org/jira/browse/TIKA-4203
 Project: Tika
  Issue Type: Task
Reporter: Tilman Hausherr
Assignee: Tilman Hausherr


This is just to prevent my IDE from stopping from scrolling during the build.





[jira] [Updated] (TIKA-4199) commons-compress 1.26.0 breaks Apache Tika 2.9.1

2024-02-20 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4199:
--
Fix Version/s: 2.9.2
   3.0.0

> commons-compress 1.26.0 breaks Apache Tika 2.9.1
> 
>
> Key: TIKA-4199
> URL: https://issues.apache.org/jira/browse/TIKA-4199
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.9.1
>Reporter: Alexander Veit
>Priority: Major
> Fix For: 2.9.2, 3.0.0
>
>
> An update to commons-compress 1.26.0 to fix CVE-2024-25710 and CVE-2024-26308 
> breaks Tika.
>  
> For more information see https://issues.apache.org/jira/browse/COMPRESS-661.





[jira] [Commented] (TIKA-4199) commons-compress 1.26.0 breaks Apache Tika 2.9.1

2024-02-20 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17818937#comment-17818937
 ] 

Tilman Hausherr commented on TIKA-4199:
---

I tried another solution
{code:java}
if (archive.markSupported()) {
    archive = new ArchiveInputStreamWrapper(archive);
}
{code}
which also works. The wrapper delegates everything except markSupported. I'll 
wait a few days to see whether the Commons Compress people fix this. If not, 
I'll commit that solution.
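
A minimal sketch of what such a delegating wrapper could look like. This is an 
illustration only, not the class that was committed: the name NoMarkInputStream 
is hypothetical, it wraps a plain InputStream rather than an ArchiveInputStream, 
and it assumes the overridden markSupported simply returns false.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.InputStream;

// Hypothetical stand-in for the wrapper described above (name and base type are assumptions).
final class NoMarkInputStream extends FilterInputStream {

    NoMarkInputStream(InputStream in) {
        super(in);
    }

    // FilterInputStream forwards every call to the wrapped stream by default;
    // only this answer is changed (assumed here to be "false").
    @Override
    public boolean markSupported() {
        return false;
    }

    public static void main(String[] args) {
        InputStream raw = new ByteArrayInputStream(new byte[] {1, 2, 3});
        InputStream wrapped = new NoMarkInputStream(raw);
        System.out.println(raw.markSupported() + " -> " + wrapped.markSupported()); // true -> false
    }
}
```

Code that checks markSupported before calling mark and reset will then take its 
fallback path for the wrapped stream, while all reads still go to the original.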

> commons-compress 1.26.0 breaks Apache Tika 2.9.1
> 
>
> Key: TIKA-4199
> URL: https://issues.apache.org/jira/browse/TIKA-4199
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.9.1
>Reporter: Alexander Veit
>Priority: Major
>
> An update to commons-compress 1.26.0 to fix CVE-2024-25710 and CVE-2024-26308 
> breaks Tika.
>  
> For more information see https://issues.apache.org/jira/browse/COMPRESS-661.





[jira] [Commented] (TIKA-4201) Add hard limit to stream reading in IWorksParser#detectType

2024-02-20 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17818873#comment-17818873
 ] 

Tilman Hausherr commented on TIKA-4201:
---

Yeah, makes sense.

> Add hard limit to stream reading in IWorksParser#detectType
> ---
>
> Key: TIKA-4201
> URL: https://issues.apache.org/jira/browse/TIKA-4201
> Project: Tika
>  Issue Type: Bug
>Reporter: Tim Allison
>Priority: Minor
>
> TIKA-4199 showed us that we had been relying on hope that the detector would 
> only read a limited number of bytes when detecting in 
> IWorksParser#detectType. We should cache the first 1096 bytes, do the 
> detection, and then reset the wrapped archive stream. 
> We should not rely on hope.
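
The approach described in the issue (cache a bounded prefix, detect on it, then 
reset) can be sketched roughly as follows. This is not the code in 
IWorksParser: BoundedPrefix and peek are hypothetical names, and the sketch 
assumes the stream supports mark/reset. Only the 1096-byte limit comes from the 
issue text.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical helper, not Tika's actual code: peek at a bounded prefix of a stream.
final class BoundedPrefix {

    // Limit taken from the issue text; any other bound works the same way.
    static final int LIMIT = 1096;

    // Reads up to 'limit' bytes, then rewinds the stream to where it started.
    static byte[] peek(InputStream in, int limit) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(limit);
        try {
            byte[] buf = new byte[limit];
            int total = 0;
            while (total < limit) {
                int n = in.read(buf, total, limit - total);
                if (n < 0) {
                    break; // end of stream before the limit was reached
                }
                total += n;
            }
            return Arrays.copyOf(buf, total);
        } finally {
            in.reset();
        }
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(new byte[5000]));
        byte[] prefix = peek(in, LIMIT);
        // A detector handed this in-memory copy cannot read past the limit.
        InputStream forDetector = new ByteArrayInputStream(prefix);
        System.out.println(prefix.length + " bytes cached, "
                + forDetector.available() + " visible to the detector");
    }
}
```

Because detection runs on the in-memory copy, the bound holds regardless of how 
many bytes the detector tries to read.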





[jira] [Comment Edited] (TIKA-4199) commons-compress 1.26.0 breaks Apache Tika 2.9.1

2024-02-20 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17818867#comment-17818867
 ] 

Tilman Hausherr edited comment on TIKA-4199 at 2/20/24 3:37 PM:


{quote}I'm not declaring this a problem with commons-compress!
{quote}
My bet was 51% that it's with Tika, but from the latest test code you inspired 
me to write in COMPRESS-661, it might be them or BufferedInputStream itself, 
or a misunderstanding of how BufferedInputStream works.

I also found another incomplete delegate class (BoundedInputStream), I'll 
complete that one too.


was (Author: tilman):
{quote}I'm not declaring this a problem with commons-compress!
{quote}
My bet was 51% it's with Tika but from the latest test code you inspired me to 
write in COMPRESS-661, it might be them or BufferedInputStream itself.

I also found another incomplete delegate class (BoundedInputStream), I'll 
complete that one too.

> commons-compress 1.26.0 breaks Apache Tika 2.9.1
> 
>
> Key: TIKA-4199
> URL: https://issues.apache.org/jira/browse/TIKA-4199
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.9.1
>Reporter: Alexander Veit
>Priority: Major
>
> An update to commons-compress 1.26.0 to fix CVE-2024-25710 and CVE-2024-26308 
> breaks Tika.
>  
> For more information see https://issues.apache.org/jira/browse/COMPRESS-661.





[jira] [Commented] (TIKA-4199) commons-compress 1.26.0 breaks Apache Tika 2.9.1

2024-02-20 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17818867#comment-17818867
 ] 

Tilman Hausherr commented on TIKA-4199:
---

{quote}I'm not declaring this a problem with commons-compress!
{quote}
My bet was 51% it's with Tika but from the latest test code you inspired me to 
write in COMPRESS-661, it might be them or BufferedInputStream itself.

I also found another incomplete delegate class (BoundedInputStream), I'll 
complete that one too.

> commons-compress 1.26.0 breaks Apache Tika 2.9.1
> 
>
> Key: TIKA-4199
> URL: https://issues.apache.org/jira/browse/TIKA-4199
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.9.1
>Reporter: Alexander Veit
>Priority: Major
>
> An update to commons-compress 1.26.0 to fix CVE-2024-25710 and CVE-2024-26308 
> breaks Tika.
>  
> For more information see https://issues.apache.org/jira/browse/COMPRESS-661.





[jira] [Commented] (TIKA-4199) commons-compress 1.26.0 breaks Apache Tika 2.9.1

2024-02-20 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17818823#comment-17818823
 ] 

Tilman Hausherr commented on TIKA-4199:
---

After merging I discovered that the SevenZWrapper class is incomplete 
(markSupported / mark / reset were missing, along with many other methods). I 
tested reverting my one-line change, and some of the previously failing tests 
(e.g. the 7z tests) now succeeded. This suggests that the cause is related to 
markSupported / mark / reset. If we ever find that cause, the one-line change 
in {{PackageParser}} can be removed, because it makes things slower.

> commons-compress 1.26.0 breaks Apache Tika 2.9.1
> 
>
> Key: TIKA-4199
> URL: https://issues.apache.org/jira/browse/TIKA-4199
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.9.1
>Reporter: Alexander Veit
>Priority: Major
>
> An update to commons-compress 1.26.0 to fix CVE-2024-25710 and CVE-2024-26308 
> breaks Tika.
>  
> For more information see https://issues.apache.org/jira/browse/COMPRESS-661.





[jira] [Closed] (TIKA-4200) Fix broken build after upgrade to commons-compress

2024-02-20 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr closed TIKA-4200.
-
Resolution: Duplicate

Our CI is failing because of the CVE :-( Duplicate of TIKA-4199. I'm still 
working on it and I'm somewhat more optimistic than yesterday.

> Fix broken build after upgrade to commons-compress
> --
>
> Key: TIKA-4200
> URL: https://issues.apache.org/jira/browse/TIKA-4200
> Project: Tika
>  Issue Type: Bug
>Reporter: Tim Allison
>Priority: Major
>
> https://github.com/apache/tika/actions/runs/7955214068
> Looks like builds are failing after upgrade to commons compress. Apple (zip) 
> files are causing problems?! We should look into this and either fix our code 
> or, if there's a problem with commons-compress, identify it, report it and 
> downgrade to the last working version.





[jira] [Comment Edited] (TIKA-4199) commons-compress 1.26.0 breaks Apache Tika 2.9.1

2024-02-20 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17818774#comment-17818774
 ] 

Tilman Hausherr edited comment on TIKA-4199 at 2/20/24 11:57 AM:
-

I'm working on it

https://github.com/apache/tika/pull/1605


was (Author: tilman):
I'm working on it

https://github.com/apache/pdfbox/pull/180

> commons-compress 1.26.0 breaks Apache Tika 2.9.1
> 
>
> Key: TIKA-4199
> URL: https://issues.apache.org/jira/browse/TIKA-4199
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.9.1
>Reporter: Alexander Veit
>Priority: Major
>
> An update to commons-compress 1.26.0 to fix CVE-2024-25710 and CVE-2024-26308 
> breaks Tika.
>  
> For more information see https://issues.apache.org/jira/browse/COMPRESS-661.





[jira] [Commented] (TIKA-4199) commons-compress 1.26.0 breaks Apache Tika 2.9.1

2024-02-20 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17818774#comment-17818774
 ] 

Tilman Hausherr commented on TIKA-4199:
---

I'm working on it

https://github.com/apache/pdfbox/pull/180

> commons-compress 1.26.0 breaks Apache Tika 2.9.1
> 
>
> Key: TIKA-4199
> URL: https://issues.apache.org/jira/browse/TIKA-4199
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.9.1
>Reporter: Alexander Veit
>Priority: Major
>
> An update to commons-compress 1.26.0 to fix CVE-2024-25710 and CVE-2024-26308 
> breaks Tika.
>  
> For more information see https://issues.apache.org/jira/browse/COMPRESS-661.





[jira] [Updated] (TIKA-3841) An exception occurred when parsing some word documents using tika, tika_exception

2024-02-09 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-3841:
--
Summary: An exception occurred when parsing some word documents using tika, 
tika_exception  (was: An exception occurred when parsing some word documents 
using tikatika_exception)

> An exception occurred when parsing some word documents using tika, 
> tika_exception
> -
>
> Key: TIKA-3841
> URL: https://issues.apache.org/jira/browse/TIKA-3841
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.24, 2.4.1, 1.28.4
> Environment: h3. Java Version
> java version "1.8.0_291"
> h3. OS Version
> Linux localhost.localdomain 3.10.0-957.el7.x86_64 #1 SMP Thu Nov 8 23:39:32 
> UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: lxz
>Priority: Blocker
>
> {
>   "error": {
>     "root_cause": [
>       {
>         "type": "parse_exception",
>         "reason": "Error parsing document in field [content]"
>       }
>     ],
>     "type": "parse_exception",
>     "reason": "Error parsing document in field [content]",
>     "caused_by": {
>       "type": "tika_exception",
>       "reason": "Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@3b5e180a",
>       "caused_by": {
>         "type": "array_index_out_of_bounds_exception",
>         "reason": "351"
>       }
>     }
>   },
>   "status": 400
> }





[jira] [Updated] (TIKA-3841) An exception occurred when parsing some word documents using tikatika_exception

2024-02-09 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-3841:
--
Summary: An exception occurred when parsing some word documents using 
tikatika_exception  (was: 使用tika解析部分word文档出现异常,tika_exception)

> An exception occurred when parsing some word documents using 
> tikatika_exception
> ---
>
> Key: TIKA-3841
> URL: https://issues.apache.org/jira/browse/TIKA-3841
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.24, 2.4.1, 1.28.4
> Environment: h3. Java Version
> java version "1.8.0_291"
> h3. OS Version
> Linux localhost.localdomain 3.10.0-957.el7.x86_64 #1 SMP Thu Nov 8 23:39:32 
> UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: lxz
>Priority: Blocker
>
> {
>   "error": {
>     "root_cause": [
>       {
>         "type": "parse_exception",
>         "reason": "Error parsing document in field [content]"
>       }
>     ],
>     "type": "parse_exception",
>     "reason": "Error parsing document in field [content]",
>     "caused_by": {
>       "type": "tika_exception",
>       "reason": "Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@3b5e180a",
>       "caused_by": {
>         "type": "array_index_out_of_bounds_exception",
>         "reason": "351"
>       }
>     }
>   },
>   "status": 400
> }





[jira] [Closed] (TIKA-4183) Update jackson-databind jar to 2.16.0 or higher (CVE-2023-35116)

2024-01-22 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr closed TIKA-4183.
-
Resolution: Duplicate

Duplicate of TIKA-4162. It was done there on 17.11.2023 in 
a8c3879a59a4a8ccb8522518a286ebef442c0f4c; the commits are all missing for some 
reason. It will be in 2.9.2. No, I don't know when that will be released.

> Update jackson-databind jar to 2.16.0 or higher (CVE-2023-35116)
> 
>
> Key: TIKA-4183
> URL: https://issues.apache.org/jira/browse/TIKA-4183
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 2.9.1
>Reporter: Dhoka Pramod
>Priority: Major
>
> Latest stable tika version 2.9.1 (in tika eval app) still has 
> jackson-databind-2.15.2.
> It needs to be updated to 2.16.0 or higher to address 
> [https://nvd.nist.gov/vuln/detail/CVE-2023-35116]





[jira] [Updated] (TIKA-4162) Update to 2.9.2

2024-01-22 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4162:
--
Fix Version/s: 2.9.2

> Update to 2.9.2
> ---
>
> Key: TIKA-4162
> URL: https://issues.apache.org/jira/browse/TIKA-4162
> Project: Tika
>  Issue Type: Task
>  Components: build
>Affects Versions: 2.9.1
>Reporter: Tilman Hausherr
>Priority: Minor
> Fix For: 2.9.2
>
>






[jira] [Updated] (TIKA-4162) Update to 2.9.2

2023-12-27 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4162:
--
Affects Version/s: 2.9.1

> Update to 2.9.2
> ---
>
> Key: TIKA-4162
> URL: https://issues.apache.org/jira/browse/TIKA-4162
> Project: Tika
>  Issue Type: Task
>  Components: build
>Affects Versions: 2.9.1
>Reporter: Tilman Hausherr
>Priority: Minor
>






[jira] [Closed] (TIKA-4172) Apple binary file incorrectly identified as text/x-sql due to filename

2023-12-16 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr closed TIKA-4172.
-
Resolution: Not A Bug

> Apple binary file incorrectly identified as text/x-sql due to filename
> --
>
> Key: TIKA-4172
> URL: https://issues.apache.org/jira/browse/TIKA-4172
> Project: Tika
>  Issue Type: Bug
>  Components: general
>Affects Versions: 2.9.1
>Reporter: martin k.
>Priority: Minor
>
> This is related to [https://github.com/eikek/docspell/issues/2376] and 
> [https://github.com/eikek/docspell/issues/2403.]
> Take the following Base64 encoding of a binary Apple-generated file. No idea 
> what it does. You can get the file by piping the following to e.g. {{base64 
> -d > something.sql}}
> {code:java}
> ABRkMDEwMWM2Nl9teVNRTDQwLnNxbAAA
> bUJJTgAA
> AACCgf+/AAA=
> {code}
> If this file is named {{something.sql}}, then Tika will classify it as 
> {{text/x-sql}}, which it is not. It seems that more weight is given to 
> the filename (extension) than to the fact that the file is binary.





[jira] [Commented] (TIKA-4173) Fix dev version in main branch

2023-12-13 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796450#comment-17796450
 ] 

Tilman Hausherr commented on TIKA-4173:
---

It wasn't really a problem locally; I only had to change one line in one file 
and was able to test all updates.

> Fix dev version in main branch
> --
>
> Key: TIKA-4173
> URL: https://issues.apache.org/jira/browse/TIKA-4173
> Project: Tika
>  Issue Type: Bug
>Reporter: Tim Allison
>Priority: Major
> Fix For: 3.0.0
>
>
> For some reason, and I should have caught this earlier, it looks like the 
> 3.0.0-SNAPSHOT version was not re-applied to the main branch after the 
> release. We need to fix this.





[jira] [Commented] (TIKA-4173) Fix dev version in main branch

2023-12-13 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796431#comment-17796431
 ] 

Tilman Hausherr commented on TIKA-4173:
---

I noticed that it didn't have the correct version, but I thought that was 
something just before the release.

> Fix dev version in main branch
> --
>
> Key: TIKA-4173
> URL: https://issues.apache.org/jira/browse/TIKA-4173
> Project: Tika
>  Issue Type: Bug
>Reporter: Tim Allison
>Priority: Major
>
> For some reason, and I should have caught this earlier, it looks like the 
> 3.0.0-SNAPSHOT version was not re-applied to the main branch after the 
> release. We need to fix this.





[jira] [Commented] (TIKA-4172) Apple binary file incorrectly identified as text/x-sql due to filename

2023-11-25 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17789647#comment-17789647
 ] 

Tilman Hausherr commented on TIKA-4172:
---


Your file starts with 00 14 64 30.

See also https://www.iana.org/assignments/media-types/application/applefile

No, I don't agree: what is a "binary" file, after all? There is no fixed 
definition for this; it's just a file that hasn't been classified.
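
The leading bytes quoted above can be checked by decoding the first Base64 line 
from the issue description. A small sketch using only the JDK; HeaderPeek and 
firstBytesHex are hypothetical names introduced for this illustration.

```java
import java.util.Base64;

// Hypothetical helper: show the first bytes of a Base64-encoded sample as hex.
final class HeaderPeek {

    static String firstBytesHex(String base64, int count) {
        byte[] bytes = Base64.getDecoder().decode(base64);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < Math.min(count, bytes.length); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", bytes[i] & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // First line of the Base64 sample given in the issue description.
        String firstLine = "ABRkMDEwMWM2Nl9teVNRTDQwLnNxbAAA";
        System.out.println(firstBytesHex(firstLine, 4)); // prints "00 14 64 30"
    }
}
```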

> Apple binary file incorrectly identified as text/x-sql due to filename
> --
>
> Key: TIKA-4172
> URL: https://issues.apache.org/jira/browse/TIKA-4172
> Project: Tika
>  Issue Type: Bug
>  Components: general
>Affects Versions: 2.9.1
>Reporter: martin k.
>Priority: Minor
>
> This is related to [https://github.com/eikek/docspell/issues/2376] and 
> [https://github.com/eikek/docspell/issues/2403].
> Take the following Base64 encoding of a binary Apple-generated file. No idea 
> what it does. You can get the file by piping the following to e.g. {{base64 
> -d > something.sql}}
> {code:java}
> ABRkMDEwMWM2Nl9teVNRTDQwLnNxbAAA
> bUJJTgAA
> AACCgf+/AAA=
> {code}
> If this file is named {{something.sql}}, then Tika will classify it as 
> {{text/x-sql}}, which it is not. It seems that more weight is given to 
> the filename (extension) than to the fact that the file is binary.





[jira] [Commented] (TIKA-4172) Apple binary file incorrectly identified as text/x-sql due to filename

2023-11-24 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17789542#comment-17789542
 ] 

Tilman Hausherr commented on TIKA-4172:
---

The detection interface defines application/octet-stream as the default when 
the type is unknown. tika-mimetypes.xml doesn't seem to have any magic that 
matches your file content.

> Apple binary file incorrectly identified as text/x-sql due to filename
> --
>
> Key: TIKA-4172
> URL: https://issues.apache.org/jira/browse/TIKA-4172
> Project: Tika
>  Issue Type: Bug
>  Components: general
>Affects Versions: 2.9.1
>Reporter: martin k.
>Priority: Minor
>
> This is related to [https://github.com/eikek/docspell/issues/2376] and 
> [https://github.com/eikek/docspell/issues/2403].
> Take the following Base64 encoding of a binary Apple-generated file. No idea 
> what it does. You can get the file by piping the following to e.g. {{base64 
> -d > something.sql}}
> {code:java}
> ABRkMDEwMWM2Nl9teVNRTDQwLnNxbAAA
> bUJJTgAA
> AACCgf+/AAA=
> {code}
> If this file is named {{something.sql}}, then Tika will classify it as 
> {{text/x-sql}}, which it is not. It seems that more weight is given to 
> the filename (extension) than to the fact that the file is binary.





[jira] [Commented] (TIKA-4172) Apple binary file incorrectly identified as text/x-sql due to filename

2023-11-23 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17789318#comment-17789318
 ] 

Tilman Hausherr commented on TIKA-4172:
---

https://tika.apache.org/2.1.0/detection.html

"Where the name of the file is known, it is sometimes possible to guess the 
file type from the name or extension. Within the tika-mimetypes.xml file is a 
list of patterns which are used to identify the type from the filename.

However, because files may be renamed, this method of detection is quick but 
not always as accurate."
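
To illustrate why an unrecognized file named something.sql can end up as 
text/x-sql, here is a deliberately simplified model. It is not Tika's actual 
algorithm: ToyDetector, its single PDF signature, and its single ".sql" 
filename rule are all assumptions made for this sketch. Content magic is tried 
first, and the filename is consulted only when the content is not recognized.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Locale;

// Toy model only; the rules below are illustrative assumptions, not tika-mimetypes.xml.
final class ToyDetector {

    static final String OCTET_STREAM = "application/octet-stream";
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // Content-based guess: the default type means "no known signature matched".
    static String detectByContent(byte[] prefix) {
        if (prefix.length >= PDF_MAGIC.length
                && Arrays.equals(Arrays.copyOf(prefix, PDF_MAGIC.length), PDF_MAGIC)) {
            return "application/pdf";
        }
        return OCTET_STREAM;
    }

    // Name-based guess: quick, but wrong whenever a file carries a misleading name.
    static String detectByName(String name) {
        if (name != null && name.toLowerCase(Locale.ROOT).endsWith(".sql")) {
            return "text/x-sql";
        }
        return OCTET_STREAM;
    }

    // Fall back to the filename only when the content is not recognized.
    static String detect(byte[] prefix, String name) {
        String byContent = detectByContent(prefix);
        return OCTET_STREAM.equals(byContent) ? detectByName(name) : byContent;
    }

    public static void main(String[] args) {
        byte[] unknown = {0x00, 0x14, 0x64, 0x30};
        System.out.println(detect(unknown, "something.sql")); // text/x-sql
        System.out.println(detect(unknown, null));            // application/octet-stream
    }
}
```

In this model the filename does not outweigh the content; it is simply the only 
remaining evidence once no signature matches.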

> Apple binary file incorrectly identified as text/x-sql due to filename
> --
>
> Key: TIKA-4172
> URL: https://issues.apache.org/jira/browse/TIKA-4172
> Project: Tika
>  Issue Type: Bug
>  Components: general
>Affects Versions: 2.9.1
>Reporter: martin k.
>Priority: Minor
>
> This is related to [https://github.com/eikek/docspell/issues/2376] and 
> [https://github.com/eikek/docspell/issues/2403].
> Take the following Base64 encoding of a binary Apple-generated file. No idea 
> what it does. You can get the file by piping the following to e.g. {{base64 
> -d > something.sql}}
> {code:java}
> ABRkMDEwMWM2Nl9teVNRTDQwLnNxbAAA
> bUJJTgAA
> AACCgf+/AAA=
> {code}
> If this file is named {{something.sql}}, then Tika will classify it as 
> {{text/x-sql}}, which it is not. It seems that more weight is given to 
> the filename (extension) than to the fact that the file is binary.





[jira] [Comment Edited] (TIKA-4172) Apple binary file incorrectly identified as text/x-sql due to filename

2023-11-22 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17788982#comment-17788982
 ] 

Tilman Hausherr edited comment on TIKA-4172 at 11/23/23 5:05 AM:
-

Which tika call are you using? Have you tried detecting based on content only?


was (Author: tilman):
Which tika call are you using? Have you tried detecting purely on content?

> Apple binary file incorrectly identified as text/x-sql due to filename
> --
>
> Key: TIKA-4172
> URL: https://issues.apache.org/jira/browse/TIKA-4172
> Project: Tika
>  Issue Type: Bug
>  Components: general
>Affects Versions: 2.9.1
>Reporter: martin k.
>Priority: Minor
>
> This is related to [https://github.com/eikek/docspell/issues/2376] and 
> [https://github.com/eikek/docspell/issues/2403].
> Take the following Base64 encoding of a binary Apple-generated file. No idea 
> what it does. You can get the file by piping the following to e.g. 
> {{base64 -d > something.sql}}
> {code:java}
> ABRkMDEwMWM2Nl9teVNRTDQwLnNxbAAA
> bUJJTgAA
> AACCgf+/AAA=
> {code}
> If this file is named {{something.sql}}, then Tika will classify it as 
> {{text/x-sql}}, which it is not. It seems like more weight is given to 
> the filename (extension) than to the fact that the file is binary anyway.





[jira] [Commented] (TIKA-4172) Apple binary file incorrectly identified as text/x-sql due to filename

2023-11-22 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17788982#comment-17788982
 ] 

Tilman Hausherr commented on TIKA-4172:
---

Which tika call are you using? Have you tried detecting purely on content?

> Apple binary file incorrectly identified as text/x-sql due to filename
> --
>
> Key: TIKA-4172
> URL: https://issues.apache.org/jira/browse/TIKA-4172
> Project: Tika
>  Issue Type: Bug
>  Components: general
>Affects Versions: 2.9.1
>Reporter: martin k.
>Priority: Minor
>
> This is related to [https://github.com/eikek/docspell/issues/2376] and 
> [https://github.com/eikek/docspell/issues/2403].
> Take the following Base64 encoding of a binary Apple-generated file. No idea 
> what it does. You can get the file by piping the following to e.g. 
> {{base64 -d > something.sql}}
> {code:java}
> ABRkMDEwMWM2Nl9teVNRTDQwLnNxbAAA
> bUJJTgAA
> AACCgf+/AAA=
> {code}
> If this file is named {{something.sql}}, then Tika will classify it as 
> {{text/x-sql}}, which it is not. It seems like more weight is given to 
> the filename (extension) than to the fact that the file is binary anyway.





[jira] [Commented] (TIKA-4166) dependency updates for Tika 3.0

2023-11-04 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17782915#comment-17782915
 ] 

Tilman Hausherr commented on TIKA-4166:
---

The zookeeper update worked locally, but not on the CI :-(

> dependency updates for Tika 3.0
> ---
>
> Key: TIKA-4166
> URL: https://issues.apache.org/jira/browse/TIKA-4166
> Project: Tika
>  Issue Type: Task
>  Components: build
>Reporter: Tilman Hausherr
>Priority: Minor
> Fix For: 3.0.0-BETA
>
>
> Separate ticket for updates for 3.0, especially those not found by dependabot.





[jira] [Created] (TIKA-4166) dependency updates for Tika 3.0

2023-11-03 Thread Tilman Hausherr (Jira)
Tilman Hausherr created TIKA-4166:
-

 Summary: dependency updates for Tika 3.0
 Key: TIKA-4166
 URL: https://issues.apache.org/jira/browse/TIKA-4166
 Project: Tika
  Issue Type: Task
  Components: build
Reporter: Tilman Hausherr
 Fix For: 3.0.0-BETA


Separate ticket for updates for 3.0, especially those not found by dependabot.





[jira] [Created] (TIKA-4162) Update to 2.9.2

2023-10-21 Thread Tilman Hausherr (Jira)
Tilman Hausherr created TIKA-4162:
-

 Summary: Update to 2.9.2
 Key: TIKA-4162
 URL: https://issues.apache.org/jira/browse/TIKA-4162
 Project: Tika
  Issue Type: Task
  Components: build
Reporter: Tilman Hausherr







