[jira] [Updated] (TIKA-4310) Add CloseShield to JSoupParser

2024-09-16 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-4310:
--
Fix Version/s: 3.0.0

> Add CloseShield to JSoupParser
> --
>
> Key: TIKA-4310
> URL: https://issues.apache.org/jira/browse/TIKA-4310
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Minor
> Fix For: 3.0.0
>
>
> The JsoupParser under the hood closes the reader and thereby the stream. This 
> breaks the normal parser contract that the parser does not close the stream.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (TIKA-4310) Add CloseShield to JSoupParser

2024-09-16 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-4310.
---
Resolution: Fixed

> Add CloseShield to JSoupParser
> --
>
> Key: TIKA-4310
> URL: https://issues.apache.org/jira/browse/TIKA-4310
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Minor
>
> The JsoupParser under the hood closes the reader and thereby the stream. This 
> breaks the normal parser contract that the parser does not close the stream.





[jira] [Created] (TIKA-4310) Add CloseShield to JSoupParser

2024-09-16 Thread Tim Allison (Jira)
Tim Allison created TIKA-4310:
-

Summary: Add CloseShield to JSoupParser
Key: TIKA-4310
URL: https://issues.apache.org/jira/browse/TIKA-4310
Project: Tika
Issue Type: Task
Reporter: Tim Allison


The JsoupParser under the hood closes the reader and thereby the stream. This 
breaks the normal parser contract that the parser does not close the stream.
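
To illustrate the contract described above, here is a minimal, self-contained sketch of a close shield. It is not the actual TIKA-4310 change: `CloseShieldDemo`, `shield`, `TrackingStream` and `readAllAndClose` are hypothetical names, and the hand-rolled `FilterInputStream` only stands in for a library class such as Commons IO's `CloseShieldInputStream`. The idea is that the code that insists on closing its reader only ever sees the wrapper, so the caller's stream stays open.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

final class CloseShieldDemo {

    /** Hypothetical test double that records whether close() reached it. */
    static final class TrackingStream extends ByteArrayInputStream {
        boolean closed = false;

        TrackingStream(byte[] bytes) {
            super(bytes);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    /** Wraps a stream so that close() on the wrapper never reaches the delegate. */
    static InputStream shield(InputStream in) {
        return new FilterInputStream(in) {
            @Override
            public void close() {
                // Intentionally a no-op: the caller still owns the underlying stream.
            }
        };
    }

    /** Stands in for a library that closes its reader, and thereby the stream. */
    static String readAllAndClose(InputStream in) throws IOException {
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        }
    }

    public static void main(String[] args) throws IOException {
        TrackingStream raw = new TrackingStream("<p>hi</p>".getBytes(StandardCharsets.UTF_8));
        String text = readAllAndClose(shield(raw));
        System.out.println(text + " closed=" + raw.closed);
    }
}
```

Passing the unshielded stream to `readAllAndClose` would set `closed` to true, which is the behavior the issue describes as breaking the parser contract.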





[jira] [Commented] (TIKA-4305) Tika producing empty output for UCS encoded txt files; parses UTF-7 files as UTF-8

2024-09-10 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17880641#comment-17880641
 ] 

Tim Allison commented on TIKA-4305:
---

K. So there are two different issues.

1) Above, I was focusing on the difference in detection between UCS2 vs UTF-16 
and UCS4 vs UTF32. There's not much we can do with that.
2) The other issue is that Tika can have a hard time determining that an 
InputStream is a text file unless the filename is included as a hint. Without 
the file name, Tika detects octet-stream.

So, either of these work for communicating the file name to Tika:

a) System.out.println(new Tika().parseToString(Paths.get(".../multilingual_test_new_UCS-2.txt")));

b) Metadata metadata = new Metadata();
   try (InputStream is = TikaInputStream.get(Paths.get("/home/tallison/Downloads/multilingual_test_new_UCS-2.txt"), metadata)) {
       System.out.println(new Tika().parseToString(is, metadata));
   }

> Tika producing empty output for UCS encoded txt files; parses UTF-7 files as 
> UTF-8
> --
>
> Key: TIKA-4305
> URL: https://issues.apache.org/jira/browse/TIKA-4305
> Project: Tika
> Issue Type: Bug
> Components: tika-app, tika-core
> Affects Versions: 2.9.2
> Environment: Ubuntu 22.04 LTS
> Reporter: Manish S N
> Priority: Minor
> Attachments: multilingual_test_new_UCS-2.txt,
> multilingual_test_new_UCS-4.txt, multilingual_test_new_UTF-7.txt,
> multilingual_test_new_UTF-8.txt, pom.xml, tika_UTF-7_output.txt
>
>
> Tika produces an empty string as output for UCS-2 and UCS-4 encoded txt files.
> No logs or errors, just an empty string.
> Other formats are okay, except that UTF-7 files are parsed as UTF-8, which
> wreaks havoc with non-ASCII characters.
> How do I know that? I opened the UTF-7 file as UTF-8 using the open dialog of
> gedit and found the output similar.
>
> I am attaching all four encoded files along with Tika's output from parsing
> the UTF-7 file for reference.





[jira] [Comment Edited] (TIKA-4305) Tika producing empty output for UCS encoded txt files; parses UTF-7 files as UTF-8

2024-09-10 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17880641#comment-17880641
 ] 

Tim Allison edited comment on TIKA-4305 at 9/10/24 1:20 PM:


K. So there are two different issues.

1) Above, I was focusing on the difference in detection between UCS2 vs UTF-16 
and UCS4 vs UTF32. There's not much we can do with that.
2) The other issue is that Tika can have a hard time determining that an 
InputStream is a text file (esp for UTF16 and UTF32) unless the filename is 
included as a hint. Without the file name, Tika detects octet-stream.

There are several ways to pass the filename hint to Tika. These are two of them:

a) System.out.println(new Tika().parseToString(Paths.get(".../multilingual_test_new_UCS-2.txt")));

b) Metadata metadata = new Metadata();
   try (InputStream is = TikaInputStream.get(Paths.get(".../multilingual_test_new_UCS-2.txt"), metadata)) {
       System.out.println(new Tika().parseToString(is, metadata));
   }


was (Author: talli...@mitre.org):
K. So there are two different issues.

1) Above, I was focusing on the difference in detection between UCS2 vs UTF-16 
and UCS4 vs UTF32. There's not much we can do with that.
2) The other issue is that Tika can have a hard time determining that an 
InputStream is a text file unless the filename is included as a hint. Without 
the file name, Tika detects octet-stream.

So, either of these work for communicating the file name to Tika:

a) System.out.println(new Tika().parseToString(Paths.get(".../multilingual_test_new_UCS-2.txt")));

b) Metadata metadata = new Metadata();
   try (InputStream is = TikaInputStream.get(Paths.get(".../multilingual_test_new_UCS-2.txt"), metadata)) {
       System.out.println(new Tika().parseToString(is, metadata));
   }

> Tika producing empty output for UCS encoded txt files; parses UTF-7 files as 
> UTF-8
> --
>
> Key: TIKA-4305
> URL: https://issues.apache.org/jira/browse/TIKA-4305
> Project: Tika
> Issue Type: Bug
> Components: tika-app, tika-core
> Affects Versions: 2.9.2
> Environment: Ubuntu 22.04 LTS
> Reporter: Manish S N
> Priority: Minor
> Attachments: multilingual_test_new_UCS-2.txt,
> multilingual_test_new_UCS-4.txt, multilingual_test_new_UTF-7.txt,
> multilingual_test_new_UTF-8.txt, pom.xml, tika_UTF-7_output.txt
>
>
> Tika produces an empty string as output for UCS-2 and UCS-4 encoded txt files.
> No logs or errors, just an empty string.
> Other formats are okay, except that UTF-7 files are parsed as UTF-8, which
> wreaks havoc with non-ASCII characters.
> How do I know that? I opened the UTF-7 file as UTF-8 using the open dialog of
> gedit and found the output similar.
>
> I am attaching all four encoded files along with Tika's output from parsing
> the UTF-7 file for reference.





[jira] [Comment Edited] (TIKA-4305) Tika producing empty output for UCS encoded txt files; parses UTF-7 files as UTF-8

2024-09-10 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17880641#comment-17880641
 ] 

Tim Allison edited comment on TIKA-4305 at 9/10/24 1:18 PM:


K. So there are two different issues.

1) Above, I was focusing on the difference in detection between UCS2 vs UTF-16 
and UCS4 vs UTF32. There's not much we can do with that.
2) The other issue is that Tika can have a hard time determining that an 
InputStream is a text file unless the filename is included as a hint. Without 
the file name, Tika detects octet-stream.

So, either of these work for communicating the file name to Tika:

a) System.out.println(new Tika().parseToString(Paths.get(".../multilingual_test_new_UCS-2.txt")));

b) Metadata metadata = new Metadata();
   try (InputStream is = TikaInputStream.get(Paths.get(".../multilingual_test_new_UCS-2.txt"), metadata)) {
       System.out.println(new Tika().parseToString(is, metadata));
   }


was (Author: talli...@mitre.org):
K. So there are two different issues.

1) Above, I was focusing on the difference in detection between UCS2 vs UTF-16 
and UCS4 vs UTF32. There's not much we can do with that.
2) The other issue is that Tika can have a hard time determining that an 
InputStream is a text file unless the filename is included as a hint. Without 
the file name, Tika detects octet-stream.

So, either of these work for communicating the file name to Tika:

a) System.out.println(new Tika().parseToString(Paths.get(".../multilingual_test_new_UCS-2.txt")));

b) Metadata metadata = new Metadata();
   try (InputStream is = TikaInputStream.get(Paths.get("/home/tallison/Downloads/multilingual_test_new_UCS-2.txt"), metadata)) {
       System.out.println(new Tika().parseToString(is, metadata));
   }

> Tika producing empty output for UCS encoded txt files; parses UTF-7 files as 
> UTF-8
> --
>
> Key: TIKA-4305
> URL: https://issues.apache.org/jira/browse/TIKA-4305
> Project: Tika
> Issue Type: Bug
> Components: tika-app, tika-core
> Affects Versions: 2.9.2
> Environment: Ubuntu 22.04 LTS
> Reporter: Manish S N
> Priority: Minor
> Attachments: multilingual_test_new_UCS-2.txt,
> multilingual_test_new_UCS-4.txt, multilingual_test_new_UTF-7.txt,
> multilingual_test_new_UTF-8.txt, pom.xml, tika_UTF-7_output.txt
>
>
> Tika produces an empty string as output for UCS-2 and UCS-4 encoded txt files.
> No logs or errors, just an empty string.
> Other formats are okay, except that UTF-7 files are parsed as UTF-8, which
> wreaks havoc with non-ASCII characters.
> How do I know that? I opened the UTF-7 file as UTF-8 using the open dialog of
> gedit and found the output similar.
>
> I am attaching all four encoded files along with Tika's output from parsing
> the UTF-7 file for reference.





[jira] [Commented] (TIKA-4307) Text in header not extracted for Microsoft Word doc file

2024-09-10 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17880628#comment-17880628
 ] 

Tim Allison commented on TIKA-4307:
---

I asked for help from fellow POI devs: 
https://bz.apache.org/bugzilla/show_bug.cgi?id=69314

> Text in header not extracted for Microsoft Word doc file
> 
>
> Key: TIKA-4307
> URL: https://issues.apache.org/jira/browse/TIKA-4307
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.9.2
> Reporter: August Valera
> Priority: Major
> Attachments: 560702J-2x-converted.doc, 560702J-converted.docx, 
> 560702J-full-output.txt, 560702J.doc, screenshot-1.png
>
>
> We have a Microsoft Word doc file with text in the header. That header text 
> is not successfully extracted alongside the file content, but converting the 
> file to a docx file results in successful extraction.
> Samples are attached, conversion done using cloudconvert.com.
>  * [^560702J.doc] Original doc file, missing content
>  * [^560702J-converted.docx] Converted to docx file, correct output
>  * [^560702J-2x-converted.doc] Docx file converted back to doc, again missing 
> content
> h3. Current Behavior
> doc files omit header text. docx files extract header text correctly.
> h3. Expected Behavior
> doc and docx files with identical content in header should result in 
> identical output





[jira] [Commented] (TIKA-4305) Tika producing empty output for UCS encoded txt files; parses UTF-7 files as UTF-8

2024-09-10 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17880594#comment-17880594
 ] 

Tim Allison commented on TIKA-4305:
---

Thank you. Yes, as I mentioned above, I effectively conflated UCS2=UTF-16 and 
UCS4=UTF32 and declared victory.

Are there significant differences in bytes in the UCS2 vs UTF-16 or for UCS4 vs 
UTF32 in the file you submitted? The UniversalCharsetDetector linked above does 
mention {{X-ISO-10646-UCS-4-3412 / X-ISO-10646-UCS-4-2143}} but it did not 
detect UCS4 on the test file.

Unfortunately, as with UTF-7, the answer is basically the same: unless you can 
get the upstream projects to detect these charsets or unless you can find 
another detector with a friendly license, there's not much we can do.
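
One way to look at the byte-level question above with the JDK alone is sketched below. It does not use Tika or either charset detector, and `UcsVsUtfDemo` and its methods are hypothetical names. It relies on the definitions of the encodings: UCS-2 is a fixed-width 16-bit encoding limited to the Basic Multilingual Plane, so for BMP-only text its bytes match UTF-16, and the two diverge only once a supplementary character needs a surrogate pair. UTF-32 uses four bytes per code point throughout. The sketch assumes the running JDK ships a `UTF-32BE` charset, which is not one of the charsets the Java specification guarantees.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class UcsVsUtfDemo {

    /** Number of bytes the text occupies in UTF-16BE (no byte order mark). */
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    /** Number of bytes in UTF-32BE; assumes the JDK provides this charset. */
    static int utf32Bytes(String s) {
        return s.getBytes(Charset.forName("UTF-32BE")).length;
    }

    /** True when every code point lies in the BMP, i.e. UCS-2 could represent the text. */
    static boolean fitsInUcs2(String s) {
        return s.codePoints().allMatch(Character::isBmpCodePoint);
    }

    public static void main(String[] args) {
        String bmpOnly = "h\u00e9llo";                             // 5 BMP code points
        String supplementary = new String(Character.toChars(0x1F600)); // 1 code point outside the BMP

        System.out.println("BMP text: utf16=" + utf16Bytes(bmpOnly)
                + " utf32=" + utf32Bytes(bmpOnly)
                + " fitsInUcs2=" + fitsInUcs2(bmpOnly));
        System.out.println("Supplementary: utf16=" + utf16Bytes(supplementary)
                + " utf32=" + utf32Bytes(supplementary)
                + " fitsInUcs2=" + fitsInUcs2(supplementary));
    }
}
```

If the submitted test files contain only BMP characters, a detector has no byte-level signal to tell UCS-2 from UTF-16, which is consistent with treating the two as equivalent for detection purposes.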

> Tika producing empty output for UCS encoded txt files; parses UTF-7 files as 
> UTF-8
> --
>
> Key: TIKA-4305
> URL: https://issues.apache.org/jira/browse/TIKA-4305
> Project: Tika
> Issue Type: Bug
> Components: tika-app, tika-core
> Affects Versions: 2.9.2
> Environment: Ubuntu 22.04 LTS
> Reporter: Manish S N
> Priority: Minor
> Attachments: multilingual_test_new_UCS-2.txt,
> multilingual_test_new_UCS-4.txt, multilingual_test_new_UTF-7.txt,
> multilingual_test_new_UTF-8.txt, pom.xml, tika_UTF-7_output.txt
>
>
> Tika produces an empty string as output for UCS-2 and UCS-4 encoded txt files.
> No logs or errors, just an empty string.
> Other formats are okay, except that UTF-7 files are parsed as UTF-8, which
> wreaks havoc with non-ASCII characters.
> How do I know that? I opened the UTF-7 file as UTF-8 using the open dialog of
> gedit and found the output similar.
>
> I am attaching all four encoded files along with Tika's output from parsing
> the UTF-7 file for reference.





[jira] [Commented] (TIKA-4305) Tika producing empty output for UCS encoded txt files; parses UTF-7 files as UTF-8

2024-09-09 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17880344#comment-17880344
 ] 

Tim Allison commented on TIKA-4305:
---

For those raising an eyebrow over anyone still using UTF-7, you are not alone: 
https://en.wikipedia.org/wiki/UTF-7

> Tika producing empty output for UCS encoded txt files; parses UTF-7 files as 
> UTF-8
> --
>
> Key: TIKA-4305
> URL: https://issues.apache.org/jira/browse/TIKA-4305
> Project: Tika
> Issue Type: Bug
> Components: tika-app, tika-core
> Affects Versions: 2.9.2
> Environment: Ubuntu 22.04 LTS
> Reporter: Manish S N
> Priority: Minor
> Attachments: multilingual_test_new_UCS-2.txt,
> multilingual_test_new_UCS-4.txt, multilingual_test_new_UTF-7.txt,
> multilingual_test_new_UTF-8.txt, tika_UTF-7_output.txt
>
>
> Tika produces an empty string as output for UCS-2 and UCS-4 encoded txt files.
> No logs or errors, just an empty string.
> Other formats are okay, except that UTF-7 files are parsed as UTF-8, which
> wreaks havoc with non-ASCII characters.
> How do I know that? I opened the UTF-7 file as UTF-8 using the open dialog of
> gedit and found the output similar.
>
> I am attaching all four encoded files along with Tika's output from parsing
> the UTF-7 file for reference.





[jira] [Commented] (TIKA-4305) Tika producing empty output for UCS encoded txt files; parses UTF-7 files as UTF-8

2024-09-09 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17880342#comment-17880342
 ] 

Tim Allison commented on TIKA-4305:
---

I get text that I think is correct with tika app 2.9.2: {{java -jar 
tika-app-2.9.2.jar multilingual_test_new_UCS-4.txt}} and when I run it against 
UCS-2.

How are you running Tika when you get an empty string for the UTF-16 and UTF-32?

> Tika producing empty output for UCS encoded txt files; parses UTF-7 files as 
> UTF-8
> --
>
> Key: TIKA-4305
> URL: https://issues.apache.org/jira/browse/TIKA-4305
> Project: Tika
> Issue Type: Bug
> Components: tika-app, tika-core
> Affects Versions: 2.9.2
> Environment: Ubuntu 22.04 LTS
> Reporter: Manish S N
> Priority: Minor
> Attachments: multilingual_test_new_UCS-2.txt,
> multilingual_test_new_UCS-4.txt, multilingual_test_new_UTF-7.txt,
> multilingual_test_new_UTF-8.txt, tika_UTF-7_output.txt
>
>
> Tika produces an empty string as output for UCS-2 and UCS-4 encoded txt files.
> No logs or errors, just an empty string.
> Other formats are okay, except that UTF-7 files are parsed as UTF-8, which
> wreaks havoc with non-ASCII characters.
> How do I know that? I opened the UTF-7 file as UTF-8 using the open dialog of
> gedit and found the output similar.
>
> I am attaching all four encoded files along with Tika's output from parsing
> the UTF-7 file for reference.





[jira] [Commented] (TIKA-4305) Tika producing empty output for UCS encoded txt files; parses UTF-7 files as UTF-8

2024-09-09 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17880341#comment-17880341
 ] 

Tim Allison commented on TIKA-4305:
---

Thank you for raising this issue.

For the following, I'm running a unit test in Tika's main branch in the 
{{tika-parsers-standard-package}} module:

* UTF-8 works
* UCS-4 is detected correctly by the ICU4j detector as UTF-32BE.
* UCS-2 is detected correctly by the ICU4j detector as UTF-16LE.
* UTF-7 is incorrectly detected as windows-1252 by the 
UniversalCharsetDetector. If I turn off the UniversalCharsetDetector, the ICU4j 
detector incorrectly detects charset=ISO-8859-1

The fork of UniversalCharsetDetector that we use 
(https://github.com/albfernandez/juniversalchardet) does not claim to detect 
UTF-7. ICU4j also does not detect UTF-7 
(https://unicode-org.github.io/icu/userguide/conversion/detection.html#detected-encodings).
So, if you can open a ticket in one of those projects and/or identify another 
charset detector that has a friendly license and can detect UTF-7, we should 
look into adding that to Tika.




> Tika producing empty output for UCS encoded txt files; parses UTF-7 files as 
> UTF-8
> --
>
> Key: TIKA-4305
> URL: https://issues.apache.org/jira/browse/TIKA-4305
> Project: Tika
> Issue Type: Bug
> Components: tika-app, tika-core
> Affects Versions: 2.9.2
> Environment: Ubuntu 22.04 LTS
> Reporter: Manish S N
> Priority: Minor
> Attachments: multilingual_test_new_UCS-2.txt,
> multilingual_test_new_UCS-4.txt, multilingual_test_new_UTF-7.txt,
> multilingual_test_new_UTF-8.txt, tika_UTF-7_output.txt
>
>
> Tika produces an empty string as output for UCS-2 and UCS-4 encoded txt files.
> No logs or errors, just an empty string.
> Other formats are okay, except that UTF-7 files are parsed as UTF-8, which
> wreaks havoc with non-ASCII characters.
> How do I know that? I opened the UTF-7 file as UTF-8 using the open dialog of
> gedit and found the output similar.
>
> I am attaching all four encoded files along with Tika's output from parsing
> the UTF-7 file for reference.





[jira] [Commented] (TIKA-4306) ffmpeg all the images

2024-09-09 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17880335#comment-17880335
 ] 

Tim Allison commented on TIKA-4306:
---

The way to accomplish this would be to add more mime types here: 
https://github.com/apache/tika/blob/main/tika-core/src/main/resources/org/apache/tika/parser/external/tika-external-parsers.xml#L30

Any others besides jpeg2000 and jbig?

> ffmpeg all the images
> -
>
> Key: TIKA-4306
> URL: https://issues.apache.org/jira/browse/TIKA-4306
> Project: Tika
> Issue Type: Wish
> Components: tika-core
> Affects Versions: 3.0.0-BETA
> Reporter: Jim Northrup
> Priority: Major
>
> jpeg2000 and JBIG and a few other image formats (numerous) could benefit from 
> the generic features of a simple  `ffmpeg -i  output`
> seems like webp is low hanging fruit here as well as the format updates that 
> ffmpeg experiences rarely alter the commandline.
> literally solving for one image format is typically only changing a file 
> extension with a few exemptions.
> Thanks in advance for any conversation toward testing this 





[jira] [Commented] (TIKA-4239) Update to 2.9.3

2024-08-28 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877432#comment-17877432
 ] 

Tim Allison commented on TIKA-4239:
---

Thank you!

> Update to 2.9.3
> ---
>
> Key: TIKA-4239
> URL: https://issues.apache.org/jira/browse/TIKA-4239
> Project: Tika
> Issue Type: Task
> Components: build
> Affects Versions: 2.9.2
> Reporter: Tilman Hausherr
> Priority: Minor
>






[jira] [Created] (TIKA-4301) Factor tika pipes base classes out of tika-core into a tika-pipes-core module

2024-08-27 Thread Tim Allison (Jira)
Tim Allison created TIKA-4301:
-

Summary: Factor tika pipes base classes out of tika-core into a tika-pipes-core module
Key: TIKA-4301
URL: https://issues.apache.org/jira/browse/TIKA-4301
Project: Tika
Issue Type: Task
Reporter: Tim Allison


This is part of the larger vision of TIKA-4272. It will slim down tika-core 
on its own, and it will keep us from adding extra dependencies (pf4j, semver, 
and...) to tika-core for TIKA-4272.





[jira] [Commented] (TIKA-4280) Tasks for the 3.0.0 release

2024-08-21 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875614#comment-17875614
 ] 

Tim Allison commented on TIKA-4280:
---

Got it. Now I see. Thank you. I don't think we have an active stakeholder for 
tika-dl or several of the submodules in tika-parsers-ml. :( I'd be happy to be 
proven wrong!

So, that leaves the *-M* issues... are we ok with leaving those two as they 
are? 

> Tasks for the 3.0.0 release
> ---
>
> Key: TIKA-4280
> URL: https://issues.apache.org/jira/browse/TIKA-4280
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
> I'm too lazy to open separate tickets. Please do so if desired.
> Some items:
> * Before releasing the real 3.0.0 we need to remove any "-M" dependencies
> * Decide about the ffmpeg issue and the hdf5 issue
> * Run the regression tests vs 2.9.x
> * Convert tika-grpc to use the dependency plugin instead of the shade plugin
> * Turn javadocs back on. I got errors during the deploy process because 
> javadoc needed the auto-generated code ("cannot find symbol 
> DeleteFetcherRequest"). We need to enable javadocs for the rest of the 
> project.
> * TIKA-4290 Tilman question
> Other things? Thank you [~tilman] for the first two!





[jira] [Commented] (TIKA-4280) Tasks for the 3.0.0 release

2024-08-21 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875577#comment-17875577
 ] 

Tim Allison commented on TIKA-4280:
---

bq. Decide about the ffmpeg issue and the hdf5 issue

Sorry, what are the issues here?

> Tasks for the 3.0.0 release
> ---
>
> Key: TIKA-4280
> URL: https://issues.apache.org/jira/browse/TIKA-4280
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
> I'm too lazy to open separate tickets. Please do so if desired.
> Some items:
> * Before releasing the real 3.0.0 we need to remove any "-M" dependencies
> * Decide about the ffmpeg issue and the hdf5 issue
> * Run the regression tests vs 2.9.x
> * Convert tika-grpc to use the dependency plugin instead of the shade plugin
> * Turn javadocs back on. I got errors during the deploy process because 
> javadoc needed the auto-generated code ("cannot find symbol 
> DeleteFetcherRequest"). We need to enable javadocs for the rest of the 
> project.
> * TIKA-4290 Tilman question
> Other things? Thank you [~tilman] for the first two!





[jira] [Commented] (TIKA-4280) Tasks for the 3.0.0 release

2024-08-21 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875579#comment-17875579
 ] 

Tim Allison commented on TIKA-4280:
---

bq. TIKA-4290 Tilman question

Does anything remain on this issue? I think we're good. That was a bunch of 
cleanup.

> Tasks for the 3.0.0 release
> ---
>
> Key: TIKA-4280
> URL: https://issues.apache.org/jira/browse/TIKA-4280
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
> I'm too lazy to open separate tickets. Please do so if desired.
> Some items:
> * Before releasing the real 3.0.0 we need to remove any "-M" dependencies
> * Decide about the ffmpeg issue and the hdf5 issue
> * Run the regression tests vs 2.9.x
> * Convert tika-grpc to use the dependency plugin instead of the shade plugin
> * Turn javadocs back on. I got errors during the deploy process because 
> javadoc needed the auto-generated code ("cannot find symbol 
> DeleteFetcherRequest"). We need to enable javadocs for the rest of the 
> project.
> * TIKA-4290 Tilman question
> Other things? Thank you [~tilman] for the first two!





[jira] [Commented] (TIKA-4280) Tasks for the 3.0.0 release

2024-08-21 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875574#comment-17875574
 ] 

Tim Allison commented on TIKA-4280:
---

bq. Before releasing the real 3.0.0 we need to remove any "-M" dependencies

I see commons.collections4 4.5.0-M2 and dl4j 1.0.0-M2.1 in Tika's parent pom. 

For commons collections4, the last non-M release was in 2019. Is there enough 
of a reason to revert to 4.4?

For dl4j, the last non-M release would take us back to 2017 with 0.9.1. Is 
there enough of a reason to revert to 0.9.1?

Looking at transitive dependencies, I see 
org.datavec:datavec-data-image:jar:1.0.0-M2.1, org.nd4j:jackson:jar:1.0.0-M2.1 
and a few other org.nd4j:* (which are consistent with dl4j). So no surprises in 
transitive dependencies?

Are there other *-M dependencies? Or has this already been cleaned up? Thank 
you [~tilman] for raising this point.
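
As a rough aid for the audit described above, the check for "-M" versions can be sketched as a small filter over version strings. This is only an illustration: `MilestoneVersions` and its methods are hypothetical names, the rule that a milestone is marked by "-M" followed by a digit is an assumption about the naming convention, and the sample strings are the versions quoted in this comment.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class MilestoneVersions {

    // Assumption: a milestone release carries a "-M" qualifier followed by a digit,
    // e.g. "4.5.0-M2" or "1.0.0-M2.1".
    private static final Pattern MILESTONE = Pattern.compile("-M\\d");

    /** True when the version string (or full coordinate) contains a milestone qualifier. */
    static boolean isMilestone(String version) {
        return MILESTONE.matcher(version).find();
    }

    /** Keeps only the entries that look like milestone versions. */
    static List<String> milestones(List<String> versions) {
        return versions.stream()
                .filter(MilestoneVersions::isMilestone)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> seen = Arrays.asList(
                "4.5.0-M2",                                       // commons collections4, from the comment
                "4.4",                                            // last non-M commons collections4 release
                "org.datavec:datavec-data-image:jar:1.0.0-M2.1",  // transitive, from the comment
                "0.9.1");                                         // last non-M dl4j release
        System.out.println(milestones(seen));
    }
}
```

The same predicate could be applied to each line of a dependency listing, since it matches the qualifier anywhere in the string.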

> Tasks for the 3.0.0 release
> ---
>
> Key: TIKA-4280
> URL: https://issues.apache.org/jira/browse/TIKA-4280
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
> I'm too lazy to open separate tickets. Please do so if desired.
> Some items:
> * Before releasing the real 3.0.0 we need to remove any "-M" dependencies
> * Decide about the ffmpeg issue and the hdf5 issue
> * Run the regression tests vs 2.9.x
> * Convert tika-grpc to use the dependency plugin instead of the shade plugin
> * Turn javadocs back on. I got errors during the deploy process because 
> javadoc needed the auto-generated code ("cannot find symbol 
> DeleteFetcherRequest"). We need to enable javadocs for the rest of the 
> project.
> * TIKA-4290 Tilman question
> Other things? Thank you [~tilman] for the first two!





[jira] [Resolved] (TIKA-4299) Clean up pagination in AbstractPDF2XHTML

2024-08-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-4299.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

> Clean up pagination in AbstractPDF2XHTML
> 
>
> Key: TIKA-4299
> URL: https://issues.apache.org/jira/browse/TIKA-4299
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Trivial
> Fix For: 3.0.0
>
>
> This is a follow on to TIKA-4296. We should remove the "why this hack" 
> comment and perhaps clean up small bits of other code related to pagination 
> in AbstractPDF2XHTML.





[jira] [Created] (TIKA-4299) Clean up pagination in AbstractPDF2XHTML

2024-08-19 Thread Tim Allison (Jira)
Tim Allison created TIKA-4299:
-

Summary: Clean up pagination in AbstractPDF2XHTML
Key: TIKA-4299
URL: https://issues.apache.org/jira/browse/TIKA-4299
Project: Tika
Issue Type: Task
Reporter: Tim Allison


This is a follow on to TIKA-4296. We should remove the "why this hack" comment 
and perhaps clean up small bits of other code related to pagination in 
AbstractPDF2XHTML.





[jira] [Commented] (TIKA-4290) Fix code inspection anomalies

2024-08-19 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874900#comment-17874900
 ] 

Tim Allison commented on TIKA-4290:
---

Thank you [~tilman] and [~dkryukov]!

> Fix code inspection anomalies
> -
>
> Key: TIKA-4290
> URL: https://issues.apache.org/jira/browse/TIKA-4290
> Project: Tika
> Issue Type: Bug
> Affects Versions: 2.9.2
> Reporter: Tilman Hausherr
> Priority: Minor
> Fix For: 3.0.0, 2.9.3
>
>






[jira] [Resolved] (TIKA-4295) Allow bypass of emitKey in AbstractEmbeddedDocumentBytesHandler

2024-08-07 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-4295.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

> Allow bypass of emitKey in AbstractEmbeddedDocumentBytesHandler
> ---
>
> Key: TIKA-4295
> URL: https://issues.apache.org/jira/browse/TIKA-4295
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Minor
> Fix For: 3.0.0
>
>
> We currently algorithmically determine the emit path for the bytes from 
> embedded documents based on the json emit key. In other words, if the primary 
> emit key is {{/a/b/c.json}}, we emit to {{/a/b/c/c-1000.jpg}}, for example.
> We should allow users to set a custom emit key root path so that, for 
> example, users could have the json going here: {{/a/b/json/abc.json}} and the 
> bytes going here: {{/a/b/bytes/abc-.doc}} 





[jira] [Created] (TIKA-4295) Allow bypass of emitKey in AbstractEmbeddedDocumentBytesHandler

2024-08-07 Thread Tim Allison (Jira)
Tim Allison created TIKA-4295:
-

Summary: Allow bypass of emitKey in AbstractEmbeddedDocumentBytesHandler
Key: TIKA-4295
URL: https://issues.apache.org/jira/browse/TIKA-4295
Project: Tika
Issue Type: Task
Reporter: Tim Allison


We currently algorithmically determine the emit path for the bytes from 
embedded documents based on the json emit key. In other words, if the primary 
emit key is {{/a/b/c.json}}, we emit to {{/a/b/c/c-1000.jpg}}, for example.

We should allow users to set a custom emit key root path so that, for example, 
users could have the json going here: {{/a/b/json/abc.json}} and the bytes 
going here: {{/a/b/bytes/abc-.doc}} 
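
The two path schemes described above can be sketched as plain string manipulation. This is a reconstruction from the examples in this issue only, not the code in AbstractEmbeddedDocumentBytesHandler: `EmitKeySketch`, `defaultBytesKey` and `customBytesKey` are hypothetical names, and the embedded id (`1000` in the first example, `7` here for the custom root, since the issue text does not give one) and the file extension are passed in as assumed inputs.

```java
final class EmitKeySketch {

    /** Drops a trailing ".ext" from the last path segment, if there is one. */
    static String stripExtension(String key) {
        int slash = key.lastIndexOf('/');
        int dot = key.lastIndexOf('.');
        return dot > slash ? key.substring(0, dot) : key;
    }

    /** Last path segment without its extension, e.g. "c" for "/a/b/c.json". */
    static String baseName(String key) {
        String noExt = stripExtension(key);
        return noExt.substring(noExt.lastIndexOf('/') + 1);
    }

    /** Current behavior as described: bytes land in a directory derived from the json emit key. */
    static String defaultBytesKey(String jsonEmitKey, int embeddedId, String extension) {
        return stripExtension(jsonEmitKey) + "/" + baseName(jsonEmitKey) + "-" + embeddedId + extension;
    }

    /** Proposed behavior as described: bytes land under a user-supplied root path instead. */
    static String customBytesKey(String rootPath, String jsonEmitKey, int embeddedId, String extension) {
        return rootPath + "/" + baseName(jsonEmitKey) + "-" + embeddedId + extension;
    }

    public static void main(String[] args) {
        System.out.println(defaultBytesKey("/a/b/c.json", 1000, ".jpg"));
        System.out.println(customBytesKey("/a/b/bytes", "/a/b/json/abc.json", 7, ".doc"));
    }
}
```

With a custom root, the json and the bytes no longer need to share a parent directory derived from the json key, which is the separation the issue asks for.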





[jira] [Resolved] (TIKA-4235) Add pipeline parameter to OpenSearch emitter

2024-08-06 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-4235.
---
Resolution: Won't Fix

Reopen if needed

> Add pipeline parameter to OpenSearch emitter
> 
>
> Key: TIKA-4235
> URL: https://issues.apache.org/jira/browse/TIKA-4235
> Project: Tika
>  Issue Type: New Feature
>Reporter: Tim Allison
>Priority: Trivial
>






[jira] [Resolved] (TIKA-4294) Simplify serialization/deserialization of ParseContext

2024-08-06 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-4294.
---
Resolution: Fixed

Sorry for all the noise on this one.

> Simplify serialization/deserialization of ParseContext
> --
>
> Key: TIKA-4294
> URL: https://issues.apache.org/jira/browse/TIKA-4294
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Trivial
> Fix For: 3.0.0
>
>
> Via [~dimirsen] (?) and [~tilman]'s ping on TIKA-4252, we should simplify the 
> serialization and deserialization of ParseContext to avoid redundancy of the 
> superclass.





[jira] [Reopened] (TIKA-4294) Simplify serialization/deserialization of ParseContext

2024-08-06 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison reopened TIKA-4294:
---

Should include some earlier simplification proposals from 
https://github.com/apache/tika/pull/1805

> Simplify serialization/deserialization of ParseContext
> --
>
> Key: TIKA-4294
> URL: https://issues.apache.org/jira/browse/TIKA-4294
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Trivial
> Fix For: 3.0.0
>
>
> Via [~dimirsen] (?) and [~tilman]'s ping on TIKA-4252, we should simplify the 
> serialization and deserialization of ParseContext to avoid redundancy of the 
> superclass.





[jira] [Resolved] (TIKA-4294) Simplify serialization/deserialization of ParseContext

2024-08-05 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-4294.
---
Resolution: Fixed

> Simplify serialization/deserialization of ParseContext
> --
>
> Key: TIKA-4294
> URL: https://issues.apache.org/jira/browse/TIKA-4294
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Trivial
> Fix For: 3.0.0
>
>
> Via [~dimirsen] (?) and [~tilman]'s ping on TIKA-4252, we should simplify the 
> serialization and deserialization of ParseContext to avoid redundancy of the 
> superclass.





[jira] [Commented] (TIKA-4251) [DISCUSS] move to cosium's git-code-format-maven-plugin with google-java-format

2024-08-05 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17871163#comment-17871163
 ] 

Tim Allison commented on TIKA-4251:
---

My sense is that at some point, we have to throw up our hands and hope. The 
initial commit will definitely be too large to review.

Perhaps I misunderstand, [~ndipiazza]?

I'm also still mildly annoyed that we'd still have to use checkstyle to 
prohibit wildcard imports, if I understand cosium + google format plugin 
correctly. Would we also still have to use checkstyle for license headers?

I still think this is a better option than what we have now.

> [DISCUSS] move to cosium's git-code-format-maven-plugin with 
> google-java-format
> ---
>
> Key: TIKA-4251
> URL: https://issues.apache.org/jira/browse/TIKA-4251
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> I was recently working a bit on incubator-stormcrawler, and I noticed that 
> they are using cosium's git-code-format-maven-plugin: 
> https://github.com/Cosium/git-code-format-maven-plugin
> I was initially annoyed that I couldn't quickly figure out what I had to fix 
> to make the linter happy, but then I realized there was a magic command: 
> {{mvn git-code-format:format-code}} which just fixed the code so that the 
> linter passed. 
> The one drawback I found is that it does not fix nor does it alert on 
> wildcard imports.  We could still use checkstyle for that but only have one 
> rule for checkstyle.
> The other drawback is that there is not a lot of room for variation from 
> google's style. This may actually be a benefit, too, of course.
> I just ran this on {{tika-core}} here: 
> https://github.com/apache/tika/tree/google-java-format
> What would you think about making this change for 3.x?





[jira] [Commented] (TIKA-4251) [DISCUSS] move to cosium's git-code-format-maven-plugin with google-java-format

2024-08-05 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17871162#comment-17871162
 ] 

Tim Allison commented on TIKA-4251:
---

Use IntelliJ with the checkstyle profile? Checkstyle set to google? Or 
IntelliJ's "Code Style->Scheme" set to google (e.g. 
https://github.com/google/styleguide/blob/gh-pages/intellij-java-google-style.xml)?

> [DISCUSS] move to cosium's git-code-format-maven-plugin with 
> google-java-format
> ---
>
> Key: TIKA-4251
> URL: https://issues.apache.org/jira/browse/TIKA-4251
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> I was recently working a bit on incubator-stormcrawler, and I noticed that 
> they are using cosium's git-code-format-maven-plugin: 
> https://github.com/Cosium/git-code-format-maven-plugin
> I was initially annoyed that I couldn't quickly figure out what I had to fix 
> to make the linter happy, but then I realized there was a magic command: 
> {{mvn git-code-format:format-code}} which just fixed the code so that the 
> linter passed. 
> The one drawback I found is that it does not fix nor does it alert on 
> wildcard imports.  We could still use checkstyle for that but only have one 
> rule for checkstyle.
> The other drawback is that there is not a lot of room for variation from 
> google's style. This may actually be a benefit, too, of course.
> I just ran this on {{tika-core}} here: 
> https://github.com/apache/tika/tree/google-java-format
> What would you think about making this change for 3.x?





[jira] [Commented] (TIKA-4294) Simplify serialization/deserialization of ParseContext

2024-08-05 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17871159#comment-17871159
 ] 

Tim Allison commented on TIKA-4294:
---

While adding a unit test, I found that we should also implement the TODO -- 
deserialize objects in lists. Once there's a clean build, I'll merge.

> Simplify serialization/deserialization of ParseContext
> --
>
> Key: TIKA-4294
> URL: https://issues.apache.org/jira/browse/TIKA-4294
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Trivial
> Fix For: 3.0.0
>
>
> Via [~dimirsen] (?) and [~tilman]'s ping on TIKA-4252, we should simplify the 
> serialization and deserialization of ParseContext to avoid redundancy of the 
> superclass.





[jira] [Reopened] (TIKA-4294) Simplify serialization/deserialization of ParseContext

2024-08-05 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison reopened TIKA-4294:
---
  Assignee: Tim Allison

> Simplify serialization/deserialization of ParseContext
> --
>
> Key: TIKA-4294
> URL: https://issues.apache.org/jira/browse/TIKA-4294
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Trivial
> Fix For: 3.0.0
>
>
> Via [~dimirsen] (?) and [~tilman]'s ping on TIKA-4252, we should simplify the 
> serialization and deserialization of ParseContext to avoid redundancy of the 
> superclass.





[jira] [Commented] (TIKA-4294) Simplify serialization/deserialization of ParseContext

2024-08-05 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17871137#comment-17871137
 ] 

Tim Allison commented on TIKA-4294:
---

K. Got it. This fixes the key to be the super class -- 
https://github.com/apache/tika/pull/1887/. I'll add a unit test before merging 
to guarantee expected behavior.

> Simplify serialization/deserialization of ParseContext
> --
>
> Key: TIKA-4294
> URL: https://issues.apache.org/jira/browse/TIKA-4294
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Trivial
> Fix For: 3.0.0
>
>
> Via [~dimirsen] (?) and [~tilman]'s ping on TIKA-4252, we should simplify the 
> serialization and deserialization of ParseContext to avoid redundancy of the 
> superclass.





[jira] [Commented] (TIKA-4294) Simplify serialization/deserialization of ParseContext

2024-08-05 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17871131#comment-17871131
 ] 

Tim Allison commented on TIKA-4294:
---

This is an example of what the json might look like. 

{code:json}
 "parseContext": {
   "org.apache.tika.metadata.filter.MetadataFilter": {
     "_class": "org.apache.tika.metadata.filter.CompositeMetadataFilter",
     "filters": [
       {
         "_class": "org.apache.tika.metadata.filter.SomeFilter",
 ...
{code}


> Simplify serialization/deserialization of ParseContext
> --
>
> Key: TIKA-4294
> URL: https://issues.apache.org/jira/browse/TIKA-4294
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Trivial
> Fix For: 3.0.0
>
>
> Via [~dimirsen] (?) and [~tilman]'s ping on TIKA-4252, we should simplify the 
> serialization and deserialization of ParseContext to avoid redundancy of the 
> superclass.





[jira] [Commented] (TIKA-4294) Simplify serialization/deserialization of ParseContext

2024-08-05 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17871130#comment-17871130
 ] 

Tim Allison commented on TIKA-4294:
---

The key in ParseContext should be {{superClazz}}, and I have a PR to fix that. 
:/

> Simplify serialization/deserialization of ParseContext
> --
>
> Key: TIKA-4294
> URL: https://issues.apache.org/jira/browse/TIKA-4294
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Trivial
> Fix For: 3.0.0
>
>
> Via [~dimirsen] (?) and [~tilman]'s ping on TIKA-4252, we should simplify the 
> serialization and deserialization of ParseContext to avoid redundancy of the 
> superclass.





[jira] [Commented] (TIKA-4294) Simplify serialization/deserialization of ParseContext

2024-08-05 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17871128#comment-17871128
 ] 

Tim Allison commented on TIKA-4294:
---

Thank you, @tilman. Apologies... will {{className}} never equal 
{{superClassName}}? 

And, y, my thinking was that we'd avoid having to call {{Class.forName}} twice 
if we're doing it for the same class -- imaginary efficiency. :D
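
A minimal sketch of the "load once if both names match" idea in the comment 
above. ClassResolutionSketch and resolve are hypothetical names and this is not 
the code from the PR; only Class.forName and Class.isAssignableFrom are real 
JDK calls.

```java
// Hypothetical helper, for illustration only; not the code from the Tika PR.
final class ClassResolutionSketch {
    private ClassResolutionSketch() {}

    // Resolves className, checking that it is a subtype of superClassName.
    // When both names are equal, Class.forName is called only once.
    static Class<?> resolve(String superClassName, String className) {
        try {
            Class<?> superClazz = Class.forName(superClassName);
            if (className.equals(superClassName)) {
                return superClazz; // same class, no second lookup
            }
            Class<?> clazz = Class.forName(className);
            if (!superClazz.isAssignableFrom(clazz)) {
                throw new IllegalArgumentException(
                        className + " is not a subtype of " + superClassName);
            }
            return clazz;
        } catch (ClassNotFoundException e) {
            // Wrapped so callers do not have to handle a checked exception.
            throw new IllegalArgumentException("Unknown class: " + e.getMessage(), e);
        }
    }
}
```

As the comment says, the saving is mostly imaginary: the class loader caches 
loaded classes, so the second lookup is cheap either way.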

> Simplify serialization/deserialization of ParseContext
> --
>
> Key: TIKA-4294
> URL: https://issues.apache.org/jira/browse/TIKA-4294
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Trivial
> Fix For: 3.0.0
>
>
> Via [~dimirsen] (?) and [~tilman]'s ping on TIKA-4252, we should simplify the 
> serialization and deserialization of ParseContext to avoid redundancy of the 
> superclass.





[jira] [Resolved] (TIKA-4294) Simplify serialization/deserialization of ParseContext

2024-08-05 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-4294.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Thank you [~dimirsen] and [~tilman]!

> Simplify serialization/deserialization of ParseContext
> --
>
> Key: TIKA-4294
> URL: https://issues.apache.org/jira/browse/TIKA-4294
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Trivial
> Fix For: 3.0.0
>
>
> Via [~dimirsen] (?) and [~tilman]'s ping on TIKA-4252, we should simplify the 
> serialization and deserialization of ParseContext to avoid redundancy of the 
> superclass.





[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-08-05 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17871095#comment-17871095
 ] 

Tim Allison commented on TIKA-4252:
---

Thank you [~tilman]! I'll work on cleaning this up here: 
https://issues.apache.org/jira/browse/TIKA-4294

> PipesClient#process - seems to lose the Fetch input metadata?
> -
>
> Key: TIKA-4252
> URL: https://issues.apache.org/jira/browse/TIKA-4252
> Project: Tika
>  Issue Type: Bug
>Reporter: Nicholas DiPiazza
>Priority: Major
> Fix For: 3.0.0
>
>
> when calling:
> PipesResult pipesResult = pipesClient.process(
>     new FetchEmitTuple(request.getFetchKey(),
>         new FetchKey(fetcher.getName(), request.getFetchKey()),
>         new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG,
>         FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));
> the tikaMetadata is not present in the fetch data when the fetch method is 
> called.
>  
> It's OK through this part: 
> UnsynchronizedByteArrayOutputStream bos =
>     UnsynchronizedByteArrayOutputStream.builder().get();
> try (ObjectOutputStream objectOutputStream = new ObjectOutputStream(bos)) {
>     objectOutputStream.writeObject(t);
> }
> byte[] bytes = bos.toByteArray();
> output.write(CALL.getByte());
> output.writeInt(bytes.length);
> output.write(bytes);
> output.flush();
>  
> I verified the bytes have the expected metadata from that point.
>  
> UPDATE: found issue
>  
> org.apache.tika.pipes.PipesServer#parseFromTuple
>  
> is using a new Metadata when it should only use empty metadata if fetch tuple 
> metadata is null.
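
A minimal sketch of the fallback the report asks for: keep the metadata that 
arrived with the tuple, and create an empty one only when it is null. A plain 
Map stands in for Tika's Metadata class here, and MetadataFallbackSketch and 
metadataFor are hypothetical names, not the code in PipesServer.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in, for illustration only; a Map plays the role of Metadata.
final class MetadataFallbackSketch {
    private MetadataFallbackSketch() {}

    // Returns the tuple's own metadata when present, so user-supplied values
    // survive; falls back to a fresh empty map only for a null input.
    static Map<String, String> metadataFor(Map<String, String> tupleMetadata) {
        return tupleMetadata != null ? tupleMetadata : new HashMap<>();
    }
}
```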





[jira] [Created] (TIKA-4294) Simplify serialization/deserialization of ParseContext

2024-08-05 Thread Tim Allison (Jira)
Tim Allison created TIKA-4294:
-

 Summary: Simplify serialization/deserialization of ParseContext
 Key: TIKA-4294
 URL: https://issues.apache.org/jira/browse/TIKA-4294
 Project: Tika
  Issue Type: Task
Reporter: Tim Allison


Via [~dimirsen] (?) and [~tilman]'s ping on TIKA-4252, we should simplify the 
serialization and deserialization of ParseContext to avoid redundancy of the 
superclass.





[jira] [Commented] (TIKA-4291) In JDBCEmitter local var dateFormats shadows class field with the same name

2024-08-05 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17871086#comment-17871086
 ] 

Tim Allison commented on TIKA-4291:
---

LGTM. Thank you!

> In JDBCEmitter local var dateFormats shadows class field with the same name
> ---
>
> Key: TIKA-4291
> URL: https://issues.apache.org/jira/browse/TIKA-4291
> Project: Tika
>  Issue Type: Bug
>  Components: tika-pipes
>Affects Versions: 3.0.0-BETA, 2.9.2
>Reporter: Dmitrii Kriukov
>Assignee: Tilman Hausherr
>Priority: Major
> Fix For: 3.0.0, 2.9.3
>
>
> Line 338 of  JDBCEmitter
> Local variable dateFormats is created, populated with values, but never used 
> in its scope.
> It's not clear how to fix. Was it planned to use class field with the same 
> type and name?
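
A minimal sketch of the shadowing pattern the report describes, assuming a 
hypothetical class ShadowingSketch; it is not the JDBCEmitter code and does not 
claim to be the fix that was applied. A local variable with the same name as a 
field hides the field, so values added to the local never reach it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical class, for illustration only; not the JDBCEmitter code.
final class ShadowingSketch {
    private final List<String> dateFormats = new ArrayList<>();

    // Reported shape: the local dateFormats shadows the field, is populated,
    // and is then discarded when the method returns.
    void initShadowed() {
        List<String> dateFormats = new ArrayList<>();
        dateFormats.add("yyyy-MM-dd");
    }

    // One possible correction: drop the local and write to the field.
    void initField() {
        this.dateFormats.add("yyyy-MM-dd");
    }

    int fieldSize() {
        return dateFormats.size();
    }
}
```

After initShadowed() the field is still empty; after initField() it holds one 
entry.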





[jira] [Resolved] (TIKA-4289) Further improvements to the metadata filter and serialization

2024-07-31 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-4289.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

> Further improvements to the metadata filter and serialization
> -
>
> Key: TIKA-4289
> URL: https://issues.apache.org/jira/browse/TIKA-4289
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Trivial
> Fix For: 3.0.0
>
>






[jira] [Resolved] (TIKA-4288) Allow user configuration of MetadataFilters in PipesServer

2024-07-30 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-4288.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

> Allow user configuration of MetadataFilters in PipesServer
> --
>
> Key: TIKA-4288
> URL: https://issues.apache.org/jira/browse/TIKA-4288
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
> Fix For: 3.0.0
>
>
> We're currently configuring metadata filters in tika-config.xml. It would be 
> helpful (at least in the PipesServer?) to allow users to configure metadata 
> filters at parse time via the ParseContext.





[jira] [Resolved] (TIKA-4287) Improve PDFParserConfig serialization

2024-07-30 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-4287.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

> Improve PDFParserConfig serialization
> -
>
> Key: TIKA-4287
> URL: https://issues.apache.org/jira/browse/TIKA-4287
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Trivial
> Fix For: 3.0.0
>
>
> I thought I had fixed this as part of the parsecontext serialization work. 
> The fix didn't make it into main.





[jira] [Created] (TIKA-4289) Further improvements to the metadata filter and serialization

2024-07-30 Thread Tim Allison (Jira)
Tim Allison created TIKA-4289:
-

 Summary: Further improvements to the metadata filter and 
serialization
 Key: TIKA-4289
 URL: https://issues.apache.org/jira/browse/TIKA-4289
 Project: Tika
  Issue Type: Task
Reporter: Tim Allison








[jira] [Created] (TIKA-4288) Allow user configuration of MetadataFilters in PipesServer

2024-07-30 Thread Tim Allison (Jira)
Tim Allison created TIKA-4288:
-

 Summary: Allow user configuration of MetadataFilters in PipesServer
 Key: TIKA-4288
 URL: https://issues.apache.org/jira/browse/TIKA-4288
 Project: Tika
  Issue Type: Task
Reporter: Tim Allison


We're currently configuring metadata filters in tika-config.xml. It would be 
helpful (at least in the PipesServer?) to allow users to configure metadata 
filters at parse time via the ParseContext.
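
A minimal sketch of the idea: a filter set on a per-request context overrides 
the one from static configuration. Everything here is a stand-in (Context, 
Filter, apply are hypothetical names); the Context class is only loosely 
modelled on a class-keyed lookup with a default and is not Tika's ParseContext 
or MetadataFilter.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical stand-ins, for illustration only; not Tika classes.
final class ParseTimeFilterSketch {
    private ParseTimeFilterSketch() {}

    // Per-request context keyed by class, with a default on lookup.
    static final class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key, T defaultValue) {
            Object value = entries.get(key);
            return value != null ? key.cast(value) : defaultValue;
        }
    }

    // A metadata filter maps one metadata map to another.
    interface Filter extends UnaryOperator<Map<String, String>> {}

    // Uses the filter from the request context if one was set, otherwise the
    // filter that came from static configuration.
    static Map<String, String> apply(Context context, Filter configured,
                                     Map<String, String> metadata) {
        return context.get(Filter.class, configured).apply(metadata);
    }
}
```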





[jira] [Created] (TIKA-4287) Improve PDFParserConfig serialization

2024-07-30 Thread Tim Allison (Jira)
Tim Allison created TIKA-4287:
-

 Summary: Improve PDFParserConfig serialization
 Key: TIKA-4287
 URL: https://issues.apache.org/jira/browse/TIKA-4287
 Project: Tika
  Issue Type: Task
Reporter: Tim Allison


I thought I had fixed this as part of the parsecontext serialization work. The 
fix didn't make it into main.





[jira] [Commented] (TIKA-4280) Tasks for the 3.0.0 release

2024-07-25 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17868724#comment-17868724
 ] 

Tim Allison commented on TIKA-4280:
---

Y, probably? I wasn't thinking of changing tika-server for 3.x just yet, but I 
thought we could start fresh with non-shaded jars with the tika-grpc server?

I'd hope for something similar to/as simple as this line: 
https://github.com/apache/tika-docker/blob/main/minimal/Dockerfile#L68 but know 
that .bat and .sh files get more complex shortly after the first proof of 
concept is developed. :D

What do you think?

One option for tika-server is that we move to non-shaded jars in the service 
scripts and actually fix them so that they work.




> Tasks for the 3.0.0 release
> ---
>
> Key: TIKA-4280
> URL: https://issues.apache.org/jira/browse/TIKA-4280
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> I'm too lazy to open separate tickets. Please do so if desired.
> Some items:
> * Before releasing the real 3.0.0 we need to remove any "-M" dependencies
> * Decide about the ffmpeg issue and the hdf5 issue
> * Run the regression tests vs 2.9.x
> * Convert tika-grpc to use the dependency plugin instead of the shade plugin
> * Turn javadocs back on. I got errors during the deploy process because 
> javadoc needed the auto-generated code ("cannot find symbol 
> DeleteFetcherRequest"). We need to enable javadocs for the rest of the 
> project.
> Other things? Thank you [~tilman] for the first two!





[jira] [Comment Edited] (TIKA-4280) Tasks for the 3.0.0 release

2024-07-25 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17868724#comment-17868724
 ] 

Tim Allison edited comment on TIKA-4280 at 7/25/24 3:49 PM:


Y, probably? I wasn't thinking of changing tika-server for 3.x just yet, but I 
thought we could start fresh with non-shaded jars with the tika-grpc server?

I'd hope for something similar to/as simple as this line: 
https://github.com/apache/tika-docker/blob/main/minimal/Dockerfile#L68 but know 
that .bat and .sh files get more complex shortly after the first proof of 
concept is developed. :D

What do you think?

One option for tika-server is that we keep the fat jar but also move 
to non-shaded jars in the service scripts and actually fix them so that they 
work.





was (Author: talli...@mitre.org):
Y, probably? I wasn't thinking of changing tika-server for 3.x just yet, but I 
thought we could start fresh with non-shaded jars with the tika-grpc server?

I'd hope for something similar to/as simple as this line: 
https://github.com/apache/tika-docker/blob/main/minimal/Dockerfile#L68 but know 
that .bat and .sh files get more complex shortly after the first proof of 
concept is developed. :D

What do you think?

One option for tika-server is that we move to non-shaded jars in the service 
scripts and actually fix them so that they work.




> Tasks for the 3.0.0 release
> ---
>
> Key: TIKA-4280
> URL: https://issues.apache.org/jira/browse/TIKA-4280
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> I'm too lazy to open separate tickets. Please do so if desired.
> Some items:
> * Before releasing the real 3.0.0 we need to remove any "-M" dependencies
> * Decide about the ffmpeg issue and the hdf5 issue
> * Run the regression tests vs 2.9.x
> * Convert tika-grpc to use the dependency plugin instead of the shade plugin
> * Turn javadocs back on. I got errors during the deploy process because 
> javadoc needed the auto-generated code ("cannot find symbol 
> DeleteFetcherRequest"). We need to enable javadocs for the rest of the 
> project.
> Other things? Thank you [~tilman] for the first two!





[jira] [Resolved] (TIKA-4285) Invalid Link for changelog CHANGES.txt files

2024-07-22 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-4285.
---
Resolution: Fixed

Thank you [~tom_1st] and [~tilman]! Should be fixed now.

> Invalid Link for changelog CHANGES.txt files
> 
>
> Key: TIKA-4285
> URL: https://issues.apache.org/jira/browse/TIKA-4285
> Project: Tika
>  Issue Type: Task
>Affects Versions: 2.9.0, 2.9.1, 2.9.2
>Reporter: Lonzak
>Assignee: Tim Allison
>Priority: Major
>
> On the tika [start page|https://tika.apache.org/] the linked change log files 
> CHANGES.txt starting with version 2.9.0 are missing/broken.
>  
> {+}Working{+}:
> [https://archive.apache.org/dist/tika/2.8.0/CHANGES-2.8.0.txt]
> [https://dist.apache.org/repos/dist/release/tika/3.0.0-BETA2/CHANGES-3.0.0-BETA2.txt]
> +Not working:+
> [https://archive.apache.org/dist/]{-}{color:#ff}release{color}{-}/tika/2.9.0/CHANGES-2.9.0.txt
> [https://archive.apache.org/dist/]{-}{color:#ff}release{color}{-}/tika/2.9.1/CHANGES-2.9.1.txt
> [https://archive.apache.org/dist/]{-}{color:#ff}release{color}{-}/tika/2.9.2/CHANGES-2.9.2.txt
> [https://archive.apache.org/dist/tika/3.0.0-BETA/CHANGES-3.0.0.txt]
>  
> +*Wrong Text*+ (as mentioned by Tilman)
> 15 July 2024: Apache Tika Release Apache Tika *3.0.0-BETA2* has been 
> released! This release includes several bug fixes and dependency upgrades. 
> Please see the 
> [CHANGES.txt|https://dist.apache.org/repos/dist/release/tika/3.0.0-BETA2/CHANGES-3.0.0-BETA2.txt]
>  file for the full list of changes in the release and have a look at the 
> download page for more information on how to obtain{color:#FF} *Apache 
> Tika 2.9.2.*{color}





[jira] [Assigned] (TIKA-4285) Invalid Link for changelog CHANGES.txt files

2024-07-22 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison reassigned TIKA-4285:
-

Assignee: Tim Allison

> Invalid Link for changelog CHANGES.txt files
> 
>
> Key: TIKA-4285
> URL: https://issues.apache.org/jira/browse/TIKA-4285
> Project: Tika
>  Issue Type: Task
>Affects Versions: 2.9.0, 2.9.1, 2.9.2
>Reporter: Lonzak
>Assignee: Tim Allison
>Priority: Major
>
> On the tika [start page|https://tika.apache.org/] the linked change log files 
> CHANGES.txt starting with version 2.9.0 are missing/broken.
>  
> {+}Working{+}:
> [https://archive.apache.org/dist/tika/2.8.0/CHANGES-2.8.0.txt]
> [https://dist.apache.org/repos/dist/release/tika/3.0.0-BETA2/CHANGES-3.0.0-BETA2.txt]
> +Not working:+
> [https://archive.apache.org/dist/]{-}{color:#ff}release{color}{-}/tika/2.9.0/CHANGES-2.9.0.txt
> [https://archive.apache.org/dist/]{-}{color:#ff}release{color}{-}/tika/2.9.1/CHANGES-2.9.1.txt
> [https://archive.apache.org/dist/]{-}{color:#ff}release{color}{-}/tika/2.9.2/CHANGES-2.9.2.txt
> [https://archive.apache.org/dist/tika/3.0.0-BETA/CHANGES-3.0.0.txt]
>  
> +*Wrong Text*+ (as mentioned by Tilman)
> 15 July 2024: Apache Tika Release Apache Tika *3.0.0-BETA2* has been 
> released! This release includes several bug fixes and dependency upgrades. 
> Please see the 
> [CHANGES.txt|https://dist.apache.org/repos/dist/release/tika/3.0.0-BETA2/CHANGES-3.0.0-BETA2.txt]
>  file for the full list of changes in the release and have a look at the 
> download page for more information on how to obtain{color:#FF} *Apache 
> Tika 2.9.2.*{color}





[jira] [Commented] (TIKA-4281) Fix javadoc plugin configuration

2024-07-17 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17866843#comment-17866843
 ] 

Tim Allison commented on TIKA-4281:
---

For some reason, now, it looks like {{javadocs}} works fine with {{install}} as 
long as {{install}} is after it on the command line? Ha, but then the packaging 
step doesn't work.

If anyone knows what's going on, please help. This is weird.

> Fix javadoc plugin configuration
> 
>
> Key: TIKA-4281
> URL: https://issues.apache.org/jira/browse/TIKA-4281
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Trivial
>
> We added <sourcepath>src/main/java</sourcepath> to our parent pom back in 
> April when we updated the Apache parent pom. This line prevented the 
> generation of javadocs for the 3.0.0-BETA2 release. I don't know if we need 
> that line with our current version of the ASF parent pom.
> Let's figure it out.





[jira] [Commented] (TIKA-4281) Fix javadoc plugin configuration

2024-07-17 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17866742#comment-17866742
 ] 

Tim Allison commented on TIKA-4281:
---

Well, that didn't work: {{javadoc: error - No source files for package ...}}. 
This is likely the problem that led to adding the sourcepath.

What I can't figure out is that {{javadocs:aggregate}} ran without any 
complaints even in debug mode (that I could find in the voluminous logs) but 
left nothing in {{target/site}}. When I removed the <sourcepath> line in the 
3.0.0-BETA2 release's pom, {{javadocs:aggregate}} worked as expected. However, 
now we're getting that javadoc error in ci/cd, and I'm getting a related one -- 
just a different package -- locally.

> Fix javadoc plugin configuration
> 
>
> Key: TIKA-4281
> URL: https://issues.apache.org/jira/browse/TIKA-4281
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Trivial
>
> We added <sourcepath>src/main/java</sourcepath> to our parent pom back in 
> April when we updated the Apache parent pom. This line prevented the 
> generation of javadocs for the 3.0.0-BETA2 release. I don't know if we need 
> that line with our current version of the ASF parent pom.
> Let's figure it out.





[jira] [Created] (TIKA-4281) Fix javadoc plugin configuration

2024-07-16 Thread Tim Allison (Jira)
Tim Allison created TIKA-4281:
-

 Summary: Fix javadoc plugin configuration
 Key: TIKA-4281
 URL: https://issues.apache.org/jira/browse/TIKA-4281
 Project: Tika
  Issue Type: Task
Reporter: Tim Allison


We added <sourcepath>src/main/java</sourcepath> to our parent pom back in April 
when we updated the Apache parent pom. This line prevented the generation of 
javadocs for the 3.0.0-BETA2 release. I don't know if we need that line with 
our current version of the ASF parent pom.

Let's figure it out.





[jira] [Updated] (TIKA-4280) Tasks for the 3.0.0 release

2024-07-16 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-4280:
--
Description: 
I'm too lazy to open separate tickets. Please do so if desired.

Some items:
* Before releasing the real 3.0.0 we need to remove any "-M" dependencies
* Decide about the ffmpeg issue and the hdf5 issue
* Run the regression tests vs 2.9.x
* Convert tika-grpc to use the dependency plugin instead of the shade plugin
* Turn javadocs back on. I got errors during the deploy process because javadoc 
needed the auto-generated code ("cannot find symbol DeleteFetcherRequest"). We 
need to enable javadocs for the rest of the project.

Other things? Thank you [~tilman] for the first two!

  was:
I'm too lazy to open separate tickets. Please do so if desired.

Some items:
* Before releasing the real 3.0.0 we need to remove any "-M" dependencies
* Decide about the ffmpeg issue and the hdf5 issue
* Run the regression tests vs 2.9.x
* Convert tika-grpc to use the dependency plugin instead of the shade plugin
* Turn javadocs back on. I got errors during the deploy process because javadoc 
did not like the auto-generated code. We need to enable javadocs for the rest 
of the project.

Other things? Thank you [~tilman] for the first two!


> Tasks for the 3.0.0 release
> ---
>
> Key: TIKA-4280
> URL: https://issues.apache.org/jira/browse/TIKA-4280
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> I'm too lazy to open separate tickets. Please do so if desired.
> Some items:
> * Before releasing the real 3.0.0 we need to remove any "-M" dependencies
> * Decide about the ffmpeg issue and the hdf5 issue
> * Run the regression tests vs 2.9.x
> * Convert tika-grpc to use the dependency plugin instead of the shade plugin
> * Turn javadocs back on. I got errors during the deploy process because 
> javadoc needed the auto-generated code ("cannot find symbol 
> DeleteFetcherRequest"). We need to enable javadocs for the rest of the 
> project.
> Other things? Thank you [~tilman] for the first two!





[jira] [Updated] (TIKA-4280) Tasks for the 3.0.0 release

2024-07-16 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-4280:
--
Description: 
I'm too lazy to open separate tickets. Please do so if desired.

Some items:
* Before releasing the real 3.0.0 we need to remove any "-M" dependencies
* Decide about the ffmpeg issue and the hdf5 issue
* Run the regression tests vs 2.9.x
* Convert tika-grpc to use the dependency plugin instead of the shade plugin
* Turn javadocs back on. I got errors during the deploy process because javadoc 
did not like the auto-generated code. We need to enable javadocs for the rest 
of the project.

Other things? Thank you [~tilman] for the first two!

  was:
I'm too lazy to open separate tickets. Please do so if desired.

Some items:
* Before releasing the real 3.0.0 we need to remove any "-M" dependencies
* Decide about the ffmpeg issue and the hdf5 issue
* Run the regression tests vs 2.9.x
* Convert tika-grpc to use the dependency plugin instead of the shade plugin

Other things? Thank you [~tilman] for the first two!


> Tasks for the 3.0.0 release
> ---
>
> Key: TIKA-4280
> URL: https://issues.apache.org/jira/browse/TIKA-4280
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> I'm too lazy to open separate tickets. Please do so if desired.
> Some items:
> * Before releasing the real 3.0.0 we need to remove any "-M" dependencies
> * Decide about the ffmpeg issue and the hdf5 issue
> * Run the regression tests vs 2.9.x
> * Convert tika-grpc to use the dependency plugin instead of the shade plugin
> * Turn javadocs back on. I got errors during the deploy process because 
> javadoc did not like the auto-generated code. We need to enable javadocs for 
> the rest of the project.
> Other things? Thank you [~tilman] for the first two!





[jira] [Created] (TIKA-4280) Tasks for the 3.0.0 release

2024-07-15 Thread Tim Allison (Jira)
Tim Allison created TIKA-4280:
-

 Summary: Tasks for the 3.0.0 release
 Key: TIKA-4280
 URL: https://issues.apache.org/jira/browse/TIKA-4280
 Project: Tika
  Issue Type: Task
Reporter: Tim Allison


I'm too lazy to open separate tickets. Please do so if desired.

Some items:
* Before releasing the real 3.0.0 we need to remove any "-M" dependencies
* Decide about the ffmpeg issue and the hdf5 issue
* Run the regression tests vs 2.9.x
* Convert tika-grpc to use the dependency plugin instead of the shade plugin

Other things? Thank you [~tilman] for the first two!





[jira] [Commented] (TIKA-4278) TextAndCSVParser doesn't detect semicolon separated file

2024-07-15 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17866075#comment-17866075
 ] 

Tim Allison commented on TIKA-4278:
---

Thank you, [~tilman], y, that's probably an oversight on my part in the initial 
commits. Thank you for working on this.

> TextAndCSVParser doesn't detect semicolon separated file
> 
>
> Key: TIKA-4278
> URL: https://issues.apache.org/jira/browse/TIKA-4278
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.9.2
>Reporter: Tilman Hausherr
>Assignee: Tilman Hausherr
>Priority: Major
>  Labels: csv, csvparser
> Fix For: 3.0.0, 2.9.3
>
> Attachments: reports_csv_2.9.2_vs_2.9.3.tar.xz
>
>
> I ran the code from the attached SO issue and yes it doesn't detect semicolon 
> separated files. The reason is this line in {{TextAndCSVParser.java}}:
> {code:java}
> private static final char[] DEFAULT_DELIMITERS = new char[]{',', '\t'};
> {code}
> This is later used by {{CSVSniffer}}. For some reason the other delimiters 
> (pipe, colon and semicolon) aren't in that array, although they are in 
> {{CHAR_TO_STRING_DELIMITER_MAP}}. I modified {{DEFAULT_DELIMITERS}} and now 
> it works for semicolon.
> Can I change this by adding the missing delimiters or was there a reason that 
> I missed? Proposed change would change CSVSniffer so that delimiters is a set 
> and then pass {{CHAR_TO_STRING_DELIMITER_MAP.keySet()}}.
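
The proposal above can be illustrated with a minimal sketch. This is not Tika's actual {{CSVSniffer}}: the class {{DelimiterSniffSketch}}, the map {{DELIMITER_NAMES}} (standing in for something shaped like {{CHAR_TO_STRING_DELIMITER_MAP}}) and the simple counting heuristic are all assumptions made for illustration. It only shows why a delimiter that is missing from the candidate set can never be detected, and how passing the map's key set would include it.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical, simplified stand-in for a CSV sniffer; not Tika's CSVSniffer. */
class DelimiterSniffSketch {

    // Assumed shape of a delimiter map similar to CHAR_TO_STRING_DELIMITER_MAP.
    static final Map<Character, String> DELIMITER_NAMES = new LinkedHashMap<>();

    static {
        DELIMITER_NAMES.put(',', "comma");
        DELIMITER_NAMES.put('\t', "tab");
        DELIMITER_NAMES.put('|', "pipe");
        DELIMITER_NAMES.put(';', "semicolon");
        DELIMITER_NAMES.put(':', "colon");
    }

    /**
     * Returns the candidate that occurs the same non-zero number of times on
     * every non-empty line, preferring the highest count; null if none qualifies.
     */
    static Character sniff(String text, Set<Character> candidates) {
        Character best = null;
        int bestCount = 0;
        for (char candidate : candidates) {
            int perLine = -1;
            boolean consistent = true;
            for (String line : text.split("\\R")) {
                if (line.isEmpty()) {
                    continue;
                }
                int count = 0;
                for (int i = 0; i < line.length(); i++) {
                    if (line.charAt(i) == candidate) {
                        count++;
                    }
                }
                if (perLine == -1) {
                    perLine = count;
                } else if (perLine != count) {
                    consistent = false;
                    break;
                }
            }
            if (consistent && perLine > bestCount) {
                best = candidate;
                bestCount = perLine;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String csv = "name;city;age\nAda;London;36\nAlan;Wilmslow;41\n";
        // Only comma and tab are candidates, so the semicolon cannot be found.
        System.out.println(sniff(csv, Set.of(',', '\t')));
        // With the map's key set as candidates, the semicolon is considered.
        System.out.println(sniff(csv, DELIMITER_NAMES.keySet()));
    }
}
```

Taking the candidates as a set keeps a single source of truth for the supported delimiters, which is the point of the proposed change.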





[jira] [Created] (TIKA-4275) Make tika-grpc a top-level module

2024-07-09 Thread Tim Allison (Jira)
Tim Allison created TIKA-4275:
-

 Summary: Make tika-grpc a top-level module
 Key: TIKA-4275
 URL: https://issues.apache.org/jira/browse/TIKA-4275
 Project: Tika
  Issue Type: Task
Reporter: Tim Allison


I'd like to move tika-grpc to the top level to be at the same level as 
tika-server. Any objections?





[jira] [Commented] (TIKA-4272) create tika docker image for tika-grpc

2024-06-26 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17860241#comment-17860241
 ] 

Tim Allison commented on TIKA-4272:
---

Y, I concur, we should have a completely separate image.

> create tika docker image for tika-grpc
> --
>
> Key: TIKA-4272
> URL: https://issues.apache.org/jira/browse/TIKA-4272
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-pipes
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> Now that the tika-grpc branch has been merged to main, we need a tika-grpc 
> server image. 
> I thought for a bit about using the same tika docker image as we already use, 
> but that is probably not a good idea because there are vastly different jar 
> files needed for tika-grpc.





[jira] [Commented] (TIKA-4251) [DISCUSS] move to cosium's git-code-format-maven-plugin with google-java-format

2024-06-25 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17860035#comment-17860035
 ] 

Tim Allison commented on TIKA-4251:
---

W00t!

> [DISCUSS] move to cosium's git-code-format-maven-plugin with 
> google-java-format
> ---
>
> Key: TIKA-4251
> URL: https://issues.apache.org/jira/browse/TIKA-4251
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> I was recently working a bit on incubator-stormcrawler, and I noticed that 
> they are using cosium's git-code-format-maven-plugin: 
> https://github.com/Cosium/git-code-format-maven-plugin
> I was initially annoyed that I couldn't quickly figure out what I had to fix 
> to make the linter happy, but then I realized there was a magic command: 
> {{mvn git-code-format:format-code}} which just fixed the code so that the 
> linter passed. 
> The one drawback I found is that it does not fix nor does it alert on 
> wildcard imports.  We could still use checkstyle for that but only have one 
> rule for checkstyle.
> The other drawback is that there is not a lot of room for variation from 
> google's style. This may actually be a benefit, too, of course.
> I just ran this on {{tika-core}} here: 
> https://github.com/apache/tika/tree/google-java-format
> What would you think about making this change for 3.x?
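
As a rough illustration of what a single wildcard-import rule would need to catch, here is a small sketch. It is neither checkstyle nor the cosium plugin: the class {{WildcardImportSketch}} and its regex are hypothetical, and a real check would parse the source rather than pattern-match it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that lists wildcard imports found in Java source text. */
class WildcardImportSketch {

    // Matches "import a.b.*;" and "import static a.b.C.*;" at the start of a line.
    private static final Pattern WILDCARD_IMPORT = Pattern.compile(
            "^\\s*import\\s+(?:static\\s+)?([\\w.]+\\.\\*)\\s*;", Pattern.MULTILINE);

    static List<String> findWildcardImports(String source) {
        List<String> hits = new ArrayList<>();
        Matcher m = WILDCARD_IMPORT.matcher(source);
        while (m.find()) {
            hits.add(m.group(1));
        }
        return hits;
    }

    public static void main(String[] args) {
        String source = "package demo;\n"
                + "import java.util.*;\n"
                + "import java.io.IOException;\n"
                + "import static org.junit.Assert.*;\n"
                + "class Demo {}\n";
        // Lists the two wildcard imports and skips the explicit one.
        System.out.println(findWildcardImports(source));
    }
}
```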





[jira] [Commented] (TIKA-4251) [DISCUSS] move to cosium's git-code-format-maven-plugin with google-java-format

2024-06-25 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17860020#comment-17860020
 ] 

Tim Allison commented on TIKA-4251:
---

Sounds great. My personal preference would be to move away from our custom 
formatting to a standard, maybe google? I'd also like to forbid wildcarding, if 
possible.

Fellow devs, any objections? Is that possible [~ndipiazza]?



> [DISCUSS] move to cosium's git-code-format-maven-plugin with 
> google-java-format
> ---
>
> Key: TIKA-4251
> URL: https://issues.apache.org/jira/browse/TIKA-4251
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> I was recently working a bit on incubator-stormcrawler, and I noticed that 
> they are using cosium's git-code-format-maven-plugin: 
> https://github.com/Cosium/git-code-format-maven-plugin
> I was initially annoyed that I couldn't quickly figure out what I had to fix 
> to make the linter happy, but then I realized there was a magic command: 
> {{mvn git-code-format:format-code}} which just fixed the code so that the 
> linter passed. 
> The one drawback I found is that it does not fix nor does it alert on 
> wildcard imports.  We could still use checkstyle for that but only have one 
> rule for checkstyle.
> The other drawback is that there is not a lot of room for variation from 
> google's style. This may actually be a benefit, too, of course.
> I just ran this on {{tika-core}} here: 
> https://github.com/apache/tika/tree/google-java-format
> What would you think about making this change for 3.x?





[jira] [Commented] (TIKA-4251) [DISCUSS] move to cosium's git-code-format-maven-plugin with google-java-format

2024-06-25 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17860007#comment-17860007
 ] 

Tim Allison commented on TIKA-4251:
---

>  we eat the 1-time-format cost
That's where the vulnerability is. That one huge one-time commit that'll be too 
big to review. :D

Y, I completely agree, and I'm eager to move forward...with fingers crossed. If 
I never get another checkstyle failed build, I will be happy. :D

> [DISCUSS] move to cosium's git-code-format-maven-plugin with 
> google-java-format
> ---
>
> Key: TIKA-4251
> URL: https://issues.apache.org/jira/browse/TIKA-4251
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> I was recently working a bit on incubator-stormcrawler, and I noticed that 
> they are using cosium's git-code-format-maven-plugin: 
> https://github.com/Cosium/git-code-format-maven-plugin
> I was initially annoyed that I couldn't quickly figure out what I had to fix 
> to make the linter happy, but then I realized there was a magic command: 
> {{mvn git-code-format:format-code}} which just fixed the code so that the 
> linter passed. 
> The one drawback I found is that it does not fix nor does it alert on 
> wildcard imports.  We could still use checkstyle for that but only have one 
> rule for checkstyle.
> The other drawback is that there is not a lot of room for variation from 
> google's style. This may actually be a benefit, too, of course.
> I just ran this on {{tika-core}} here: 
> https://github.com/apache/tika/tree/google-java-format
> What would you think about making this change for 3.x?





[jira] [Commented] (TIKA-4251) [DISCUSS] move to cosium's git-code-format-maven-plugin with google-java-format

2024-06-25 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1785#comment-1785
 ] 

Tim Allison commented on TIKA-4251:
---

Makes sense. Tilman's observation is legit, and I don't see a way around it. We 
took that risk with my commits to get checkstyle to work, but there's now a new 
supply chain vuln by allowing the plugin to make more changes than we can 
reasonably review.

Cross our fingers and hope?

> [DISCUSS] move to cosium's git-code-format-maven-plugin with 
> google-java-format
> ---
>
> Key: TIKA-4251
> URL: https://issues.apache.org/jira/browse/TIKA-4251
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> I was recently working a bit on incubator-stormcrawler, and I noticed that 
> they are using cosium's git-code-format-maven-plugin: 
> https://github.com/Cosium/git-code-format-maven-plugin
> I was initially annoyed that I couldn't quickly figure out what I had to fix 
> to make the linter happy, but then I realized there was a magic command: 
> {{mvn git-code-format:format-code}} which just fixed the code so that the 
> linter passed. 
> The one drawback I found is that it does not fix nor does it alert on 
> wildcard imports.  We could still use checkstyle for that but only have one 
> rule for checkstyle.
> The other drawback is that there is not a lot of room for variation from 
> google's style. This may actually be a benefit, too, of course.
> I just ran this on {{tika-core}} here: 
> https://github.com/apache/tika/tree/google-java-format
> What would you think about making this change for 3.x?





[jira] [Comment Edited] (TIKA-4251) [DISCUSS] move to cosium's git-code-format-maven-plugin with google-java-format

2024-06-25 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17859739#comment-17859739
 ] 

Tim Allison edited comment on TIKA-4251 at 6/25/24 6:19 PM:


Y. I agree. When I started with checkstyle, I had to modify a lot of files. Any 
recs for mitigating this?


was (Author: talli...@mitre.org):
Y. I agree. When I started with checkstyle, it modified nearly every file. Any 
recs for mitigating this?

> [DISCUSS] move to cosium's git-code-format-maven-plugin with 
> google-java-format
> ---
>
> Key: TIKA-4251
> URL: https://issues.apache.org/jira/browse/TIKA-4251
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> I was recently working a bit on incubator-stormcrawler, and I noticed that 
> they are using cosium's git-code-format-maven-plugin: 
> https://github.com/Cosium/git-code-format-maven-plugin
> I was initially annoyed that I couldn't quickly figure out what I had to fix 
> to make the linter happy, but then I realized there was a magic command: 
> {{mvn git-code-format:format-code}} which just fixed the code so that the 
> linter passed. 
> The one drawback I found is that it does not fix nor does it alert on 
> wildcard imports.  We could still use checkstyle for that but only have one 
> rule for checkstyle.
> The other drawback is that there is not a lot of room for variation from 
> google's style. This may actually be a benefit, too, of course.
> I just ran this on {{tika-core}} here: 
> https://github.com/apache/tika/tree/google-java-format
> What would you think about making this change for 3.x?





[jira] [Commented] (TIKA-4251) [DISCUSS] move to cosium's git-code-format-maven-plugin with google-java-format

2024-06-24 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17859739#comment-17859739
 ] 

Tim Allison commented on TIKA-4251:
---

Y. I agree. When I started with checkstyle, it modified nearly every file. Any 
recs for mitigating this?

> [DISCUSS] move to cosium's git-code-format-maven-plugin with 
> google-java-format
> ---
>
> Key: TIKA-4251
> URL: https://issues.apache.org/jira/browse/TIKA-4251
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> I was recently working a bit on incubator-stormcrawler, and I noticed that 
> they are using cosium's git-code-format-maven-plugin: 
> https://github.com/Cosium/git-code-format-maven-plugin
> I was initially annoyed that I couldn't quickly figure out what I had to fix 
> to make the linter happy, but then I realized there was a magic command: 
> {{mvn git-code-format:format-code}} which just fixed the code so that the 
> linter passed. 
> The one drawback I found is that it does not fix nor does it alert on 
> wildcard imports.  We could still use checkstyle for that but only have one 
> rule for checkstyle.
> The other drawback is that there is not a lot of room for variation from 
> google's style. This may actually be a benefit, too, of course.
> I just ran this on {{tika-core}} here: 
> https://github.com/apache/tika/tree/google-java-format
> What would you think about making this change for 3.x?





[jira] [Commented] (TIKA-4243) tika configuration overhaul

2024-06-07 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17853241#comment-17853241
 ] 

Tim Allison commented on TIKA-4243:
---

This is what the json currently looks like.

{code:json}
{
  "emitter": "fse",
  "fetchKey": "testPDFTwoTextBoxes.pdf",
  "fetcher": "fsf",
  "id": "myId",
  "onParseException": "emit",
  "parseContext": {
    "org.apache.tika.parser.pdf.PDFParserConfig": {
      "_class": "org.apache.tika.parser.pdf.PDFParserConfig",
      "accessChecker": {
        "_class": "org.apache.tika.parser.pdf.AccessChecker"
      },
      "averageCharTolerance": 0.3,
      "catchIntermediateIOExceptions": true,
      "detectAngles": false,
      "dropThreshold": 2.5,
      "enableAutoSpace": true,
      "extractAcroFormContent": true,
      "extractActions": false,
      "extractAnnotationText": true,
      "extractBookmarksText": true,
      "extractFontNames": false,
      "extractIncrementalUpdateInfo": false,
      "extractInlineImages": false,
      "extractMarkedContent": false,
      "extractUniqueInlineImagesOnly": true,
      "ifXFAExtractOnlyXFA": false,
      "imageGraphicsEngineFactory": {
        "_class": "org.apache.tika.parser.pdf.image.ImageGraphicsEngineFactory"
      },
      "imageStrategy": "NONE",
      "maxIncrementalUpdates": 10,
      "maxMainMemoryBytes": 536870912,
      "ocrDPI": 300,
      "ocrImageFormatName": "png",
      "ocrImageQuality": 1.0,
      "ocrImageType": "GRAY",
      "ocrRenderingStrategy": "ALL",
      "ocrStrategy": "AUTO",
      "parseIncrementalUpdates": false,
      "renderer": null,
      "setKCMS": false,
      "sortByPosition": true,
      "spacingTolerance": 0.5,
      "suppressDuplicateOverlappingText": false,
      "throwOnEncryptedPayload": false
    }
  }
}
{code}


> tika configuration overhaul
> ---
>
> Key: TIKA-4243
> URL: https://issues.apache.org/jira/browse/TIKA-4243
> Project: Tika
>  Issue Type: New Feature
>  Components: config
>Affects Versions: 3.0.0
>Reporter: Nicholas DiPiazza
>Priority: Major
> Fix For: 3.0.0
>
>
> In 3.0.0 when dealing with Tika, it would greatly help to have a Typed 
> Configuration schema. 
> In 3.x can we remove the old way of doing configs and replace with Json 
> Schema?
> Json Schema can be converted to Pojos using a maven plugin 
> [https://github.com/joelittlejohn/jsonschema2pojo]
> This automatically creates a Java Pojo model we can use for the configs. 
> This can allow for the legacy tika-config XML to be read and converted to the 
> new pojos easily using an XML mapper so that users don't have to use JSON 
> configurations yet if they do not want.
> When complete, configurations can be set as XML, JSON or YAML
> tika-config.xml
> tika-config.json
> tika-config.yaml
> Replace all instances of tika config annotations that used the old syntax, 
> and replace with the Pojo model serialized from the xml/json/yaml.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4243) tika configuration overhaul

2024-06-07 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17853240#comment-17853240
 ] 

Tim Allison commented on TIKA-4243:
---

I opened a PR with some cleanup, fixes and a new unit test that confirms that 
the PDFParserConfig actually works in the pipes endpoint in tika-server.

> tika configuration overhaul
> ---
>
> Key: TIKA-4243
> URL: https://issues.apache.org/jira/browse/TIKA-4243
> Project: Tika
>  Issue Type: New Feature
>  Components: config
>Affects Versions: 3.0.0
>Reporter: Nicholas DiPiazza
>Priority: Major
> Fix For: 3.0.0
>
>
> In 3.0.0 when dealing with Tika, it would greatly help to have a Typed 
> Configuration schema. 
> In 3.x can we remove the old way of doing configs and replace with Json 
> Schema?
> Json Schema can be converted to Pojos using a maven plugin 
> [https://github.com/joelittlejohn/jsonschema2pojo]
> This automatically creates a Java Pojo model we can use for the configs. 
> This can allow for the legacy tika-config XML to be read and converted to the 
> new pojos easily using an XML mapper so that users don't have to use JSON 
> configurations yet if they do not want.
> When complete, configurations can be set as XML, JSON or YAML
> tika-config.xml
> tika-config.json
> tika-config.yaml
> Replace all instances of tika config annotations that used the old syntax, 
> and replace with the Pojo model serialized from the xml/json/yaml.





[jira] [Resolved] (TIKA-4268) Use title for embedded resource path in embedded msg files

2024-06-07 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-4268.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

> Use title for embedded resource path in embedded msg files
> --
>
> Key: TIKA-4268
> URL: https://issues.apache.org/jira/browse/TIKA-4268
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Trivial
> Fix For: 3.0.0
>
>
> If an msg file is embedded in an msg file, the embedded_resource_path 
> currently looks like this: {{/__substg1.0_3701000D.msg/attachment.docx}}. 
> A more human-friendly path would be: {{/Test Attachment.msg/attachment.docx}}
> We should update our PST parser as well.
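
To make the intended behavior concrete, here is a minimal sketch of choosing the path segment. It is not Tika's implementation: the class {{EmbeddedPathSketch}}, its method names, the choice to append ".msg" to the title and the replacement of "/" inside titles are all assumptions for illustration.

```java
import java.util.Locale;

/** Hypothetical helper for building embedded resource paths; not Tika's code. */
class EmbeddedPathSketch {

    /** Uses the title as the path segment when present, else the internal storage name. */
    static String segmentName(String title, String storageName) {
        if (title == null || title.trim().isEmpty()) {
            return storageName;
        }
        // Assumption: keep '/' reserved as the path separator.
        String name = title.trim().replace('/', '_');
        // Assumption: the segment should carry the .msg extension.
        return name.toLowerCase(Locale.ROOT).endsWith(".msg") ? name : name + ".msg";
    }

    static String childPath(String parentPath, String childName) {
        return parentPath + "/" + childName;
    }

    public static void main(String[] args) {
        String storage = "__substg1.0_3701000D.msg";
        // With a title: /Test Attachment.msg/attachment.docx
        System.out.println(childPath("/" + segmentName("Test Attachment", storage), "attachment.docx"));
        // Without a title, fall back to the storage name.
        System.out.println(childPath("/" + segmentName(null, storage), "attachment.docx"));
    }
}
```

Falling back to the storage name keeps paths stable for messages that have no usable title.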





[jira] [Commented] (TIKA-4251) [DISCUSS] move to cosium's git-code-format-maven-plugin with google-java-format

2024-06-07 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17853157#comment-17853157
 ] 

Tim Allison commented on TIKA-4251:
---

Unless there are any objections, I'll likely move forward with this early this 
coming week. If this will break any outstanding PRs or other work or if anyone 
thinks this is a bad idea, please let me know.

This will only affect main/3.x. I am not going to back port this to 2.x.

> [DISCUSS] move to cosium's git-code-format-maven-plugin with 
> google-java-format
> ---
>
> Key: TIKA-4251
> URL: https://issues.apache.org/jira/browse/TIKA-4251
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> I was recently working a bit on incubator-stormcrawler, and I noticed that 
> they are using cosium's git-code-format-maven-plugin: 
> https://github.com/Cosium/git-code-format-maven-plugin
> I was initially annoyed that I couldn't quickly figure out what I had to fix 
> to make the linter happy, but then I realized there was a magic command: 
> {{mvn git-code-format:format-code}} which just fixed the code so that the 
> linter passed. 
> The one drawback I found is that it does not fix nor does it alert on 
> wildcard imports.  We could still use checkstyle for that but only have one 
> rule for checkstyle.
> The other drawback is that there is not a lot of room for variation from 
> google's style. This may actually be a benefit, too, of course.
> I just ran this on {{tika-core}} here: 
> https://github.com/apache/tika/tree/google-java-format
> What would you think about making this change for 3.x?





[jira] [Created] (TIKA-4268) Use title for embedded resource path in embedded msg files

2024-06-07 Thread Tim Allison (Jira)
Tim Allison created TIKA-4268:
-

 Summary: Use title for embedded resource path in embedded msg files
 Key: TIKA-4268
 URL: https://issues.apache.org/jira/browse/TIKA-4268
 Project: Tika
  Issue Type: Task
Reporter: Tim Allison


If an msg file is embedded in an msg file, the embedded_resource_path currently 
looks like this: {{/__substg1.0_3701000D.msg/attachment.docx}}. 

A more human-friendly path would be: {{/Test Attachment.msg/attachment.docx}}

We should update our PST parser as well.






[jira] [Comment Edited] (TIKA-4243) tika configuration overhaul

2024-06-06 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17852876#comment-17852876
 ] 

Tim Allison edited comment on TIKA-4243 at 6/6/24 5:39 PM:
---

I think our joint recent PR on TIKA-4252 accomplishes the goals of this ticket. 
There's more work, but I think we can close this out.

If we do want to head down the jsonschema route later, let's open a new ticket?


was (Author: talli...@mitre.org):
I think our joint recent PR on TIKA-4252 accomplishes the goals of this ticket. 
There's more work, but I think we can close this out.

If we do want to head down the jsonschema root later, let's open a new ticket?

> tika configuration overhaul
> ---
>
> Key: TIKA-4243
> URL: https://issues.apache.org/jira/browse/TIKA-4243
> Project: Tika
>  Issue Type: New Feature
>  Components: config
>Affects Versions: 3.0.0
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> In 3.0.0 when dealing with Tika, it would greatly help to have a Typed 
> Configuration schema. 
> In 3.x can we remove the old way of doing configs and replace with Json 
> Schema?
> Json Schema can be converted to Pojos using a maven plugin 
> [https://github.com/joelittlejohn/jsonschema2pojo]
> This automatically creates a Java Pojo model we can use for the configs. 
> This can allow for the legacy tika-config XML to be read and converted to the 
> new pojos easily using an XML mapper so that users don't have to use JSON 
> configurations yet if they do not want.
> When complete, configurations can be set as XML, JSON or YAML
> tika-config.xml
> tika-config.json
> tika-config.yaml
> Replace all instances of tika config annotations that used the old syntax, 
> and replace with the Pojo model serialized from the xml/json/yaml.





[jira] [Commented] (TIKA-4243) tika configuration overhaul

2024-06-06 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17852876#comment-17852876
 ] 

Tim Allison commented on TIKA-4243:
---

I think our joint recent PR on TIKA-4252 accomplishes the goals of this ticket. 
There's more work, but I think we can close this out.

If we do want to head down the jsonschema route later, let's open a new ticket?

> tika configuration overhaul
> ---
>
> Key: TIKA-4243
> URL: https://issues.apache.org/jira/browse/TIKA-4243
> Project: Tika
>  Issue Type: New Feature
>  Components: config
>Affects Versions: 3.0.0
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> In 3.0.0 when dealing with Tika, it would greatly help to have a Typed 
> Configuration schema. 
> In 3.x can we remove the old way of doing configs and replace with Json 
> Schema?
> Json Schema can be converted to Pojos using a maven plugin 
> [https://github.com/joelittlejohn/jsonschema2pojo]
> This automatically creates a Java Pojo model we can use for the configs. 
> This can allow for the legacy tika-config XML to be read and converted to the 
> new pojos easily using an XML mapper so that users don't have to use JSON 
> configurations yet if they do not want.
> When complete, configurations can be set as XML, JSON or YAML
> tika-config.xml
> tika-config.json
> tika-config.yaml
> Replace all instances of tika config annotations that used the old syntax, 
> and replace with the Pojo model serialized from the xml/json/yaml.





[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-06-06 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17852874#comment-17852874
 ] 

Tim Allison commented on TIKA-4252:
---

K. I think we're at "good enough" here. [~ndipiazza], thank you and take it 
away!

> PipesClient#process - seems to lose the Fetch input metadata?
> -
>
> Key: TIKA-4252
> URL: https://issues.apache.org/jira/browse/TIKA-4252
> Project: Tika
>  Issue Type: Bug
>Reporter: Nicholas DiPiazza
>Priority: Major
> Fix For: 3.0.0
>
>
> When calling:
> PipesResult pipesResult = pipesClient.process(
>     new FetchEmitTuple(request.getFetchKey(),
>         new FetchKey(fetcher.getName(), request.getFetchKey()),
>         new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG,
>         FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));
> the tikaMetadata is not present in the fetch data when the fetch method is 
> called.
>
> It's OK through this part:
> UnsynchronizedByteArrayOutputStream bos =
>     UnsynchronizedByteArrayOutputStream.builder().get();
> try (ObjectOutputStream objectOutputStream = new ObjectOutputStream(bos)) {
>     objectOutputStream.writeObject(t);
> }
> byte[] bytes = bos.toByteArray();
> output.write(CALL.getByte());
> output.writeInt(bytes.length);
> output.write(bytes);
> output.flush();
>
> I verified the bytes have the expected metadata from that point.
>
> UPDATE: found the issue.
> org.apache.tika.pipes.PipesServer#parseFromTuple is using a new Metadata when 
> it should only use empty metadata if the fetch tuple metadata is null.




[jira] [Resolved] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-06-06 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-4252.
---
Resolution: Fixed






[jira] [Commented] (TIKA-4243) tika configuration overhaul

2024-06-06 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17852808#comment-17852808
 ] 

Tim Allison commented on TIKA-4243:
---

Oh, and documentation, lots of documentation. :LOL:

> tika configuration overhaul
> ---
>
> Key: TIKA-4243
> URL: https://issues.apache.org/jira/browse/TIKA-4243
> Project: Tika
> Issue Type: New Feature
> Components: config
> Affects Versions: 3.0.0
> Reporter: Nicholas DiPiazza
> Priority: Major
>
> In 3.0.0 when dealing with Tika, it would greatly help to have a Typed 
> Configuration schema. 
> In 3.x can we remove the old way of doing configs and replace with Json 
> Schema?
> Json Schema can be converted to Pojos using a maven plugin 
> [https://github.com/joelittlejohn/jsonschema2pojo]
> This automatically creates a Java Pojo model we can use for the configs. 
> This can allow for the legacy tika-config XML to be read and converted to the 
> new pojos easily using an XML mapper so that users don't have to use JSON 
> configurations yet if they do not want.
> When complete, configurations can be set as XML, JSON or YAML
> tika-config.xml
> tika-config.json
> tika-config.yaml
> Replace all uses of the old-syntax tika config annotations with the Pojo 
> model deserialized from the xml/json/yaml.





[jira] [Comment Edited] (TIKA-4243) tika configuration overhaul

2024-06-06 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17852804#comment-17852804
 ] 

Tim Allison edited comment on TIKA-4243 at 6/6/24 2:11 PM:
---

Current status on TIKA-4243 branch -- works up through and including tika-app

Still need:
* better job of handling lists and maps as parameters and types.
* test tika-server pipes/ and async/ endpoints
* more unit tests in new serialization stuff

Ongoing needs:
* modify config objects so that they work with the serialization methods



was (Author: talli...@mitre.org):
Current status on TIKA-4243 -- works up through and including tika-app

Still need:
* better job of handling lists and maps as parameters and types.
* test tika-server pipes/ and async/ endpoints
* more unit tests in new serialization stuff

Ongoing needs:
* modify config objects so that they work with the serialization methods







[jira] [Commented] (TIKA-4243) tika configuration overhaul

2024-06-06 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17852804#comment-17852804
 ] 

Tim Allison commented on TIKA-4243:
---

Current status on TIKA-4243 -- works up through and including tika-app

Still need:
* better job of handling lists and maps as parameters and types.
* test tika-server pipes/ and async/ endpoints
* more unit tests in new serialization stuff

Ongoing needs:
* modify config objects so that they work with the serialization methods







[jira] [Commented] (TIKA-4243) tika configuration overhaul

2024-06-04 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17852098#comment-17852098
 ] 

Tim Allison commented on TIKA-4243:
---

Let me know if there are any objections to heading in this direction.






[jira] [Commented] (TIKA-4243) tika configuration overhaul

2024-06-04 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17852097#comment-17852097
 ] 

Tim Allison commented on TIKA-4243:
---

K, I chatted briefly with [~ndipiazza] this morning. Unless there are 
objections, the simplest way forward is to build our own for now, we think. It 
won't be much work given the stuff I already did for xml...famous last words. 
This will keep jackson annotations out of tika-core.

This will get us to 3.x and then [~ndipiazza] can refactor with the jsonschema 
stuff later.






[jira] [Comment Edited] (TIKA-4243) tika configuration overhaul

2024-06-03 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17851727#comment-17851727
 ] 

Tim Allison edited comment on TIKA-4243 at 6/3/24 5:10 PM:
---

I spent a bit of time trying to serialize ParseContext, and I now 
remember/newly appreciate what a challenge that is.

For one, everything has to be serializable, as we knew, with Jackson 
annotations or other Jackson-based methods.

I suspect that someone who really understands Jackson could do a good job of 
this. I know the basics, and I am not a Jackson expert.

There are two main challenges: inheritance and embedded objects (as opposed to 
parameterizable primitives).

Inheritance is complicated with Jackson. If we want to support, for example, 
{{{}parseContext.set(Parser.class, new EmptyParser()){}}}, we have to store the 
base class name as the key and the instantiated class. I think I found out how 
to do this with Jackson, but it is _messy_ (reference: 
[https://www.baeldung.com/jackson-inheritance#bd-subtype-handling-scenarios]).
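
To make the "base class name as the key plus the instantiated class" idea concrete, here is a rough sketch that does not use Jackson at all. Everything in it is hypothetical: Parser, EmptyParser and ContextSketch only mimic the shape of the Tika classes, and the describe/restore pair is an assumed round trip, not an existing API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins: Parser, EmptyParser and ContextSketch only mimic the
// shape of Tika's Parser, EmptyParser and ParseContext; they are not the real classes.
final class ContextSketch {

    interface Parser { }

    static final class EmptyParser implements Parser { }

    // Entries are keyed by the base type's class name.
    private final Map<String, Object> entries = new HashMap<>();

    <T> void set(Class<T> key, T value) {
        entries.put(key.getName(), value);
    }

    <T> T get(Class<T> key) {
        return key.cast(entries.get(key.getName()));
    }

    /** Serializable view: base type name mapped to the concrete class name. */
    Map<String, String> describe() {
        Map<String, String> view = new HashMap<>();
        entries.forEach((base, value) -> view.put(base, value.getClass().getName()));
        return view;
    }

    /** Rebuilds one entry from the two names, using a no-arg constructor. */
    void restore(String baseName, String concreteName) {
        try {
            Class<?> base = Class.forName(baseName);
            Object instance = Class.forName(concreteName).getDeclaredConstructor().newInstance();
            // cast() rejects a concrete class that is not a subtype of the base type.
            entries.put(base.getName(), base.cast(instance));
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot restore " + concreteName, e);
        }
    }

    public static void main(String[] args) {
        ContextSketch original = new ContextSketch();
        original.set(Parser.class, new EmptyParser());

        // Round trip through the name-only view.
        ContextSketch copy = new ContextSketch();
        original.describe().forEach(copy::restore);
        System.out.println(copy.get(Parser.class).getClass().getSimpleName());
    }
}
```

This only covers classes with a no-arg constructor and no state of their own; carrying the instance's properties along is the embedded-object problem discussed next.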

We'd want to deal with embedded objects for the obvious use cases of the 
CompositeDetectors, etc. where we want to specify a list of detectors. And, we 
want to be able to cover the cases of setting an object as a parameter – for 
example, setting some of the slightly more complex classes in the 
PDFParserConfig.

I'm wondering if it would be simpler to back off to a Map 
properties kind of thing where we identify the config class and then 
instantiate it for the ParseContext with the "properties". We're currently 
doing something like this in tika-server where we have custom serialization 
classes for each config we support (PDFParserConfig and 
TesseractOCRParserConfig) based on the http-headers. We'd want to extend this 
to handle inheritance and embedded objects...

Something along these lines in json:
{code:json}
{
  "settings": {
    "org.apache.tika.parser.pdf.PDFParserConfig.class": {
      "ocrDPI": 300,
      "sortByPosition": true,
      "org.apache.tika.parser.pdf.image.ImageGraphicsEngineFactory.class": {
        "_class": "com.tika.custom.OurCompanysFactory",
        "speed": "blazing",
        "dpi": 1000
      }
    },
    "org.apache.tika.parser.Parser": {
      "_class": "org.apache.tika.parser.EmptyParser"
    }
  }
}
{code}
Then we'd have a small bit of code (I'd hope?) that would take the settings and 
create the config class with the map of its values:

PDFParserConfig pdfParserConfig = new PDFParserConfig(Map)
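
A minimal sketch of what that "small bit of code" might look like, assuming bean-style setters on the config class. PdfConfig and fromMap are hypothetical names for illustration; the real PDFParserConfig has no Map constructor today, and nested objects and inheritance are not handled here.

```java
import java.lang.reflect.Method;
import java.util.LinkedHashMap;
import java.util.Map;

final class MapConfigSketch {

    /** Hypothetical config bean; its two properties mirror keys from the JSON above. */
    static final class PdfConfig {
        private int ocrDPI;
        private boolean sortByPosition;

        public void setOcrDPI(int ocrDPI) { this.ocrDPI = ocrDPI; }
        public void setSortByPosition(boolean sortByPosition) { this.sortByPosition = sortByPosition; }
        public int getOcrDPI() { return ocrDPI; }
        public boolean isSortByPosition() { return sortByPosition; }
    }

    /** Creates the config via its no-arg constructor and applies each map entry through a setter. */
    static <T> T fromMap(Class<T> type, Map<String, Object> properties) {
        try {
            T target = type.getDeclaredConstructor().newInstance();
            for (Map.Entry<String, Object> entry : properties.entrySet()) {
                String name = entry.getKey();
                String setter = "set" + Character.toUpperCase(name.charAt(0)) + name.substring(1);
                boolean applied = false;
                for (Method method : type.getMethods()) {
                    if (method.getName().equals(setter) && method.getParameterCount() == 1) {
                        // Reflection unboxes Integer/Boolean into int/boolean parameters.
                        method.invoke(target, entry.getValue());
                        applied = true;
                        break;
                    }
                }
                if (!applied) {
                    throw new IllegalArgumentException("Unknown property: " + name);
                }
            }
            return target;
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot configure " + type.getName(), e);
        }
    }

    public static void main(String[] args) {
        Map<String, Object> properties = new LinkedHashMap<>();
        properties.put("ocrDPI", 300);
        properties.put("sortByPosition", true);
        PdfConfig config = fromMap(PdfConfig.class, properties);
        System.out.println(config.getOcrDPI() + " " + config.isSortByPosition());
    }
}
```

Failing loudly on an unknown key is a deliberate choice in this sketch, so that a typo in a config file does not get silently ignored.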

 

*What I don't like about this is that we're back in the game of creating our 
own serialization framework. :(*


was (Author: talli...@mitre.org):
I spent a bit of time trying to serialize ParseContext, and I now 
remember/newly appreciate what a challenge that is.

For one, everything has to be serializable, as we knew, with Jackson 
annotations or other Jackson based methods.

I suspect that someone who really understands Jackson could do a good job of 
this. I know the basics, and I am not a Jackson expert.

There are two main challenges: inheritance and embedded objects (as opposed to 
parameterizable primitives).

Inheritance is complicated with Jackson. If we want to support, for example, 
{{{}parseContext.set(Parser.class, new EmptyParser()){}}}, we have to store the 
base class name as the key and the instantiated class. I think I found out how 
to do this with Jackson, but it is _messy_ (reference: 
[https://www.baeldung.com/jackson-inheritance#bd-subtype-handling-scenarios]).

We'd want to deal with embedded objects for the obvious use cases of the 
CompositeDetectors, etc. where we want to specify a list of detectors. And, we 
want to be able to cover the cases of setting an object as a parameter – for 
example, setting some of the slightly more complex classes in the 
PDFParserConfig.

I'm wondering if it would be simpler to backoff to a Map 
properties kind of thing where we identify the config class and then 
instantiate it for the ParseContext with the "properties". We're currently 
doing something like this in tika-server where we have custom serialization 
classes for each config we support (PDFParserConfig and 
TesseractOCRParserConfig) based on the http-headers. We'd want to extend this 
to handle inheritance and embedded objects...

Something along these lines in json:
{code:json}
{
  "settings": {
    "org.apache.tika.parser.pdf.PDFParserConfig": {
      "ocrDPI": 300,
      "sortByPosition": true
    },
    "org.apache.tika.parser.Parser": {
      "_class": "org.apache.tika.parser.EmptyParser"
    }
  }
}
{code}
Then we'd have a small bit of code (I'd hope?) that would take the settings and 
create the config class with the map of its values:

PDFParserConfig pdfParserConfig = new PDFParserConfig(Map)

 

*What I don't like about this is that we're back in the game of creating our 
own serialization framework. :

[jira] [Comment Edited] (TIKA-4243) tika configuration overhaul

2024-06-03 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17851727#comment-17851727
 ] 

Tim Allison edited comment on TIKA-4243 at 6/3/24 5:02 PM:
---

I spent a bit of time trying to serialize ParseContext, and I now 
remember/newly appreciate what a challenge that is.

For one, everything has to be serializable, as we knew, with Jackson 
annotations or other Jackson based methods.

I suspect that someone who really understands Jackson could do a good job of 
this. I know the basics, and I am not a Jackson expert.

There are two main challenges: inheritance and embedded objects (as opposed to 
parameterizable primitives).

Inheritance is complicated with Jackson. If we want to support, for example, 
{{{}parseContext.set(Parser.class, new EmptyParser()){}}}, we have to store the 
base class name as the key and the instantiated class. I think I found out how 
to do this with Jackson, but it is _messy_ (reference: 
[https://www.baeldung.com/jackson-inheritance#bd-subtype-handling-scenarios]).

We'd want to deal with embedded objects for the obvious use cases of the 
CompositeDetectors, etc. where we want to specify a list of detectors. And, we 
want to be able to cover the cases of setting an object as a parameter – for 
example, setting some of the slightly more complex classes in the 
PDFParserConfig.

I'm wondering if it would be simpler to back off to a Map 
properties kind of thing where we identify the config class and then 
instantiate it for the ParseContext with the "properties". We're currently 
doing something like this in tika-server where we have custom serialization 
classes for each config we support (PDFParserConfig and 
TesseractOCRParserConfig) based on the http-headers. We'd want to extend this 
to handle inheritance and embedded objects...

Something along these lines in json:
{code:json}
{
  "settings": {
    "org.apache.tika.parser.pdf.PDFParserConfig": {
      "ocrDPI": 300,
      "sortByPosition": true
    },
    "org.apache.tika.parser.Parser": {
      "_class": "org.apache.tika.parser.EmptyParser"
    }
  }
}
{code}
Then we'd have a small bit of code (I'd hope?) that would take the settings and 
create the config class with the map of its values:

PDFParserConfig pdfParserConfig = new PDFParserConfig(Map)

 

*What I don't like about this is that we're back in the game of creating our 
own serialization framework. :(*


was (Author: talli...@mitre.org):
I spent a bit of time trying to serialize ParseContext, and I now 
remember/newly appreciate what a challenge that is.

For one, everything has to be serializable, as we knew, with Jackson 
annotations or other Jackson based methods.

I suspect that someone who really understands Jackson could do a good job of 
this. I know the basics, and I am not a Jackson expert.

There are two main challenges: inheritance and embedded objects (as opposed to 
parameterizable primitives).

Inheritance is complicated with Jackson. If we want to support, for example, 
{{{}parseContext.set(Parser.class, new EmptyParser()){}}}, we have to store the 
base class name as the key and the instantiated class. I think I found out how 
to do this with Jackson, but it is _messy_ (reference: 
[https://www.baeldung.com/jackson-inheritance#bd-subtype-handling-scenarios]).

We'd want to deal with embedded objects for the obvious use cases of the 
CompositeDetectors, etc. where we want to specify a list of detectors. And, we 
want to be able to cover the cases of setting an object as a parameter – for 
example, setting some of the slightly more complex classes in the 
PDFParserConfig.

I'm wondering if it would be simpler to backoff to a Map 
properties kind of thing where we identify the config class and then 
instantiate it for the ParseContext with the "properties". We're currently 
doing something like this in tika-server where we have custom serialization 
classes for each config we support (PDFParserConfig and 
TesseractOCRParserConfig) based on the http-headers. We'd want to extend this 
to handle inheritance and embedded objects...

Something along these lines in json:
{code:json}
{
  "settings": {
    "org.apache.tika.parser.pdf.PDFParserConfig": {
      "ocrDPI": 300,
      "sortByPosition": true
    },
    "org.apache.tika.parser.Parser": {
      "_class": "org.apache.tika.parser.EmptyParser"
    }
  }
}
{code}
Then we'd have a small bit of code (I'd hope?) that would take the settings and 
create the config class with the map of its values:

PDFParserConfig pdfParserConfig = new PDFParserConfig(Map)

 

*What I don't like about this is that we're back in the game of creating our 
own serialization framework. :(*


[jira] [Comment Edited] (TIKA-4243) tika configuration overhaul

2024-06-03 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17851727#comment-17851727
 ] 

Tim Allison edited comment on TIKA-4243 at 6/3/24 5:02 PM:
---

I spent a bit of time trying to serialize ParseContext, and I now 
remember/newly appreciate what a challenge that is.

For one, everything has to be serializable, as we knew, with Jackson 
annotations or other Jackson based methods.

I suspect that someone who really understands Jackson could do a good job of 
this. I know the basics, and I am not a Jackson expert.

There are two main challenges: inheritance and embedded objects (as opposed to 
parameterizable primitives).

Inheritance is complicated with Jackson. If we want to support, for example, 
{{{}parseContext.set(Parser.class, new EmptyParser()){}}}, we have to store the 
base class name as the key and the instantiated class. I think I found out how 
to do this with Jackson, but it is _messy_ (reference: 
[https://www.baeldung.com/jackson-inheritance#bd-subtype-handling-scenarios]).

We'd want to deal with embedded objects for the obvious use cases of the 
CompositeDetectors, etc. where we want to specify a list of detectors. And, we 
want to be able to cover the cases of setting an object as a parameter – for 
example, setting some of the slightly more complex classes in the 
PDFParserConfig.

I'm wondering if it would be simpler to back off to a Map 
properties kind of thing where we identify the config class and then 
instantiate it for the ParseContext with the "properties". We're currently 
doing something like this in tika-server where we have custom serialization 
classes for each config we support (PDFParserConfig and 
TesseractOCRParserConfig) based on the http-headers. We'd want to extend this 
to handle inheritance and embedded objects...

Something along these lines in json:
{code:json}
{
  "settings": {
    "org.apache.tika.parser.pdf.PDFParserConfig": {
      "ocrDPI": 300,
      "sortByPosition": true
    },
    "org.apache.tika.parser.Parser": {
      "_class": "org.apache.tika.parser.EmptyParser"
    }
  }
}
{code}
Then we'd have a small bit of code (I'd hope?) that would take the settings and 
create the config class with the map of its values:

PDFParserConfig pdfParserConfig = new PDFParserConfig(Map)

 

*What I don't like about this is that we're back in the game of creating our 
own serialization framework. :(*


was (Author: talli...@mitre.org):
I spent a bit of time trying to serialize ParseContext, and I now 
remember/newly appreciate what a challenge that is.

For one, everything has to be serializable, as we knew, with Jackson 
annotations or other Jackson based methods.

I suspect that someone who really understands Jackson could do a good job of 
this. I know the basics, and I am not a Jackson expert.

There are two main challenges: inheritance and embedded objects (as opposed to 
parameterizable primitives).

Inheritance is complicated with Jackson. If we want to support, for example, 
{{{}parseContext.set(Parser.class, new EmptyParser()){}}}, we have to store the 
base class name as the key and the instantiated class. I think I found out how 
to do this with Jackson, but it is _messy_ (reference: 
[https://www.baeldung.com/jackson-inheritance#bd-subtype-handling-scenarios]).

We'd want to deal with embedded objects for the obvious use cases of the 
CompoundDetectors, etc. where we want to specify a list of detectors. And, we 
want to be able to cover the cases of setting an object as a parameter – for 
example, setting some of the slightly more complex classes in the 
PDFParserConfig.

I'm wondering if it would be simpler to backoff to a Map 
properties kind of thing where we identify the config class and then 
instantiate it for the ParseContext with the "properties". We're currently 
doing something like this in tika-server where we have custom serialization 
classes for each config we support (PDFParserConfig and 
TesseractOCRParserConfig based on the http-headers). We'd want to extend this 
to handle inheritance.

Something along these lines in json:
{code:json}
{
  "settings": {
    "PDFParserConfig.class": {
      "ocrDPI": 300,
      "sortByPosition": true
    }
  }
}
{code}
Then we'd have a small bit of code (I'd hope?) that would take the settings and 
create the config class with the map of its values:

PDFParserConfig pdfParserConfig = new PDFParserConfig(Map)

 

*What I don't like about this is that we're back in the game of creating our 
own serialization framework. :(*


[jira] [Comment Edited] (TIKA-4243) tika configuration overhaul

2024-06-03 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17851727#comment-17851727
 ] 

Tim Allison edited comment on TIKA-4243 at 6/3/24 4:45 PM:
---

I spent a bit of time trying to serialize ParseContext, and I now 
remember/newly appreciate what a challenge that is.

For one, everything has to be serializable, as we knew, with Jackson 
annotations or other Jackson based methods.

I suspect that someone who really understands Jackson could do a good job of 
this. I know the basics, and I am not a Jackson expert.

There are two main challenges: inheritance and embedded objects (as opposed to 
parameterizable primitives).

Inheritance is complicated with Jackson. If we want to support, for example, 
{{{}parseContext.set(Parser.class, new EmptyParser()){}}}, we have to store the 
base class name as the key and the instantiated class. I think I found out how 
to do this with Jackson, but it is _messy_ (reference: 
[https://www.baeldung.com/jackson-inheritance#bd-subtype-handling-scenarios]).

We'd want to deal with embedded objects for the obvious use cases of the 
CompoundDetectors, etc. where we want to specify a list of detectors. And, we 
want to be able to cover the cases of setting an object as a parameter – for 
example, setting some of the slightly more complex classes in the 
PDFParserConfig.

I'm wondering if it would be simpler to back off to a Map 
properties kind of thing where we identify the config class and then 
instantiate it for the ParseContext with the "properties". We're currently 
doing something like this in tika-server where we have custom serialization 
classes for each config we support (PDFParserConfig and 
TesseractOCRParserConfig based on the http-headers). We'd want to extend this 
to handle inheritance.

Something along these lines in json:
{code:json}
{
  "settings": {
    "PDFParserConfig.class": {
      "ocrDPI": 300,
      "sortByPosition": true
    }
  }
}
{code}
Then we'd have a small bit of code (I'd hope?) that would take the settings and 
create the config class with the map of its values:

PDFParserConfig pdfParserConfig = new PDFParserConfig(Map)

 

*What I don't like about this is that we're back in the game of creating our 
own serialization framework. :(*


was (Author: talli...@mitre.org):
I spent a bit of time trying to serialize ParseContext, and I now 
remember/newly appreciate what a challenge that is.

For one, everything has to be serializable, as we knew, with Jackson 
annotations or other Jackson based methods.

I suspect that someone who really understands Jackson could do a good job of 
this. I know the basics, and I am not a Jackson expert.

There are two main challenges: inheritance and embedded objects (as opposed to 
parameterizable primitives).

Inheritance is complicated with Jackson. If we want to support, for example, 
{{{}parseContext.set(Parser.class, new EmptyParser()){}}}, we have to store the 
base class name as the key and the instantiated class. I think I found out how 
to do this with Jackson, but it is _messy_ (reference: 
[https://www.baeldung.com/jackson-inheritance#bd-subtype-handling-scenarios]).

We'd want to deal with embedded objects for the obvious use cases of the 
CompoundDetectors, etc. where we want to specify a list of detectors. And, we 
want to be able to cover the cases of setting an object as a parameter – for 
example, setting some of the slightly more complex classes in the 
PDFParserConfig.

I'm wondering if it would be simpler to backoff to a Map 
properties kind of thing where we identify the config class and then 
instantiate it for the ParseContext with the "properties". We're currently 
doing something like this in tika-server where we have custom serialization 
classes for each config we support (PDFParserConfig and 
TesseractOCRParserConfig based on the http-headers). We'd want to extend this 
to handle inheritance.

Something along these lines in json:
{code:json}
{
  "settings": {
    "PDFParserConfig.class": {
      "ocrDPI": 300,
      "sortByPosition": true
    }
  }
}
{code}
Then we'd have a small bit of code (I'd hope?) that would take the settings and 
create the config class with the map of its values:

PDFParserConfig pdfParserConfig = new PDFParserConfig(Map)

* What I don't like about this is that we're back in the game of creating our 
own serialization framework. :( *


[jira] [Comment Edited] (TIKA-4243) tika configuration overhaul

2024-06-03 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17851727#comment-17851727
 ] 

Tim Allison edited comment on TIKA-4243 at 6/3/24 4:45 PM:
---

I spent a bit of time trying to serialize ParseContext, and I now 
remember/newly appreciate what a challenge that is.

For one, everything has to be serializable, as we knew, with Jackson 
annotations or other Jackson based methods.

I suspect that someone who really understands Jackson could do a good job of 
this. I know the basics, and I am not a Jackson expert.

There are two main challenges: inheritance and embedded objects (as opposed to 
parameterizable primitives).

Inheritance is complicated with Jackson. If we want to support, for example, 
{{{}parseContext.set(Parser.class, new EmptyParser()){}}}, we have to store the 
base class name as the key and the instantiated class. I think I found out how 
to do this with Jackson, but it is _messy_ (reference: 
[https://www.baeldung.com/jackson-inheritance#bd-subtype-handling-scenarios]).

We'd want to deal with embedded objects for the obvious use cases of the 
CompoundDetectors, etc. where we want to specify a list of detectors. And, we 
want to be able to cover the cases of setting an object as a parameter – for 
example, setting some of the slightly more complex classes in the 
PDFParserConfig.

I'm wondering if it would be simpler to back off to a Map 
properties kind of thing where we identify the config class and then 
instantiate it for the ParseContext with the "properties". We're currently 
doing something like this in tika-server where we have custom serialization 
classes for each config we support (PDFParserConfig and 
TesseractOCRParserConfig based on the http-headers). We'd want to extend this 
to handle inheritance.

Something along these lines in json:
{code:json}
{
  "settings": {
    "PDFParserConfig.class": {
      "ocrDPI": 300,
      "sortByPosition": true
    }
  }
}
{code}
Then we'd have a small bit of code (I'd hope?) that would take the settings and 
create the config class with the map of its values:

PDFParserConfig pdfParserConfig = new PDFParserConfig(Map)

* What I don't like about this is that we're back in the game of creating our 
own serialization framework. :( *
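
A minimal sketch of the "small bit of code" idea, assuming the settings arrive as a map of field names to values. `DemoPdfConfig` and `MapConfigSketch` are hypothetical names for illustration; the source does not show the real PDFParserConfig having a Map constructor, so reflection on a stand-in class is used here instead.

```java
import java.lang.reflect.Field;
import java.util.Map;

// Hypothetical stand-in for a config class; NOT Tika's PDFParserConfig.
final class DemoPdfConfig {
    int ocrDPI = 0;
    boolean sortByPosition = false;
}

final class MapConfigSketch {
    // Create the config with its no-arg constructor, then copy each map
    // entry onto the field with the same name.
    static <T> T fromMap(Class<T> type, Map<String, Object> props) {
        try {
            T config = type.getDeclaredConstructor().newInstance();
            for (Map.Entry<String, Object> e : props.entrySet()) {
                // An unknown key raises NoSuchFieldException, so typos fail loudly.
                Field field = type.getDeclaredField(e.getKey());
                field.setAccessible(true);
                // Boxed values (Integer, Boolean) are unboxed for primitive fields.
                field.set(config, e.getValue());
            }
            return config;
        } catch (ReflectiveOperationException ex) {
            throw new IllegalArgumentException("bad settings for " + type.getName(), ex);
        }
    }
}
```

Usage would look like `MapConfigSketch.fromMap(DemoPdfConfig.class, Map.of("ocrDPI", 300, "sortByPosition", true))`. Nested objects and subclasses are not handled, which is the part that starts to resemble a home-grown serialization framework.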



> tika configuration overhaul
> ---
>
> Key: TIKA-4243
> URL: https://issues.apache.org/jira/browse/TIKA-4243
> Project: Tika
>  Issue Type: New Feature
>  Components: config
>Affects Versions: 3.0.0
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> In 3.0.0 when dealing with Tika, it would greatly help to have a Typed 
> Configuration schema. 
> In 3.x can we remove the old way of doi

[jira] [Commented] (TIKA-4243) tika configuration overhaul

2024-06-03 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17851727#comment-17851727
 ] 

Tim Allison commented on TIKA-4243:
---

I spent a bit of time trying to serialize ParseContext, and I now 
remember/newly appreciate what a challenge that is.

For one, everything has to be serializable, as we knew, with Jackson 
annotations or other Jackson-based methods.

I suspect that someone who really understands Jackson could do a good job of 
this. I know the basics, and I am not a Jackson expert.

There are two main challenges: inheritance and embedded objects (as opposed to 
parameterizable primitives).

Inheritance is complicated with Jackson. If we want to support, for example, 
{{parseContext.set(Parser.class, new EmptyParser())}}, we have to store the 
base class name as the key and the instantiated class. I think I found out how 
to do this with Jackson, but it is _messy_ (reference: 
https://www.baeldung.com/jackson-inheritance#bd-subtype-handling-scenarios).

We'd want to deal with embedded objects for the obvious use cases of the 
CompoundDetectors, etc. where we want to specify a list of detectors. And, we 
want to be able to cover the cases of setting an object as a parameter -- for 
example, setting some of the slightly more complex classes in the 
PDFParserConfig. 

I'm wondering if it would be simpler to back off to a Map 
properties kind of thing where we identify the config class and then 
instantiate it for the ParseContext with the "properties". We're currently 
doing something like this in tika-server where we have custom serialization 
classes for each config we support (PDFParserConfig and 
TesseractOCRParserConfig based on the http-headers). We'd want to extend this 
to handle inheritance.

Something along these lines in json: 

{code:json}
{
  "settings": {
    "PDFParserConfig.class": {
      "ocrDPI": 300,
      "sortByPosition": true
    }
  }
}
{code}

Then we'd have a small bit of code (I'd hope?) that would take the settings and 
create the config class with the map of its values:

PDFParserConfig pdfParserConfig = new PDFParserConfig(Map)

*What I don't like about this is that we're back in the game of creating our 
own serialization framework. :(
*


> tika configuration overhaul
> ---
>
> Key: TIKA-4243
> URL: https://issues.apache.org/jira/browse/TIKA-4243
> Project: Tika
>  Issue Type: New Feature
>  Components: config
>Affects Versions: 3.0.0
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> In 3.0.0 when dealing with Tika, it would greatly help to have a Typed 
> Configuration schema. 
> In 3.x can we remove the old way of doing configs and replace with Json 
> Schema?
> Json Schema can be converted to Pojos using a maven plugin 
> [https://github.com/joelittlejohn/jsonschema2pojo]
> This automatically creates a Java Pojo model we can use for the configs. 
> This can allow for the legacy tika-config XML to be read and converted to the 
> new pojos easily using an XML mapper so that users don't have to use JSON 
> configurations yet if they do not want.
> When complete, configurations can be set as XML, JSON or YAML
> tika-config.xml
> tika-config.json
> tika-config.yaml
> Replace all instances of tika config annotations that used the old syntax, 
> and replace with the Pojo model serialized from the xml/json/yaml.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (TIKA-4260) Add parse context to the fetcher interface in 3.x

2024-06-03 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-4260.
---
Resolution: Duplicate

Turns out this is a duplicate. Onwards to TIKA-4243!

> Add parse context to the fetcher interface in 3.x
> -
>
> Key: TIKA-4260
> URL: https://issues.apache.org/jira/browse/TIKA-4260
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (TIKA-4266) Improve multithreading and the xml parser pools in XMLUtils

2024-05-30 Thread Tim Allison (Jira)
Tim Allison created TIKA-4266:
-

 Summary: Improve multithreading and the xml parser pools in 
XMLUtils
 Key: TIKA-4266
 URL: https://issues.apache.org/jira/browse/TIKA-4266
 Project: Tika
  Issue Type: Task
Reporter: Tim Allison


I recently came across a build failure when running maven multithreaded 
{{-T10}}. The cause was a race condition/thread contention in XMLUtils. I don't 
_think_ anyone has seen this in practice in the wild, but we should improve the 
thread-safety of XMLUtils.
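
One common way to share XML parsers safely across threads is a bounded pool backed by a blocking queue. The sketch below illustrates that general technique only; it is not the XMLUtils code, and `BuilderPoolSketch` with its methods is a hypothetical name.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Function;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

final class BuilderPoolSketch {
    private final BlockingQueue<DocumentBuilder> pool;

    BuilderPoolSketch(int size) {
        pool = new ArrayBlockingQueue<>(size);
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        try {
            for (int i = 0; i < size; i++) {
                pool.add(factory.newDocumentBuilder());
            }
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Borrow a builder, run the work, and always hand the builder back.
    // take() blocks when every builder is in use, so no builder is ever
    // used by two threads at once.
    <R> R withBuilder(Function<DocumentBuilder, R> work) {
        DocumentBuilder builder;
        try {
            builder = pool.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a builder", e);
        }
        try {
            return work.apply(builder);
        } finally {
            try {
                builder.reset();
            } catch (UnsupportedOperationException ignored) {
                // Some implementations do not support reset(); reuse as-is.
            }
            pool.add(builder);
        }
    }

    int available() {
        return pool.size();
    }
}
```

The design choice here is that callers never hold a builder outside `withBuilder`, so a forgotten release cannot drain the pool.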



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (TIKA-4221) Regression in pack200 parsing in commons-compress

2024-05-30 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-4221.
---
Fix Version/s: 3.0.0
   2.9.3
   Resolution: Fixed

Many thanks to [~ggregory] and {{commons-compress}}!

> Regression in pack200 parsing in commons-compress
> -
>
> Key: TIKA-4221
> URL: https://issues.apache.org/jira/browse/TIKA-4221
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
> Fix For: 3.0.0, 2.9.3
>
>
> There's a regression in pack200 that leads to the InputStream being closed 
> even if wrapped in a CloseShieldInputStream.
> This was the original signal that something was wrong, but the real problem 
> is in pack200, not xz.
> We noticed ~10 xz files with fewer attachments in the recent regression tests 
> in prep for the 2.9.2 release. This is 10 out of ~4500. So, it's a problem, 
> but not a blocker (IMHO).
> The stacktrace from 
> {{https://corpora.tika.apache.org/base/docs/commoncrawl3/YE/YEPTQ2CBI7BJ26PPVBTKZIALFSUQFDZH}}
>   looks like this:
> 3: X-TIKA:EXCEPTION:embedded_exception : org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.DefaultParser@56a4479a
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:304)
>   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203)
>   at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:152)
>   at org.apache.tika.parser.RecursiveParserWrapper$EmbeddedParserDecorator.parse(RecursiveParserWrapper.java:259)
>   at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:71)
>   at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:109)
>   at org.apache.tika.parser.pkg.CompressorParser.parse(CompressorParser.java:229)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
>   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203)
>   at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:164)
>   at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:446)
>   at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:436)
>   at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:424)
>   at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:418)
>   at org.apache.tika.parser.AutoDetectParserTest.oneOff(AutoDetectParserTest.java:563)
> ...
> Caused by: org.tukaani.xz.XZIOException: Stream closed
>   at org.tukaani.xz.SingleXZInputStream.available(Unknown Source)
>   at org.apache.commons.compress.compressors.xz.XZCompressorInputStream.available(XZCompressorInputStream.java:115)
>   at java.io.FilterInputStream.available(FilterInputStream.java:168)
>   at org.apache.commons.io.input.ProxyInputStream.available(ProxyInputStream.java:84)
>   at java.io.BufferedInputStream.available(BufferedInputStream.java:410)
>   at java.io.FilterInputStream.available(FilterInputStream.java:168)
>   at org.apache.commons.io.input.ProxyInputStream.available(ProxyInputStream.java:84)
>   at java.io.FilterInputStream.available(FilterInputStream.java:168)
>   at org.apache.commons.io.input.ProxyInputStream.available(ProxyInputStream.java:84)
>   at org.apache.commons.compress.archivers.tar.TarArchiveInputStream.skipRecordPadding(TarArchiveInputStream.java:800)
>   at org.apache.commons.compress.archivers.tar.TarArchiveInputStream.getNextTarEntry(TarArchiveInputStream.java:412)
>   at org.apache.commons.compress.archivers.tar.TarArchiveInputStream.getNextEntry(TarArchiveInputStream.java:389)
>   at org.apache.commons.compress.archivers.tar.TarArchiveInputStream.getNextEntry(TarArchiveInputStream.java:49)
>   at org.apache.tika.parser.pkg.PackageParser.parseEntries(PackageParser.java:389)
>   at org.apache.tika.parser.pkg.PackageParser.parse(PackageParser.java:329)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
>   ... 85 more
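
For readers unfamiliar with the close-shield idea mentioned in the description, here is a minimal sketch of what such a wrapper is meant to guarantee. It does not reproduce the pack200 regression and is not the commons-io implementation; `CloseShieldSketch`, `CloseTrackingStream` and `CloseShieldDemo` are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Wrapper whose close() is a no-op, so the wrapped stream stays open.
final class CloseShieldSketch extends FilterInputStream {
    CloseShieldSketch(InputStream in) {
        super(in);
    }

    @Override
    public void close() {
        // Intentionally empty: the caller that owns the stream closes it.
    }
}

// Test helper that records whether close() reached it.
final class CloseTrackingStream extends ByteArrayInputStream {
    boolean closed = false;

    CloseTrackingStream(byte[] data) {
        super(data);
    }

    @Override
    public void close() {
        closed = true;
    }
}

final class CloseShieldDemo {
    // Simulates a consumer that closes whatever stream it is handed and
    // reports whether the caller's own stream ended up closed.
    static boolean callerStreamClosed(boolean useShield) {
        CloseTrackingStream source = new CloseTrackingStream(new byte[] {1, 2, 3});
        InputStream handedOver = useShield ? new CloseShieldSketch(source) : source;
        try {
            handedOver.read();
            handedOver.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return source.closed;
    }
}
```

With the shield in place the caller's stream survives the consumer's `close()`. Per the description above, the regression is a case where the underlying stream was closed despite such a wrapper.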



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (TIKA-4220) Commons-compress too lenient on headless tar detection

2024-05-30 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-4220.
---
Fix Version/s: 3.0.0
   2.9.3
   Resolution: Fixed

Many thanks to [~ggregory] and {{commons-compress}}!

> Commons-compress too lenient on headless tar detection
> --
>
> Key: TIKA-4220
> URL: https://issues.apache.org/jira/browse/TIKA-4220
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Minor
> Fix For: 3.0.0, 2.9.3
>
>
> On recent regression tests on TIKA-4218, we noticed a fairly major change 
> with an increased rate of false positives on headless tar detection from 
> commons-compress.
> I think for now we should copy/paste/fork the headless tar detection and 
> improve it/revert it or possibly remove it for our 2.9.2 release.
> On this ticket, I'll look into what changed recently in headless tar 
> detection in commons-compress and experiment with fixing it.
> One challenge is that our magic bytes detection happens _after_ our custom 
> detectors, which means that we can't put a low confidence on what comes out 
> of our custom detectors and let the magic detection fix it. We could  
> implement an x-tar special case, but I really don't like that.
> Let's see what we can do...
> The numbers below represent the number of files identified as A (in tika 
> 2.9.1) -> B (in tika-2.9.2-pre-rc1).
> application/octet-stream -> application/x-tar           826
> multipart/appledouble -> application/x-tar              701
> image/x-tga -> application/x-tar                        322
> image/vnd.microsoft.icon -> application/x-tar           312
> application/vnd.iccprofile -> application/x-tar         221
> video/mp4 -> application/x-tar                          177
> audio/mpeg -> application/x-tar                         59
> video/x-m4v -> application/x-tar                        59
> application/x-font-printer-metric -> application/x-tar  36
> audio/mp4 -> application/x-tar                          25
> application/x-tex-tfm -> application/x-tar              18
> image/x-pict -> application/x-tar                       15
> image/png -> application/x-tar                          8
> text/plain; charset=ISO-8859-1 -> application/x-tar     8
> application/x-endnote-style -> application/x-tar        7
> application/x-font-ttf -> application/x-tar             6



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4265) Consider adding maven build cache extension

2024-05-30 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17850776#comment-17850776
 ] 

Tim Allison commented on TIKA-4265:
---

It doesn't help at all if there's a modification in tika-core, even in a unit 
test, but I think this will be quite helpful when working on other modules, 
especially those not so early in the build tree.

> Consider adding maven build cache extension
> ---
>
> Key: TIKA-4265
> URL: https://issues.apache.org/jira/browse/TIKA-4265
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> This would  be for 3.x. It wouldn't speed up ci/cd, but it may help with 
> local builds. It requires maven >= 3.9
> Has anyone used it? Think it will be a good fit for Tika?
> https://maven.apache.org/extensions/maven-build-cache-extension/



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4265) Consider adding maven build cache extension

2024-05-30 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17850773#comment-17850773
 ] 

Tim Allison commented on TIKA-4265:
---

I just pushed a demo to {{build-cache}}. This includes documentation in the 
README.md on how to turn off the build cache

> Consider adding maven build cache extension
> ---
>
> Key: TIKA-4265
> URL: https://issues.apache.org/jira/browse/TIKA-4265
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> This would  be for 3.x. It wouldn't speed up ci/cd, but it may help with 
> local builds. It requires maven >= 3.9
> Has anyone used it? Think it will be a good fit for Tika?
> https://maven.apache.org/extensions/maven-build-cache-extension/



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (TIKA-4265) Consider adding maven build cache extension

2024-05-30 Thread Tim Allison (Jira)
Tim Allison created TIKA-4265:
-

 Summary: Consider adding maven build cache extension
 Key: TIKA-4265
 URL: https://issues.apache.org/jira/browse/TIKA-4265
 Project: Tika
  Issue Type: Task
Reporter: Tim Allison


This would  be for 3.x. It wouldn't speed up ci/cd, but it may help with local 
builds. It requires maven >= 3.9

Has anyone used it? Think it will be a good fit for Tika?

https://maven.apache.org/extensions/maven-build-cache-extension/



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (TIKA-4261) Add attachment type metadata filter

2024-05-24 Thread Tim Allison (Jira)
Tim Allison created TIKA-4261:
-

 Summary: Add attachment type metadata filter
 Key: TIKA-4261
 URL: https://issues.apache.org/jira/browse/TIKA-4261
 Project: Tika
  Issue Type: Task
Reporter: Tim Allison


Some users of the /rmeta endpoint or the -J option in tika-app inline OCR'd 
content and have no need for the metadata object of the inlined image. Let's 
add a metadata filter to remove these metadata objects.

The default behavior will be as before: everything is included. Users need to 
configure this filter to remove these inline objects.
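
A rough sketch of the filtering behavior described above, assuming the output is a list of metadata maps with the container document first. The key name `attachmentType`, the value `INLINE` and the class name are illustrative assumptions, not Tika's actual property names or filter API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class InlineAttachmentFilterSketch {
    // Hypothetical key and value, chosen for illustration only.
    static final String TYPE_KEY = "attachmentType";
    static final String INLINE = "INLINE";

    // Returns the metadata objects to keep. With dropInline == false the
    // list is returned unchanged in content, matching the default
    // "everything is included" behavior.
    static List<Map<String, String>> filter(List<Map<String, String>> metadataList,
                                            boolean dropInline) {
        List<Map<String, String>> kept = new ArrayList<>();
        for (int i = 0; i < metadataList.size(); i++) {
            Map<String, String> metadata = metadataList.get(i);
            boolean isContainer = (i == 0); // never drop the container document
            if (dropInline && !isContainer && INLINE.equals(metadata.get(TYPE_KEY))) {
                continue;
            }
            kept.add(metadata);
        }
        return kept;
    }
}
```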



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (TIKA-4259) Decouple xml parser stuff from ParseContext

2024-05-24 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-4259.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

> Decouple xml parser stuff from ParseContext
> ---
>
> Key: TIKA-4259
> URL: https://issues.apache.org/jira/browse/TIKA-4259
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Trivial
> Fix For: 3.0.0
>
>
> ParseContext has some xmlreader convenience methods. We should move those to 
> XMLReaderUtils in 3.x to simplify ParseContext's api.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4260) Add parse context to the fetcher interface in 3.x

2024-05-24 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17849298#comment-17849298
 ] 

Tim Allison commented on TIKA-4260:
---

That PR currently only works on tika-core. More needs to be done before we can 
merge this if this is the direction we'd like to go.

> Add parse context to the fetcher interface in 3.x
> -
>
> Key: TIKA-4260
> URL: https://issues.apache.org/jira/browse/TIKA-4260
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4243) tika configuration overhaul

2024-05-24 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17849288#comment-17849288
 ] 

Tim Allison commented on TIKA-4243:
---

[~ndipiazza], I added parseContext to fetchers and emitters on the TIKA-4260 
branch. That might be a good start for serializing the ParseContext?

All, let me know what you think.

> tika configuration overhaul
> ---
>
> Key: TIKA-4243
> URL: https://issues.apache.org/jira/browse/TIKA-4243
> Project: Tika
>  Issue Type: New Feature
>  Components: config
>Affects Versions: 3.0.0
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> In 3.0.0 when dealing with Tika, it would greatly help to have a Typed 
> Configuration schema. 
> In 3.x can we remove the old way of doing configs and replace with Json 
> Schema?
> Json Schema can be converted to Pojos using a maven plugin 
> [https://github.com/joelittlejohn/jsonschema2pojo]
> This automatically creates a Java Pojo model we can use for the configs. 
> This can allow for the legacy tika-config XML to be read and converted to the 
> new pojos easily using an XML mapper so that users don't have to use JSON 
> configurations yet if they do not want.
> When complete, configurations can be set as XML, JSON or YAML
> tika-config.xml
> tika-config.json
> tika-config.yaml
> Replace all instances of tika config annotations that used the old syntax, 
> and replace with the Pojo model serialized from the xml/json/yaml.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (TIKA-4243) tika configuration overhaul

2024-05-24 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17849103#comment-17849103
 ] 

Tim Allison edited comment on TIKA-4243 at 5/24/24 1:00 PM:


Proposed basic roadmap:

Add parseContext to fetchers and emitters (and pipesReporter?)
Serialize ParseContext as is...
Allow for serialization of current XConfigs, e.g. PDFParserConfig, etc.
Add creation of parsers with e.g. new PDFParser(ParseContext context).
Wire config stuff into tika-server, tika-pipes, tika-app
Merge tika-grpc-server with new config options

This would require serialization support for every class that users want to be 
able to configure.

This would allow us to get rid of all of our custom serialization stuff for 
Tika 4.x.



was (Author: talli...@mitre.org):
Proposed basic roadmap:

Serialize ParseContext as is...
Allow for serialization of current XConfigs, eg. PDFParserConfig, etc.
Add creation of parsers with e.g. new PDFParser(ParseContext context).
Wire config stuff into tika-server, tika-pipes, tika-app
Merge tika-grpc-server with new config options

This would require serialization of classes that users want to be able to 
configure + serialization.

This would allow us to get rid of all of our custom serialization stuff for 
Tika 4.x.


> tika configuration overhaul
> ---
>
> Key: TIKA-4243
> URL: https://issues.apache.org/jira/browse/TIKA-4243
> Project: Tika
>  Issue Type: New Feature
>  Components: config
>Affects Versions: 3.0.0
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> In 3.0.0 when dealing with Tika, it would greatly help to have a Typed 
> Configuration schema. 
> In 3.x can we remove the old way of doing configs and replace with Json 
> Schema?
> Json Schema can be converted to Pojos using a maven plugin 
> [https://github.com/joelittlejohn/jsonschema2pojo]
> This automatically creates a Java Pojo model we can use for the configs. 
> This can allow for the legacy tika-config XML to be read and converted to the 
> new pojos easily using an XML mapper so that users don't have to use JSON 
> configurations yet if they do not want.
> When complete, configurations can be set as XML, JSON or YAML
> tika-config.xml
> tika-config.json
> tika-config.yaml
> Replace all instances of tika config annotations that used the old syntax, 
> and replace with the Pojo model serialized from the xml/json/yaml.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (TIKA-4260) Add parse context to the fetcher interface in 3.x

2024-05-23 Thread Tim Allison (Jira)
Tim Allison created TIKA-4260:
-

 Summary: Add parse context to the fetcher interface in 3.x
 Key: TIKA-4260
 URL: https://issues.apache.org/jira/browse/TIKA-4260
 Project: Tika
  Issue Type: Task
Reporter: Tim Allison






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (TIKA-4259) Decouple xml parser stuff from ParseContext

2024-05-23 Thread Tim Allison (Jira)
Tim Allison created TIKA-4259:
-

 Summary: Decouple xml parser stuff from ParseContext
 Key: TIKA-4259
 URL: https://issues.apache.org/jira/browse/TIKA-4259
 Project: Tika
  Issue Type: Task
Reporter: Tim Allison


ParseContext has some xmlreader convenience methods. We should move those to 
XMLReaderUtils in 3.x to simplify ParseContext's api.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4243) tika configuration overhaul

2024-05-23 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17849114#comment-17849114
 ] 

Tim Allison commented on TIKA-4243:
---

I'm going to start working on PRs that will be generally helpful for the above, 
but they'll still be useful if we all choose a different direction. I'll hold 
off on the core work for a bit in case there are objections or better ways 
forward.

> tika configuration overhaul
> ---
>
> Key: TIKA-4243
> URL: https://issues.apache.org/jira/browse/TIKA-4243
> Project: Tika
>  Issue Type: New Feature
>  Components: config
>Affects Versions: 3.0.0
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> In 3.0.0 when dealing with Tika, it would greatly help to have a Typed 
> Configuration schema. 
> In 3.x can we remove the old way of doing configs and replace with Json 
> Schema?
> Json Schema can be converted to Pojos using a maven plugin 
> [https://github.com/joelittlejohn/jsonschema2pojo]
> This automatically creates a Java Pojo model we can use for the configs. 
> This can allow for the legacy tika-config XML to be read and converted to the 
> new pojos easily using an XML mapper so that users don't have to use JSON 
> configurations yet if they do not want.
> When complete, configurations can be set as XML, JSON or YAML
> tika-config.xml
> tika-config.json
> tika-config.yaml
> Replace all instances of tika config annotations that used the old syntax, 
> and replace with the Pojo model serialized from the xml/json/yaml.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4243) tika configuration overhaul

2024-05-23 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17849108#comment-17849108
 ] 

Tim Allison commented on TIKA-4243:
---

The downsides we see:
a) if there's agreement to add jackson-annotations to tika-core, we add a 
few kb to tika-core
b) we're at risk of having jackson-annotations sprinkled throughout our 
codebase on the XConfig classes, but this is basically where we have our own 
@Field annotations now. So break even?
c) Customized classes that need to be passed via the ParseContext will need to 
be serializable to be used in tika-server, tika-pipes, etc., that is, anything 
that allows for configuration.

> tika configuration overhaul
> ---
>
> Key: TIKA-4243
> URL: https://issues.apache.org/jira/browse/TIKA-4243
> Project: Tika
>  Issue Type: New Feature
>  Components: config
>Affects Versions: 3.0.0
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> In 3.0.0 when dealing with Tika, it would greatly help to have a Typed 
> Configuration schema. 
> In 3.x can we remove the old way of doing configs and replace with Json 
> Schema?
> Json Schema can be converted to Pojos using a maven plugin 
> [https://github.com/joelittlejohn/jsonschema2pojo]
> This automatically creates a Java Pojo model we can use for the configs. 
> This can allow for the legacy tika-config XML to be read and converted to the 
> new pojos easily using an XML mapper so that users don't have to use JSON 
> configurations yet if they do not want.
> When complete, configurations can be set as XML, JSON or YAML
> tika-config.xml
> tika-config.json
> tika-config.yaml
> Replace all instances of tika config annotations that used the old syntax, 
> and replace with the Pojo model serialized from the xml/json/yaml.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

