[jira] [Updated] (TIKA-4310) Add CloseShield to JSoupParser
[ https://issues.apache.org/jira/browse/TIKA-4310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-4310: -- Fix Version/s: 3.0.0 > Add CloseShield to JSoupParser > -- > > Key: TIKA-4310 > URL: https://issues.apache.org/jira/browse/TIKA-4310 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Minor > Fix For: 3.0.0 > > > The JsoupParser under the hood closes the reader and thereby the stream. This > breaks the normal parser contract that the parser does not close the stream. -- This message was sent by Atlassian Jira (v8.20.10#820010)
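For readers unfamiliar with the close-shield pattern named in the ticket title, the following is a minimal, stdlib-only sketch of the idea. It makes no claim about how JSoupParser was actually changed: `NonClosingInputStream`, `TrackingInputStream`, and `readAndClose` are hypothetical names used only for illustration. Apache Commons IO ships a `CloseShieldInputStream` that plays the role of `NonClosingInputStream` here.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class CloseShieldSketch {

    /** Hypothetical stand-in for a close shield: forwards reads, ignores close(). */
    static final class NonClosingInputStream extends FilterInputStream {
        NonClosingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public void close() {
            // Intentionally a no-op: whoever opened the stream is responsible for closing it.
        }
    }

    /** Helper that records whether close() reached the underlying stream. */
    static final class TrackingInputStream extends ByteArrayInputStream {
        boolean closed = false;

        TrackingInputStream(byte[] bytes) {
            super(bytes);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    /** Simulates a library call that closes its reader, and thereby the wrapped stream. */
    static String readAndClose(InputStream in) throws IOException {
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        }
    }

    /** Returns true if the caller's stream ended up closed after the simulated library call. */
    static boolean closedAfterRead(boolean shield) {
        TrackingInputStream raw =
                new TrackingInputStream("<p>hi</p>".getBytes(StandardCharsets.UTF_8));
        try {
            readAndClose(shield ? new NonClosingInputStream(raw) : raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return raw.closed;
    }

    public static void main(String[] args) {
        System.out.println("unshielded stream closed: " + closedAfterRead(false));
        System.out.println("shielded stream closed:   " + closedAfterRead(true));
    }
}
```

The shield is applied at the point where the stream is handed to the code that closes it, so the caller keeps ownership of the original stream.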
[jira] [Resolved] (TIKA-4310) Add CloseShield to JSoupParser
[ https://issues.apache.org/jira/browse/TIKA-4310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-4310. --- Resolution: Fixed
[jira] [Created] (TIKA-4310) Add CloseShield to JSoupParser
Tim Allison created TIKA-4310: - Summary: Add CloseShield to JSoupParser Key: TIKA-4310 URL: https://issues.apache.org/jira/browse/TIKA-4310 Project: Tika Issue Type: Task Reporter: Tim Allison The JsoupParser under the hood closes the reader and thereby the stream. This breaks the normal parser contract that the parser does not close the stream. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4305) Tika producing empty output for UCS encoded txt files; parses UTF-7 files as UTF-8
[ https://issues.apache.org/jira/browse/TIKA-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17880641#comment-17880641 ]

Tim Allison commented on TIKA-4305:
---

K. So there are two different issues.

1) Above, I was focusing on the difference in detection between UCS2 vs UTF-16 and UCS4 vs UTF32. There's not much we can do with that.

2) The other issue is that Tika can have a hard time determining that an InputStream is a text file unless the filename is included as a hint. Without the file name, Tika detects octet-stream.

So, either of these works for communicating the file name to Tika:

a)
System.out.println(new Tika().parseToString(Paths.get(".../multilingual_test_new_UCS-2.txt")));

b)
Metadata metadata = new Metadata();
try (InputStream is = TikaInputStream.get(Paths.get("/home/tallison/Downloads/multilingual_test_new_UCS-2.txt"), metadata)) {
    System.out.println(new Tika().parseToString(is, metadata));
}

> Tika producing empty output for UCS encoded txt files; parses UTF-7 files as UTF-8
> --
>
> Key: TIKA-4305
> URL: https://issues.apache.org/jira/browse/TIKA-4305
> Project: Tika
> Issue Type: Bug
> Components: tika-app, tika-core
> Affects Versions: 2.9.2
> Environment: Ubuntu 22.04 LTS
> Reporter: Manish S N
> Priority: Minor
> Attachments: multilingual_test_new_UCS-2.txt, multilingual_test_new_UCS-4.txt, multilingual_test_new_UTF-7.txt, multilingual_test_new_UTF-8.txt, pom.xml, tika_UTF-7_output.txt
>
> Tika produces an empty string as output for UCS-2 and UCS-4 encoded txt files.
> No logs or errors, just an empty string.
> Other formats are okay, except that UTF-7 files are parsed as UTF-8, which wreaks havoc with non-ASCII characters.
> How do I know that? I opened the UTF-7 file as UTF-8 encoded using the open dialog of gedit and found the outputs similar.
>
> I am attaching all four encoded files along with Tika's output from parsing the UTF-7 file for reference.
[jira] [Comment Edited] (TIKA-4305) Tika producing empty output for UCS encoded txt files; parses UTF-7 files as UTF-8
[ https://issues.apache.org/jira/browse/TIKA-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17880641#comment-17880641 ]

Tim Allison edited comment on TIKA-4305 at 9/10/24 1:20 PM:

K. So there are two different issues.

1) Above, I was focusing on the difference in detection between UCS2 vs UTF-16 and UCS4 vs UTF32. There's not much we can do with that.

2) The other issue is that Tika can have a hard time determining that an InputStream is a text file (esp for UTF16 and UTF32) unless the filename is included as a hint. Without the file name, Tika detects octet-stream.

There are several ways to pass the filename hint to Tika. These are two of them:

a)
System.out.println(new Tika().parseToString(Paths.get(".../multilingual_test_new_UCS-2.txt")));

b)
Metadata metadata = new Metadata();
try (InputStream is = TikaInputStream.get(Paths.get(".../multilingual_test_new_UCS-2.txt"), metadata)) {
    System.out.println(new Tika().parseToString(is, metadata));
}

was (Author: talli...@mitre.org):

K. So there are two different issues.

1) Above, I was focusing on the difference in detection between UCS2 vs UTF-16 and UCS4 vs UTF32. There's not much we can do with that.

2) The other issue is that Tika can have a hard time determining that an InputStream is a text file unless the filename is included as a hint. Without the file name, Tika detects octet-stream.

So, either of these work for communicating the file name to Tika:

a)
System.out.println(new Tika().parseToString(Paths.get(".../multilingual_test_new_UCS-2.txt")));

b)
Metadata metadata = new Metadata();
try (InputStream is = TikaInputStream.get(Paths.get(".../multilingual_test_new_UCS-2.txt"), metadata)) {
    System.out.println(new Tika().parseToString(is, metadata));
}
[jira] [Comment Edited] (TIKA-4305) Tika producing empty output for UCS encoded txt files; parses UTF-7 files as UTF-8
[ https://issues.apache.org/jira/browse/TIKA-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17880641#comment-17880641 ]

Tim Allison edited comment on TIKA-4305 at 9/10/24 1:18 PM:

K. So there are two different issues.

1) Above, I was focusing on the difference in detection between UCS2 vs UTF-16 and UCS4 vs UTF32. There's not much we can do with that.

2) The other issue is that Tika can have a hard time determining that an InputStream is a text file unless the filename is included as a hint. Without the file name, Tika detects octet-stream.

So, either of these work for communicating the file name to Tika:

a)
System.out.println(new Tika().parseToString(Paths.get(".../multilingual_test_new_UCS-2.txt")));

b)
Metadata metadata = new Metadata();
try (InputStream is = TikaInputStream.get(Paths.get(".../multilingual_test_new_UCS-2.txt"), metadata)) {
    System.out.println(new Tika().parseToString(is, metadata));
}

was (Author: talli...@mitre.org):

K. So there are two different issues.

1) Above, I was focusing on the difference in detection between UCS2 vs UTF-16 and UCS4 vs UTF32. There's not much we can do with that.

2) The other issue is that Tika can have a hard time determining that an InputStream is a text file unless the filename is included as a hint. Without the file name, Tika detects octet-stream.

So, either of these work for communicating the file name to Tika:

a)
System.out.println(new Tika().parseToString(Paths.get(".../multilingual_test_new_UCS-2.txt")));

b)
Metadata metadata = new Metadata();
try (InputStream is = TikaInputStream.get(Paths.get("/home/tallison/Downloads/multilingual_test_new_UCS-2.txt"), metadata)) {
    System.out.println(new Tika().parseToString(is, metadata));
}
[jira] [Commented] (TIKA-4307) Text in header not extracted for Microsoft Word doc file
[ https://issues.apache.org/jira/browse/TIKA-4307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17880628#comment-17880628 ] Tim Allison commented on TIKA-4307: --- I asked for help from fellow POI devs: https://bz.apache.org/bugzilla/show_bug.cgi?id=69314 > Text in header not extracted for Microsoft Word doc file > > > Key: TIKA-4307 > URL: https://issues.apache.org/jira/browse/TIKA-4307 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 2.9.2 >Reporter: August Valera >Priority: Major > Attachments: 560702J-2x-converted.doc, 560702J-converted.docx, > 560702J-full-output.txt, 560702J.doc, screenshot-1.png > > > We have a Microsoft Word doc file with text in the header. That header text > is not successfully extracted alongside the file content, but converting the > file to a docx file results in successful extraction. > Samples are attached, conversion done using cloudconvert.com. > * [^560702J.doc] Original doc file, missing content > * [^560702J-converted.docx] Converted to docx file, correct output > * [^560702J-2x-converted.doc] Docx file converted back to doc, again missing > content > h3. Current Behavior > doc files omit header text. docx files extract header text correctly. > h3. Expected Behavior > doc and docx files with identical content in header should result in > identical output -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4305) Tika producing empty output for UCS encoded txt files; parses UTF-7 files as UTF-8
[ https://issues.apache.org/jira/browse/TIKA-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17880594#comment-17880594 ]

Tim Allison commented on TIKA-4305:
---

Thank you. Y, as I mentioned above, I effectively conflated UCS2=UTF-16 and UCS4=UTF32 and declared victory. Are there significant differences in bytes in the UCS2 vs UTF-16 or for UCS4 vs UTF32 in the file you submitted?

The UniversalCharsetDetector linked above does mention {{X-ISO-10646-UCS-4-3412 / X-ISO-10646-UCS-4-2143}} but it did not detect UCS4 on the test file.

Unfortunately, as with UTF-7, the answer is basically the same: unless you can get the upstream projects to detect these charsets or unless you can find another detector with a friendly license, there's not much we can do.
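As background to the question about byte-level differences, here is a small stdlib-only sketch. It assumes big-endian encoding without a BOM and says nothing about the attached test files; the class and method names are hypothetical. For text that stays within the Basic Multilingual Plane, UCS-2 and UTF-16 produce the same bytes. They only diverge for supplementary code points, which UTF-16 encodes as a surrogate pair and UCS-2 cannot represent at all.

```java
import java.nio.charset.StandardCharsets;

final class Ucs2VsUtf16Sketch {

    /** Number of bytes the text occupies when encoded as UTF-16BE (no BOM is written). */
    static int utf16beLength(String text) {
        return text.getBytes(StandardCharsets.UTF_16BE).length;
    }

    /** True if every code point fits in a single 16-bit unit, i.e. is representable in UCS-2. */
    static boolean fitsInUcs2(String text) {
        return text.codePoints().noneMatch(Character::isSupplementaryCodePoint);
    }

    public static void main(String[] args) {
        String bmpOnly = "h\u00e9llo";         // five BMP characters
        String supplementary = "\uD83D\uDE00"; // U+1F600, one code point, two UTF-16 units

        System.out.println(utf16beLength(bmpOnly) + " bytes, fits in UCS-2: " + fitsInUcs2(bmpOnly));
        System.out.println(utf16beLength(supplementary) + " bytes, fits in UCS-2: " + fitsInUcs2(supplementary));
    }
}
```

A check like `fitsInUcs2` is one way to tell whether a given sample can distinguish the two encodings at all; if it returns true, a detector reporting UTF-16 for a UCS-2 file is not losing information.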
[jira] [Commented] (TIKA-4305) Tika producing empty output for UCS encoded txt files; parses UTF-7 files as UTF-8
[ https://issues.apache.org/jira/browse/TIKA-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17880344#comment-17880344 ]

Tim Allison commented on TIKA-4305:
---

For those raising an eyebrow over anyone still using UTF-7, you are not alone: https://en.wikipedia.org/wiki/UTF-7
[jira] [Commented] (TIKA-4305) Tika producing empty output for UCS encoded txt files; parses UTF-7 files as UTF-8
[ https://issues.apache.org/jira/browse/TIKA-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17880342#comment-17880342 ]

Tim Allison commented on TIKA-4305:
---

I get text that I think is correct with tika app 2.9.2: {{java -jar tika-app-2.9.2.jar multilingual_test_new_UCS-4.txt}} and when I run it against UCS-2.

How are you running Tika when you get an empty string for the UTF-16 and UTF-32?
[jira] [Commented] (TIKA-4305) Tika producing empty output for UCS encoded txt files; parses UTF-7 files as UTF-8
[ https://issues.apache.org/jira/browse/TIKA-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17880341#comment-17880341 ]

Tim Allison commented on TIKA-4305:
---

Thank you for raising this issue. For the following, I'm running a unit test in Tika's main branch in the {{tika-parsers-standard-package}} module:

* UTF-8 works
* UCS-4 is detected correctly by the ICU4j detector as UTF-32BE.
* UCS-2 is detected correctly by the ICU4j detector as UTF-16LE.
* UTF-7 is incorrectly detected as windows-1252 by the UniversalCharsetDetector. If I turn off the UniversalCharsetDetector, the ICU4j detector incorrectly detects charset=ISO-8859-1

The fork of UniversalCharsetDetector that we use (https://github.com/albfernandez/juniversalchardet) does not claim to detect utf-7. ICU4j also does not detect utf-7 (https://unicode-org.github.io/icu/userguide/conversion/detection.html#detected-encodings).

So, if you can open a ticket in one of those projects and/or identify another charset detector that has a friendly license and can detect utf-7, we should look into adding that to Tika.
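One reason UTF-7 is hard to detect can be sketched with the standard library alone. UTF-7 output consists only of 7-bit ASCII bytes, so any ASCII-compatible charset decodes it without error and the shifted sequences come through as literal text. The sample string below is hand-encoded (é, U+00E9, becomes `+AOk-`) because the JDK has no built-in UTF-7 charset; the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Utf7LooksLikeAsciiSketch {

    /** "Café" hand-encoded as UTF-7: é (U+00E9) becomes the shifted sequence +AOk- */
    static final String CAFE_UTF7 = "Caf+AOk-";

    /** The raw bytes as they would sit in a UTF-7 file (all below 0x80). */
    static byte[] sampleBytes() {
        return CAFE_UTF7.getBytes(StandardCharsets.US_ASCII);
    }

    /** True if no byte has its high bit set. */
    static boolean isPureAscii(byte[] bytes) {
        for (byte b : bytes) {
            if (b < 0) {
                return false;
            }
        }
        return true;
    }

    /** Decodes the sample with an arbitrary charset, as a mis-detecting parser would. */
    static String decodeSampleAs(Charset charset) {
        return new String(sampleBytes(), charset);
    }

    public static void main(String[] args) {
        System.out.println("pure ASCII: " + isPureAscii(sampleBytes()));
        System.out.println("as UTF-8:      " + decodeSampleAs(StandardCharsets.UTF_8));
        System.out.println("as ISO-8859-1: " + decodeSampleAs(StandardCharsets.ISO_8859_1));
    }
}
```

Both decodings yield the undecoded string `Caf+AOk-`, which matches the symptom in the report: non-ASCII characters show up as `+...-` runs rather than as decoding errors.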
[jira] [Commented] (TIKA-4306) ffmpeg all the images
[ https://issues.apache.org/jira/browse/TIKA-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17880335#comment-17880335 ] Tim Allison commented on TIKA-4306: --- The way to accomplish this would be to add more mime types here: https://github.com/apache/tika/blob/main/tika-core/src/main/resources/org/apache/tika/parser/external/tika-external-parsers.xml#L30 Any others besides jpeg2000 and jbig? > ffmpeg all the images > - > > Key: TIKA-4306 > URL: https://issues.apache.org/jira/browse/TIKA-4306 > Project: Tika > Issue Type: Wish > Components: tika-core >Affects Versions: 3.0.0-BETA >Reporter: Jim Northrup >Priority: Major > > jpeg2000 and JBIG and a few other image formats (numerous) could benefit from > the generic features of a simple `ffmpeg -i output` > seems like webp is low hanging fruit here as well as the format updates that > ffmpeg experiences rarely alter the commandline. > literally solving for one image format is typically only changing a file > extension with a few exemptions. > Thanks in advance for any conversation toward testing this -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4239) Update to 2.9.3
[ https://issues.apache.org/jira/browse/TIKA-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877432#comment-17877432 ] Tim Allison commented on TIKA-4239: --- Thank you! > Update to 2.9.3 > --- > > Key: TIKA-4239 > URL: https://issues.apache.org/jira/browse/TIKA-4239 > Project: Tika > Issue Type: Task > Components: build >Affects Versions: 2.9.2 >Reporter: Tilman Hausherr >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-4301) Factor tika pipes base classes out of tika-core into a tika-pipes-core module
Tim Allison created TIKA-4301: - Summary: Factor tika pipes base classes out of tika-core into a tika-pipes-core module Key: TIKA-4301 URL: https://issues.apache.org/jira/browse/TIKA-4301 Project: Tika Issue Type: Task Reporter: Tim Allison This is part of the larger vision of TIKA-4272. This will slim down tika-core by itself, and it will prevent us from adding extra dependencies for (pf4j and semver and...) into tika-core on TIKA-4272. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4280) Tasks for the 3.0.0 release
[ https://issues.apache.org/jira/browse/TIKA-4280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875614#comment-17875614 ]

Tim Allison commented on TIKA-4280:
---

Got it. Now I see. Thank you. I don't think we have an active stakeholder for tika-dl or several of the submodules in tika-parsers-ml. :( I'd be happy to be proven wrong!

So, that leaves the *-M* issues... are we ok with leaving those two as they are?

> Tasks for the 3.0.0 release
> ---
>
> Key: TIKA-4280
> URL: https://issues.apache.org/jira/browse/TIKA-4280
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
> I'm too lazy to open separate tickets. Please do so if desired.
> Some items:
> * Before releasing the real 3.0.0 we need to remove any "-M" dependencies
> * Decide about the ffmpeg issue and the hdf5 issue
> * Run the regression tests vs 2.9.x
> * Convert tika-grpc to use the dependency plugin instead of the shade plugin
> * Turn javadocs back on. I got errors during the deploy process because javadoc needed the auto-generated code ("cannot find symbol DeleteFetcherRequest"). We need to enable javadocs for the rest of the project.
> * TIKA-4290 Tilman question
> Other things? Thank you [~tilman] for the first two!
[jira] [Commented] (TIKA-4280) Tasks for the 3.0.0 release
[ https://issues.apache.org/jira/browse/TIKA-4280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875577#comment-17875577 ]

Tim Allison commented on TIKA-4280:
---

bq. Decide about the ffmpeg issue and the hdf5 issue

Sorry, what are the issues here?
[jira] [Commented] (TIKA-4280) Tasks for the 3.0.0 release
[ https://issues.apache.org/jira/browse/TIKA-4280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875579#comment-17875579 ]

Tim Allison commented on TIKA-4280:
---

bq. TIKA-4290 Tilman question

Does anything remain on this issue? I think we're good. That was a bunch of cleanup.
[jira] [Commented] (TIKA-4280) Tasks for the 3.0.0 release
[ https://issues.apache.org/jira/browse/TIKA-4280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875574#comment-17875574 ]

Tim Allison commented on TIKA-4280:
---

bq. Before releasing the real 3.0.0 we need to remove any "-M" dependencies

I see commons.collections4 4.5.0-M2 and dl4j 1.0.0-M2.1 in Tika's parent pom.

For commons collections4, the last non-M release was in 2019. Is there enough of a reason to revert to 4.4?

For dl4j, the last non-M release would take us back to 2017 with 0.9.1. Is there enough of a reason to revert to 0.9.1?

Looking at transitive dependencies, I see org.datavec:datavec-data-image:jar:1.0.0-M2.1, org.nd4j:jackson:jar:1.0.0-M2.1 and a few other org.nd4j:* (which are consistent with dl4j). So no surprises in transitive dependencies?

Are there other *-M dependencies? Or, has this already been cleaned up?

Thank you [~tilman] for raising this point.
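A quick way to scan a list of versions for milestone releases could look like the sketch below. The pattern is an assumption drawn only from the version strings mentioned above (a trailing `-M` followed by digits and optional dotted digits); other projects may mark milestones differently, and the class name is hypothetical.

```java
import java.util.List;
import java.util.regex.Pattern;

final class MilestoneVersionSketch {

    // Assumed convention: the version ends in "-M<digits>" with optional ".<digits>" groups.
    private static final Pattern MILESTONE = Pattern.compile(".*-M\\d+(\\.\\d+)*");

    /** True if the version string looks like a milestone release under the assumed convention. */
    static boolean isMilestone(String version) {
        return MILESTONE.matcher(version).matches();
    }

    public static void main(String[] args) {
        // Versions taken from the comment above.
        for (String version : List.of("4.5.0-M2", "1.0.0-M2.1", "4.4", "0.9.1")) {
            System.out.println(version + " -> milestone: " + isMilestone(version));
        }
    }
}
```

In practice the input would come from the resolved dependency list of the build rather than a hard-coded list.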
[jira] [Resolved] (TIKA-4299) Clean up pagination in AbstractPDF2XHTML
[ https://issues.apache.org/jira/browse/TIKA-4299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-4299. --- Fix Version/s: 3.0.0 Resolution: Fixed > Clean up pagination in AbstractPDF2XHTML > > > Key: TIKA-4299 > URL: https://issues.apache.org/jira/browse/TIKA-4299 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Trivial > Fix For: 3.0.0 > > > This is a follow on to TIKA-4296. We should remove the "why this hack" > comment and perhaps clean up small bits of other code related to pagination > in AbstractPDF2XHTML. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-4299) Clean up pagination in AbstractPDF2XHTML
Tim Allison created TIKA-4299: - Summary: Clean up pagination in AbstractPDF2XHTML Key: TIKA-4299 URL: https://issues.apache.org/jira/browse/TIKA-4299 Project: Tika Issue Type: Task Reporter: Tim Allison This is a follow on to TIKA-4296. We should remove the "why this hack" comment and perhaps clean up small bits of other code related to pagination in AbstractPDF2XHTML. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4290) Fix code inspection anomalies
[ https://issues.apache.org/jira/browse/TIKA-4290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874900#comment-17874900 ] Tim Allison commented on TIKA-4290: --- Thank you [~tilman] and [~dkryukov]! > Fix code inspection anomalies > - > > Key: TIKA-4290 > URL: https://issues.apache.org/jira/browse/TIKA-4290 > Project: Tika > Issue Type: Bug >Affects Versions: 2.9.2 >Reporter: Tilman Hausherr >Priority: Minor > Fix For: 3.0.0, 2.9.3 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (TIKA-4295) Allow bypass of emitKey in AbstractEmbeddedDocumentBytesHandler
[ https://issues.apache.org/jira/browse/TIKA-4295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-4295. --- Fix Version/s: 3.0.0 Resolution: Fixed > Allow bypass of emitKey in AbstractEmbeddedDocumentBytesHandler > --- > > Key: TIKA-4295 > URL: https://issues.apache.org/jira/browse/TIKA-4295 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Minor > Fix For: 3.0.0 > > > We currently algorithmically determine the emit path for the bytes from > embedded documents based on the json emit key. In other words, if the primary > emit key is {{/a/b/c.json}}, we emit to {{/a/b/c/c-1000.jpg}}, for example. > We should allow users to set a custom emit key root path so that, for > example, users could have the json going here: {{/a/b/json/abc.json}} and the > bytes going here: {{/a/b/bytes/abc-.doc}} -- This message was sent by Atlassian Jira (v8.20.10#820010)
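The path derivation described in the ticket can be sketched as follows. This is an illustration of the two behaviours as worded above, not the code in AbstractEmbeddedDocumentBytesHandler: the method names, the `bytesRoot` parameter, and the embedded id `7` used in `main` are all hypothetical.

```java
final class EmbeddedEmitPathSketch {

    /** File name of the emit key without directories or extension, e.g. "/a/b/c.json" -> "c". */
    static String baseName(String jsonEmitKey) {
        String file = jsonEmitKey.substring(jsonEmitKey.lastIndexOf('/') + 1);
        int dot = file.lastIndexOf('.');
        return dot > 0 ? file.substring(0, dot) : file;
    }

    /** Default behaviour as described: bytes go into a directory named after the json file. */
    static String defaultBytesPath(String jsonEmitKey, int embeddedId, String extension) {
        String dir = jsonEmitKey.substring(0, jsonEmitKey.lastIndexOf('/') + 1); // "/a/b/"
        String base = baseName(jsonEmitKey);                                      // "c"
        return dir + base + "/" + base + "-" + embeddedId + extension;
    }

    /** Proposed behaviour: the user supplies a separate root for the embedded bytes. */
    static String customRootBytesPath(String bytesRoot, String jsonEmitKey,
                                      int embeddedId, String extension) {
        return bytesRoot + "/" + baseName(jsonEmitKey) + "-" + embeddedId + extension;
    }

    public static void main(String[] args) {
        System.out.println(defaultBytesPath("/a/b/c.json", 1000, ".jpg"));
        // The embedded id 7 is a made-up value for illustration.
        System.out.println(customRootBytesPath("/a/b/bytes", "/a/b/json/abc.json", 7, ".doc"));
    }
}
```

Keeping the base-name logic in one helper means both variants name the embedded files the same way and differ only in where they are rooted.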
[jira] [Created] (TIKA-4295) Allow bypass of emitKey in AbstractEmbeddedDocumentBytesHandler
Tim Allison created TIKA-4295: - Summary: Allow bypass of emitKey in AbstractEmbeddedDocumentBytesHandler Key: TIKA-4295 URL: https://issues.apache.org/jira/browse/TIKA-4295 Project: Tika Issue Type: Task Reporter: Tim Allison We currently algorithmically determine the emit path for the bytes from embedded documents based on the json emit key. In other words, if the primary emit key is {{/a/b/c.json}}, we emit to {{/a/b/c/c-1000.jpg}}, for example. We should allow users to set a custom emit key root path so that, for example, users could have the json going here: {{/a/b/json/abc.json}} and the bytes going here: {{/a/b/bytes/abc-.doc}} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (TIKA-4235) Add pipeline parameter to OpenSearch emitter
[ https://issues.apache.org/jira/browse/TIKA-4235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-4235. --- Resolution: Won't Fix Reopen if needed > Add pipeline parameter to OpenSearch emitter > > > Key: TIKA-4235 > URL: https://issues.apache.org/jira/browse/TIKA-4235 > Project: Tika > Issue Type: New Feature >Reporter: Tim Allison >Priority: Trivial > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (TIKA-4294) Simplify serialization/deserialization of ParseContext
[ https://issues.apache.org/jira/browse/TIKA-4294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-4294. --- Resolution: Fixed Sorry for all the noise on this one. > Simplify serialization/deserialization of ParseContext > -- > > Key: TIKA-4294 > URL: https://issues.apache.org/jira/browse/TIKA-4294 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Assignee: Tim Allison >Priority: Trivial > Fix For: 3.0.0 > > > Via [~dimirsen] (?) and [~tilman]'s ping on TIKA-4252, we should simplify the > serialization and deserialization of ParseContext to avoid redundancy of the > superclass. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Reopened] (TIKA-4294) Simplify serialization/deserialization of ParseContext
[ https://issues.apache.org/jira/browse/TIKA-4294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison reopened TIKA-4294: --- Should include some earlier simplification proposals from https://github.com/apache/tika/pull/1805
[jira] [Resolved] (TIKA-4294) Simplify serialization/deserialization of ParseContext
[ https://issues.apache.org/jira/browse/TIKA-4294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-4294. --- Resolution: Fixed > Simplify serialization/deserialization of ParseContext > -- > > Key: TIKA-4294 > URL: https://issues.apache.org/jira/browse/TIKA-4294 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Assignee: Tim Allison >Priority: Trivial > Fix For: 3.0.0 > > > Via [~dimirsen] (?) and [~tilman]'s ping on TIKA-4252, we should simplify the > serialization and deserialization of ParseContext to avoid redundancy of the > superclass. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4251) [DISCUSS] move to cosium's git-code-format-maven-plugin with google-java-format
[ https://issues.apache.org/jira/browse/TIKA-4251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17871163#comment-17871163 ] Tim Allison commented on TIKA-4251: --- My sense is that at some point, we have to throw up our hands and hope. The initial commit will definitely be too large to review. Perhaps I misunderstand, [~ndipiazza]? I'm also still mildly annoyed that we'd still have to use checkstyle to prohibit wildcard imports, if I understand cosium + google format plugin correctly. Would we also still have to use checkstyle for license headers? I still think this is a better option than what we have now. > [DISCUSS] move to cosium's git-code-format-maven-plugin with > google-java-format > --- > > Key: TIKA-4251 > URL: https://issues.apache.org/jira/browse/TIKA-4251 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > > I was recently working a bit on incubator-stormcrawler, and I noticed that > they are using cosium's git-code-format-maven-plugin: > https://github.com/Cosium/git-code-format-maven-plugin > I was initially annoyed that I couldn't quickly figure out what I had to fix > to make the linter happy, but then I realized there was a magic command: > {{mvn git-code-format:format-code}} which just fixed the code so that the > linter passed. > The one drawback I found is that it does not fix nor does it alert on > wildcard imports. We could still use checkstyle for that but only have one > rule for checkstyle. > The other drawback is that there is not a lot of room for variation from > google's style. This may actually be a benefit, too, of course. > I just ran this on {{tika-core}} here: > https://github.com/apache/tika/tree/google-java-format > What would you think about making this change for 3.x? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4251) [DISCUSS] move to cosium's git-code-format-maven-plugin with google-java-format
[ https://issues.apache.org/jira/browse/TIKA-4251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17871162#comment-17871162 ] Tim Allison commented on TIKA-4251: --- Use intellij using the checkstyle profile? Checkstyle set to google? Or Intellij's "Code Style->Scheme" set to google (e.g. https://github.com/google/styleguide/blob/gh-pages/intellij-java-google-style.xml)? > [DISCUSS] move to cosium's git-code-format-maven-plugin with > google-java-format > --- > > Key: TIKA-4251 > URL: https://issues.apache.org/jira/browse/TIKA-4251 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > > I was recently working a bit on incubator-stormcrawler, and I noticed that > they are using cosium's git-code-format-maven-plugin: > https://github.com/Cosium/git-code-format-maven-plugin > I was initially annoyed that I couldn't quickly figure out what I had to fix > to make the linter happy, but then I realized there was a magic command: > {{mvn git-code-format:format-code}} which just fixed the code so that the > linter passed. > The one drawback I found is that it does not fix nor does it alert on > wildcard imports. We could still use checkstyle for that but only have one > rule for checkstyle. > The other drawback is that there is not a lot of room for variation from > google's style. This may actually be a benefit, too, of course. > I just ran this on {{tika-core}} here: > https://github.com/apache/tika/tree/google-java-format > What would you think about making this change for 3.x? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4294) Simplify serialization/deserialization of ParseContext
[ https://issues.apache.org/jira/browse/TIKA-4294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17871159#comment-17871159 ] Tim Allison commented on TIKA-4294: --- While adding a unit test, I found that we should also implement the TODO -- deserialize objects in lists. Once there's a clean build, I'll merge. > Simplify serialization/deserialization of ParseContext > -- > > Key: TIKA-4294 > URL: https://issues.apache.org/jira/browse/TIKA-4294 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Trivial > Fix For: 3.0.0 > > > Via [~dimirsen] (?) and [~tilman]'s ping on TIKA-4252, we should simplify the > serialization and deserialization of ParseContext to avoid redundancy of the > superclass. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Reopened] (TIKA-4294) Simplify serialization/deserialization of ParseContext
[ https://issues.apache.org/jira/browse/TIKA-4294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison reopened TIKA-4294: --- Assignee: Tim Allison > Simplify serialization/deserialization of ParseContext > -- > > Key: TIKA-4294 > URL: https://issues.apache.org/jira/browse/TIKA-4294 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Assignee: Tim Allison >Priority: Trivial > Fix For: 3.0.0 > > > Via [~dimirsen] (?) and [~tilman]'s ping on TIKA-4252, we should simplify the > serialization and deserialization of ParseContext to avoid redundancy of the > superclass. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4294) Simplify serialization/deserialization of ParseContext
[ https://issues.apache.org/jira/browse/TIKA-4294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17871137#comment-17871137 ] Tim Allison commented on TIKA-4294: --- K. Got it. This fixes the key to be the super class -- https://github.com/apache/tika/pull/1887/. I'll add a unit test before merging to guarantee expected behavior. > Simplify serialization/deserialization of ParseContext > -- > > Key: TIKA-4294 > URL: https://issues.apache.org/jira/browse/TIKA-4294 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Trivial > Fix For: 3.0.0 > > > Via [~dimirsen] (?) and [~tilman]'s ping on TIKA-4252, we should simplify the > serialization and deserialization of ParseContext to avoid redundancy of the > superclass. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4294) Simplify serialization/deserialization of ParseContext
[ https://issues.apache.org/jira/browse/TIKA-4294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17871131#comment-17871131 ] Tim Allison commented on TIKA-4294: --- This is an example of what the json might look like. {code:json} "parseContext": { "org.apache.tika.metadata.filter.MetadataFilter": { "_class": "org.apache.tika.metadata.filter.CompositeMetadataFilter", "filters": [ { "_class": "org.apache.tika.metadata.filter.SomeFilter", ... {code} > Simplify serialization/deserialization of ParseContext > -- > > Key: TIKA-4294 > URL: https://issues.apache.org/jira/browse/TIKA-4294 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Trivial > Fix For: 3.0.0 > > > Via [~dimirsen] (?) and [~tilman]'s ping on TIKA-4252, we should simplify the > serialization and deserialization of ParseContext to avoid redundancy of the > superclass. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4294) Simplify serialization/deserialization of ParseContext
[ https://issues.apache.org/jira/browse/TIKA-4294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17871130#comment-17871130 ] Tim Allison commented on TIKA-4294: --- The key in ParseContext should be {{superClazz}}, and I have a PR to fix that. :/ > Simplify serialization/deserialization of ParseContext > -- > > Key: TIKA-4294 > URL: https://issues.apache.org/jira/browse/TIKA-4294 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Trivial > Fix For: 3.0.0 > > > Via [~dimirsen] (?) and [~tilman]'s ping on TIKA-4252, we should simplify the > serialization and deserialization of ParseContext to avoid redundancy of the > superclass. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4294) Simplify serialization/deserialization of ParseContext
[ https://issues.apache.org/jira/browse/TIKA-4294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17871128#comment-17871128 ] Tim Allison commented on TIKA-4294: --- Thank you, @tilman. Apologies... will {{className}} never equal {{superClassName}}? And, y, my thinking was that we'd avoid having to call {{Class.forName}} twice if we're doing it for the same class -- imaginary efficiency. :D > Simplify serialization/deserialization of ParseContext > -- > > Key: TIKA-4294 > URL: https://issues.apache.org/jira/browse/TIKA-4294 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Trivial > Fix For: 3.0.0 > > > Via [~dimirsen] (?) and [~tilman]'s ping on TIKA-4252, we should simplify the > serialization and deserialization of ParseContext to avoid redundancy of the > superclass. -- This message was sent by Atlassian Jira (v8.20.10#820010)
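A minimal Java sketch of the lookup discussed in the comment above, assuming only the two names it mentions ({{className}} and {{superClassName}}). The class {{ClassLookupSketch}} and the method {{resolve}} are hypothetical and are not Tika code; the sketch only shows how the second {{Class.forName}} call could be skipped when both names are equal.

```java
// Hypothetical illustration; not taken from the Tika code base.
class ClassLookupSketch {

    // Returns {superClazz, clazz}; the class is looked up only once when both names match.
    static Class<?>[] resolve(String superClassName, String className) {
        try {
            Class<?> superClazz = Class.forName(superClassName);
            Class<?> clazz = className.equals(superClassName)
                    ? superClazz
                    : Class.forName(className);
            // Guard against a concrete class that does not extend/implement the key type.
            if (!superClazz.isAssignableFrom(clazz)) {
                throw new IllegalArgumentException(
                        className + " is not a subtype of " + superClassName);
            }
            return new Class<?>[] {superClazz, clazz};
        } catch (ClassNotFoundException e) {
            throw new IllegalArgumentException("Unknown class: " + e.getMessage(), e);
        }
    }
}
```

Whether the saved lookup matters in practice is doubtful, as the comment itself says; the subtype check is an extra assumption added for safety in this sketch.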
[jira] [Resolved] (TIKA-4294) Simplify serialization/deserialization of ParseContext
[ https://issues.apache.org/jira/browse/TIKA-4294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-4294. --- Fix Version/s: 3.0.0 Resolution: Fixed Thank you [~dimirsen] and [~tilman]! > Simplify serialization/deserialization of ParseContext > -- > > Key: TIKA-4294 > URL: https://issues.apache.org/jira/browse/TIKA-4294 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Trivial > Fix For: 3.0.0 > > > Via [~dimirsen] (?) and [~tilman]'s ping on TIKA-4252, we should simplify the > serialization and deserialization of ParseContext to avoid redundancy of the > superclass. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?
[ https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17871095#comment-17871095 ] Tim Allison commented on TIKA-4252: --- Thank you [~tilman]! I'll work on cleaning this up here: https://issues.apache.org/jira/browse/TIKA-4294 > PipesClient#process - seems to lose the Fetch input metadata? > - > > Key: TIKA-4252 > URL: https://issues.apache.org/jira/browse/TIKA-4252 > Project: Tika > Issue Type: Bug >Reporter: Nicholas DiPiazza >Priority: Major > Fix For: 3.0.0 > > > When calling: > PipesResult pipesResult = pipesClient.process(new > FetchEmitTuple(request.getFetchKey(), > new FetchKey(fetcher.getName(), request.getFetchKey()), > new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG, > FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP)); > the tikaMetadata is not present in the fetch data when the fetch method is > called. > > It's OK through this part: > UnsynchronizedByteArrayOutputStream bos = > UnsynchronizedByteArrayOutputStream.builder().get(); > try (ObjectOutputStream objectOutputStream = new > ObjectOutputStream(bos)) > { objectOutputStream.writeObject(t); } > byte[] bytes = bos.toByteArray(); > output.write(CALL.getByte()); > output.writeInt(bytes.length); > output.write(bytes); > output.flush(); > > I verified the bytes have the expected metadata from that point. > > UPDATE: found the issue > > org.apache.tika.pipes.PipesServer#parseFromTuple > > is using a new Metadata when it should only use empty metadata if fetch tuple > metadata is null. -- This message was sent by Atlassian Jira (v8.20.10#820010)
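The fix described in the UPDATE above (use the tuple's metadata and fall back to an empty one only when it is null) can be sketched as follows. This is an illustration under stated assumptions: {{TupleSketch}} and {{MetadataFallbackSketch}} are hypothetical stand-ins, and a plain {{Map}} takes the place of Tika's {{Metadata}}; it is not the actual {{PipesServer#parseFromTuple}} code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the fetch/emit tuple; not the real FetchEmitTuple.
class TupleSketch {
    private final Map<String, String> metadata; // may be null if the caller supplied none

    TupleSketch(Map<String, String> metadata) {
        this.metadata = metadata;
    }

    Map<String, String> getMetadata() {
        return metadata;
    }
}

class MetadataFallbackSketch {

    // Reuse the caller's metadata; create an empty one only when none was supplied.
    static Map<String, String> metadataFor(TupleSketch tuple) {
        Map<String, String> supplied = tuple.getMetadata();
        return supplied != null ? supplied : new HashMap<>();
    }
}
```

The point of the sketch is only the null check: unconditionally creating a fresh object at this step is what would drop the values the client sent.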
[jira] [Created] (TIKA-4294) Simplify serialization/deserialization of ParseContext
Tim Allison created TIKA-4294: - Summary: Simplify serialization/deserialization of ParseContext Key: TIKA-4294 URL: https://issues.apache.org/jira/browse/TIKA-4294 Project: Tika Issue Type: Task Reporter: Tim Allison Via [~dimirsen] (?) and [~tilman]'s ping on TIKA-4252, we should simplify the serialization and deserialization of ParseContext to avoid redundancy of the superclass. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4291) In JDBCEmitter local var dateFormats shadows class field with the same name
[ https://issues.apache.org/jira/browse/TIKA-4291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17871086#comment-17871086 ] Tim Allison commented on TIKA-4291: --- LGTM. Thank you! > In JDBCEmitter local var dateFormats shadows class field with the same name > --- > > Key: TIKA-4291 > URL: https://issues.apache.org/jira/browse/TIKA-4291 > Project: Tika > Issue Type: Bug > Components: tika-pipes >Affects Versions: 3.0.0-BETA, 2.9.2 >Reporter: Dmitrii Kriukov >Assignee: Tilman Hausherr >Priority: Major > Fix For: 3.0.0, 2.9.3 > > > Line 338 of JDBCEmitter > Local variable dateFormats is created, populated with values, but never used > in its scope. > It's not clear how to fix. Was it planned to use the class field with the same > type and name? -- This message was sent by Atlassian Jira (v8.20.10#820010)
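For readers unfamiliar with the pattern reported above, here is a small, self-contained Java sketch of a local variable shadowing a field of the same name. The class {{ShadowingSketch}} and its methods are hypothetical; only the name {{dateFormats}} comes from the issue, and the "fix" shown is just one possible reading, since the report itself says the intended behavior was unclear.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the shadowing pattern; not the JDBCEmitter code.
class ShadowingSketch {
    private final List<String> dateFormats = new ArrayList<>();

    // Shadowing shape: the local list hides the field, so the field stays empty
    // and the populated local list is discarded when the method returns.
    void loadShadowed() {
        List<String> dateFormats = new ArrayList<>();
        dateFormats.add("yyyy-MM-dd");
    }

    // One possible fix (an assumption): populate the field directly.
    void loadIntoField() {
        dateFormats.add("yyyy-MM-dd");
    }

    int fieldSize() {
        return dateFormats.size();
    }
}
```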
[jira] [Resolved] (TIKA-4289) Further improvements to the metadata filter and serialization
[ https://issues.apache.org/jira/browse/TIKA-4289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-4289. --- Fix Version/s: 3.0.0 Resolution: Fixed > Further improvements to the metadata filter and serialization > - > > Key: TIKA-4289 > URL: https://issues.apache.org/jira/browse/TIKA-4289 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Trivial > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (TIKA-4288) Allow user configuration of MetadataFilters in PipesServer
[ https://issues.apache.org/jira/browse/TIKA-4288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-4288. --- Fix Version/s: 3.0.0 Resolution: Fixed > Allow user configuration of MetadataFilters in PipesServer > -- > > Key: TIKA-4288 > URL: https://issues.apache.org/jira/browse/TIKA-4288 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > Fix For: 3.0.0 > > > We're currently configuring metadata filters in tika-config.xml. It would be > helpful (at least in the PipesServer?) to allow users to configure metadata > filters at parse time via the ParseContext. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (TIKA-4287) Improve PDFParserConfig serialization
[ https://issues.apache.org/jira/browse/TIKA-4287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-4287. --- Fix Version/s: 3.0.0 Resolution: Fixed > Improve PDFParserConfig serialization > - > > Key: TIKA-4287 > URL: https://issues.apache.org/jira/browse/TIKA-4287 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Trivial > Fix For: 3.0.0 > > > I thought I had fixed this as part of the parsecontext serialization work. > The fix didn't make it into main. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-4289) Further improvements to the metadata filter and serialization
Tim Allison created TIKA-4289: - Summary: Further improvements to the metadata filter and serialization Key: TIKA-4289 URL: https://issues.apache.org/jira/browse/TIKA-4289 Project: Tika Issue Type: Task Reporter: Tim Allison -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-4288) Allow user configuration of MetadataFilters in PipesServer
Tim Allison created TIKA-4288: - Summary: Allow user configuration of MetadataFilters in PipesServer Key: TIKA-4288 URL: https://issues.apache.org/jira/browse/TIKA-4288 Project: Tika Issue Type: Task Reporter: Tim Allison We're currently configuring metadata filters in tika-config.xml. It would be helpful (at least in the PipesServer?) to allow users to configure metadata filters at parse time via the ParseContext. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-4287) Improve PDFParserConfig serialization
Tim Allison created TIKA-4287: - Summary: Improve PDFParserConfig serialization Key: TIKA-4287 URL: https://issues.apache.org/jira/browse/TIKA-4287 Project: Tika Issue Type: Task Reporter: Tim Allison I thought I had fixed this as part of the parsecontext serialization work. The fix didn't make it into main. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4280) Tasks for the 3.0.0 release
[ https://issues.apache.org/jira/browse/TIKA-4280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17868724#comment-17868724 ] Tim Allison commented on TIKA-4280: --- Y, probably? I wasn't thinking of changing tika-server for 3.x just yet, but I thought we could start fresh with non-shaded jars with the tika-grpc server? I'd hope for something similar to/as simple as this line: https://github.com/apache/tika-docker/blob/main/minimal/Dockerfile#L68 but know that .bat and .sh files get more complex shortly after the first proof of concept is developed. :D What do you think? One option for tika-server, is that we move to non-shaded jars in the service scripts and actually fix them so that they work. > Tasks for the 3.0.0 release > --- > > Key: TIKA-4280 > URL: https://issues.apache.org/jira/browse/TIKA-4280 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > > I'm too lazy to open separate tickets. Please do so if desired. > Some items: > * Before releasing the real 3.0.0 we need to remove any "-M" dependencies > * Decide about the ffmpeg issue and the hdf5 issue > * Run the regression tests vs 2.9.x > * Convert tika-grpc to use the dependency plugin instead of the shade plugin > * Turn javadocs back on. I got errors during the deploy process because > javadoc needed the auto-generated code ("cannot find symbol > DeleteFetcherRequest"). We need to enable javadocs for the rest of the > project. > Other things? Thank you [~tilman] for the first two! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (TIKA-4280) Tasks for the 3.0.0 release
[ https://issues.apache.org/jira/browse/TIKA-4280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17868724#comment-17868724 ] Tim Allison edited comment on TIKA-4280 at 7/25/24 3:49 PM: Y, probably? I wasn't thinking of changing tika-server for 3.x just yet, but I thought we could start fresh with non-shaded jars with the tika-grpc server? I'd hope for something similar to/as simple as this line: https://github.com/apache/tika-docker/blob/main/minimal/Dockerfile#L68 but know that .bat and .sh files get more complex shortly after the first proof of concept is developed. :D What do you think? One option for tika-server, is that we keep the fat jar, but that we also move to non-shaded jars in the service scripts and actually fix them so that they work. was (Author: talli...@mitre.org): Y, probably? I wasn't thinking of changing tika-server for 3.x just yet, but I thought we could start fresh with non-shaded jars with the tika-grpc server? I'd hope for something similar to/as simple as this line: https://github.com/apache/tika-docker/blob/main/minimal/Dockerfile#L68 but know that .bat and .sh files get more complex shortly after the first proof of concept is developed. :D What do you think? One option for tika-server, is that we move to non-shaded jars in the service scripts and actually fix them so that they work. > Tasks for the 3.0.0 release > --- > > Key: TIKA-4280 > URL: https://issues.apache.org/jira/browse/TIKA-4280 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > > I'm too lazy to open separate tickets. Please do so if desired. > Some items: > * Before releasing the real 3.0.0 we need to remove any "-M" dependencies > * Decide about the ffmpeg issue and the hdf5 issue > * Run the regression tests vs 2.9.x > * Convert tika-grpc to use the dependency plugin instead of the shade plugin > * Turn javadocs back on. I got errors during the deploy process because > javadoc needed the auto-generated code ("cannot find symbol > DeleteFetcherRequest"). We need to enable javadocs for the rest of the > project. > Other things? Thank you [~tilman] for the first two! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (TIKA-4285) Invalid Link for changelog CHANGES.txt files
[ https://issues.apache.org/jira/browse/TIKA-4285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-4285. --- Resolution: Fixed Thank you [~tom_1st] and [~tilman]! Should be fixed now. > Invalid Link for changelog CHANGES.txt files > > > Key: TIKA-4285 > URL: https://issues.apache.org/jira/browse/TIKA-4285 > Project: Tika > Issue Type: Task >Affects Versions: 2.9.0, 2.9.1, 2.9.2 >Reporter: Lonzak >Assignee: Tim Allison >Priority: Major > > On the tika [start page|https://tika.apache.org/] the linked change log files > CHANGES.txt starting with version 2.9.0 are missing/broken. > > {+}Working{+}: > [https://archive.apache.org/dist/tika/2.8.0/CHANGES-2.8.0.txt] > [https://dist.apache.org/repos/dist/release/tika/3.0.0-BETA2/CHANGES-3.0.0-BETA2.txt] > +Not working:+ > [https://archive.apache.org/dist/]{-}{color:#ff}release{color}{-}/tika/2.9.0/CHANGES-2.9.0.txt > [https://archive.apache.org/dist/]{-}{color:#ff}release{color}{-}/tika/2.9.1/CHANGES-2.9.1.txt > [https://archive.apache.org/dist/]{-}{color:#ff}release{color}{-}/tika/2.9.2/CHANGES-2.9.2.txt > [https://archive.apache.org/dist/tika/3.0.0-BETA/CHANGES-3.0.0.txt] > > +*Wrong Text*+ (as mentioned by Tilman) > 15 July 2024: Apache Tika Release Apache Tika *3.0.0-BETA2* has been > released! This release includes several bug fixes and dependency upgrades. > Please see the > [CHANGES.txt|https://dist.apache.org/repos/dist/release/tika/3.0.0-BETA2/CHANGES-3.0.0-BETA2.txt] > file for the full list of changes in the release and have a look at the > download page for more information on how to obtain{color:#FF} *Apache > Tika 2.9.2.*{color} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (TIKA-4285) Invalid Link for changelog CHANGES.txt files
[ https://issues.apache.org/jira/browse/TIKA-4285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison reassigned TIKA-4285: - Assignee: Tim Allison > Invalid Link for changelog CHANGES.txt files > > > Key: TIKA-4285 > URL: https://issues.apache.org/jira/browse/TIKA-4285 > Project: Tika > Issue Type: Task >Affects Versions: 2.9.0, 2.9.1, 2.9.2 >Reporter: Lonzak >Assignee: Tim Allison >Priority: Major > > On the tika [start page|https://tika.apache.org/] the linked change log files > CHANGES.txt starting with version 2.9.0 are missing/broken. > > {+}Working{+}: > [https://archive.apache.org/dist/tika/2.8.0/CHANGES-2.8.0.txt] > [https://dist.apache.org/repos/dist/release/tika/3.0.0-BETA2/CHANGES-3.0.0-BETA2.txt] > +Not working:+ > [https://archive.apache.org/dist/]{-}{color:#ff}release{color}{-}/tika/2.9.0/CHANGES-2.9.0.txt > [https://archive.apache.org/dist/]{-}{color:#ff}release{color}{-}/tika/2.9.1/CHANGES-2.9.1.txt > [https://archive.apache.org/dist/]{-}{color:#ff}release{color}{-}/tika/2.9.2/CHANGES-2.9.2.txt > [https://archive.apache.org/dist/tika/3.0.0-BETA/CHANGES-3.0.0.txt] > > +*Wrong Text*+ (as mentioned by Tilman) > 15 July 2024: Apache Tika Release Apache Tika *3.0.0-BETA2* has been > released! This release includes several bug fixes and dependency upgrades. > Please see the > [CHANGES.txt|https://dist.apache.org/repos/dist/release/tika/3.0.0-BETA2/CHANGES-3.0.0-BETA2.txt] > file for the full list of changes in the release and have a look at the > download page for more information on how to obtain{color:#FF} *Apache > Tika 2.9.2.*{color} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4281) Fix javadoc plugin configuration
[ https://issues.apache.org/jira/browse/TIKA-4281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17866843#comment-17866843 ] Tim Allison commented on TIKA-4281: --- For some reason, now, it looks like {{javadocs}} works fine with {{install}} as long as {{install}} is after it in the commandline? Ha, but then the packaging step doesn't work. If anyone knows what's going on, please help. This is weird. > Fix javadoc plugin configuration > > > Key: TIKA-4281 > URL: https://issues.apache.org/jira/browse/TIKA-4281 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Trivial > > We added a sourcepath of src/main/java to our parent pom back in > April when we updated the Apache parent pom. This line prevented the > generation of javadocs for the 3.0.0-BETA2 release. I don't know if we need > that line with our current version of the ASF parent pom. > Let's figure it out. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4281) Fix javadoc plugin configuration
[ https://issues.apache.org/jira/browse/TIKA-4281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17866742#comment-17866742 ] Tim Allison commented on TIKA-4281: --- Well, that didn't work: {{javadoc: error - No source files for package ...}}. This is likely the problem that led to adding the sourcepath. What I can't figure out is that {{javadocs:aggregate}} ran without any complaints even in debug mode (that I could find in the voluminous logs) but left nothing in {{target/site}}. When I removed the line in the 3.0.0-BETA2 release's pom, {{javadocs:aggregate}} worked as expected. However, now we're getting that javadoc error in ci/cd, and I'm getting a related one -- just a different package -- locally. > Fix javadoc plugin configuration > > > Key: TIKA-4281 > URL: https://issues.apache.org/jira/browse/TIKA-4281 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Trivial > > We added a sourcepath of src/main/java to our parent pom back in > April when we updated the Apache parent pom. This line prevented the > generation of javadocs for the 3.0.0-BETA2 release. I don't know if we need > that line with our current version of the ASF parent pom. > Let's figure it out. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-4281) Fix javadoc plugin configuration
Tim Allison created TIKA-4281: - Summary: Fix javadoc plugin configuration Key: TIKA-4281 URL: https://issues.apache.org/jira/browse/TIKA-4281 Project: Tika Issue Type: Task Reporter: Tim Allison We added a sourcepath of src/main/java to our parent pom back in April when we updated the Apache parent pom. This line prevented the generation of javadocs for the 3.0.0-BETA2 release. I don't know if we need that line with our current version of the ASF parent pom. Let's figure it out. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4280) Tasks for the 3.0.0 release
[ https://issues.apache.org/jira/browse/TIKA-4280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-4280: -- Description: I'm too lazy to open separate tickets. Please do so if desired. Some items: * Before releasing the real 3.0.0 we need to remove any "-M" dependencies * Decide about the ffmpeg issue and the hdf5 issue * Run the regression tests vs 2.9.x * Convert tika-grpc to use the dependency plugin instead of the shade plugin * Turn javadocs back on. I got errors during the deploy process because javadoc needed the auto-generated code ("cannot find symbol DeleteFetcherRequest"). We need to enable javadocs for the rest of the project. Other things? Thank you [~tilman] for the first two! was: I'm too lazy to open separate tickets. Please do so if desired. Some items: * Before releasing the real 3.0.0 we need to remove any "-M" dependencies * Decide about the ffmpeg issue and the hdf5 issue * Run the regression tests vs 2.9.x * Convert tika-grpc to use the dependency plugin instead of the shade plugin * Turn javadocs back on. I got errors during the deploy process because javadoc did not like the auto-generated code. We need to enable javadocs for the rest of the project. Other things? Thank you [~tilman] for the first two! > Tasks for the 3.0.0 release > --- > > Key: TIKA-4280 > URL: https://issues.apache.org/jira/browse/TIKA-4280 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > > I'm too lazy to open separate tickets. Please do so if desired. > Some items: > * Before releasing the real 3.0.0 we need to remove any "-M" dependencies > * Decide about the ffmpeg issue and the hdf5 issue > * Run the regression tests vs 2.9.x > * Convert tika-grpc to use the dependency plugin instead of the shade plugin > * Turn javadocs back on. I got errors during the deploy process because > javadoc needed the auto-generated code ("cannot find symbol > DeleteFetcherRequest"). We need to enable javadocs for the rest of the > project. > Other things? Thank you [~tilman] for the first two! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4280) Tasks for the 3.0.0 release
[ https://issues.apache.org/jira/browse/TIKA-4280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-4280: -- Description: I'm too lazy to open separate tickets. Please do so if desired. Some items: * Before releasing the real 3.0.0 we need to remove any "-M" dependencies * Decide about the ffmpeg issue and the hdf5 issue * Run the regression tests vs 2.9.x * Convert tika-grpc to use the dependency plugin instead of the shade plugin * Turn javadocs back on. I got errors during the deploy process because javadoc did not like the auto-generated code. We need to enable javadocs for the rest of the project. Other things? Thank you [~tilman] for the first two! was: I'm too lazy to open separate tickets. Please do so if desired. Some items: * Before releasing the real 3.0.0 we need to remove any "-M" dependencies * Decide about the ffmpeg issue and the hdf5 issue * Run the regression tests vs 2.9.x * Convert tika-grpc to use the dependency plugin instead of the shade plugin Other things? Thank you [~tilman] for the first two! > Tasks for the 3.0.0 release > --- > > Key: TIKA-4280 > URL: https://issues.apache.org/jira/browse/TIKA-4280 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > > I'm too lazy to open separate tickets. Please do so if desired. > Some items: > * Before releasing the real 3.0.0 we need to remove any "-M" dependencies > * Decide about the ffmpeg issue and the hdf5 issue > * Run the regression tests vs 2.9.x > * Convert tika-grpc to use the dependency plugin instead of the shade plugin > * Turn javadocs back on. I got errors during the deploy process because > javadoc did not like the auto-generated code. We need to enable javadocs for > the rest of the project. > Other things? Thank you [~tilman] for the first two! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-4280) Tasks for the 3.0.0 release
Tim Allison created TIKA-4280: - Summary: Tasks for the 3.0.0 release Key: TIKA-4280 URL: https://issues.apache.org/jira/browse/TIKA-4280 Project: Tika Issue Type: Task Reporter: Tim Allison I'm too lazy to open separate tickets. Please do so if desired. Some items: * Before releasing the real 3.0.0 we need to remove any "-M" dependencies * Decide about the ffmpeg issue and the hdf5 issue * Run the regression tests vs 2.9.x * Convert tika-grpc to use the dependency plugin instead of the shade plugin Other things? Thank you [~tilman] for the first two! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4278) TextAndCSVParser doesn't detect semicolon separated file
[ https://issues.apache.org/jira/browse/TIKA-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17866075#comment-17866075 ] Tim Allison commented on TIKA-4278: --- Thank you, [~tilman], y, that's probably an oversight on my part in the initial commits. Thank you for working on this. > TextAndCSVParser doesn't detect semicolon separated file > > > Key: TIKA-4278 > URL: https://issues.apache.org/jira/browse/TIKA-4278 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 2.9.2 >Reporter: Tilman Hausherr >Assignee: Tilman Hausherr >Priority: Major > Labels: csv, csvparser > Fix For: 3.0.0, 2.9.3 > > Attachments: reports_csv_2.9.2_vs_2.9.3.tar.xz > > > I ran the code from the attached SO issue and yes it doesn't detect semicolon > separated files. The reason is this line in {{TextAndCSVParser.java}}: > {code:java} > private static final char[] DEFAULT_DELIMITERS = new char[]{',', '\t'}; > {code} > This is later used by {{CSVSniffer}}. For some reason the other delimiters > (pipe, colon and semicolon) aren't in that array, although they are in > {{CHAR_TO_STRING_DELIMITER_MAP}}. I modified {{DEFAULT_DELIMITERS}} and now > it works for semicolon. > Can I change this by adding the missing delimiters or was there a reason that > I missed? The proposed change would make the delimiters in CSVSniffer a set > and then pass {{CHAR_TO_STRING_DELIMITER_MAP.keySet()}}. -- This message was sent by Atlassian Jira (v8.20.10#820010)
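The delimiter problem described in TIKA-4278 can be illustrated with a minimal sketch. This is not Tika's CSVSniffer: the class name DelimiterSniffSketch, the method guessDelimiter, the map DELIMITER_NAMES and its entries are all hypothetical stand-ins, and the frequency count is a deliberately naive heuristic. It only shows why a candidate set limited to comma and tab cannot find a semicolon, while passing the key set of a fuller delimiter map can.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class DelimiterSniffSketch {

    // Hypothetical stand-in for the delimiter map named in the ticket; the
    // real CHAR_TO_STRING_DELIMITER_MAP in Tika may hold different entries.
    public static final Map<Character, String> DELIMITER_NAMES = new LinkedHashMap<>();

    static {
        DELIMITER_NAMES.put(',', "comma");
        DELIMITER_NAMES.put('\t', "tab");
        DELIMITER_NAMES.put('|', "pipe");
        DELIMITER_NAMES.put(';', "semicolon");
        DELIMITER_NAMES.put(':', "colon");
    }

    // Naive heuristic (an assumption, not Tika's algorithm): return the
    // candidate that occurs most often in the sample, or null if none occurs.
    public static Character guessDelimiter(String sample, Set<Character> candidates) {
        Character best = null;
        int bestCount = 0;
        for (char candidate : candidates) {
            int count = 0;
            for (int i = 0; i < sample.length(); i++) {
                if (sample.charAt(i) == candidate) {
                    count++;
                }
            }
            if (count > bestCount) {
                bestCount = count;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String sample = "name;age\nalice;30\nbob;25\n";
        // Only comma and tab as candidates: the semicolon is never considered.
        System.out.println(guessDelimiter(sample, Set.of(',', '\t'))); // prints "null"
        // All keys of the delimiter map as candidates: the semicolon is found.
        System.out.println(guessDelimiter(sample, DELIMITER_NAMES.keySet())); // prints ";"
    }
}
```

Passing the map's key set keeps the candidate list and the name mapping in one place, which is the design point raised in the ticket.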
[jira] [Created] (TIKA-4275) Make tika-grpc a top-level module
Tim Allison created TIKA-4275: - Summary: Make tika-grpc a top-level module Key: TIKA-4275 URL: https://issues.apache.org/jira/browse/TIKA-4275 Project: Tika Issue Type: Task Reporter: Tim Allison I'd like to move tika-grpc to the top level to be at the same level as tika-server. Any objections? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4272) create tika docker image for tika-grpc
[ https://issues.apache.org/jira/browse/TIKA-4272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17860241#comment-17860241 ] Tim Allison commented on TIKA-4272: --- Y, I concur, we should have a completely separate image. > create tika docker image for tika-grpc > -- > > Key: TIKA-4272 > URL: https://issues.apache.org/jira/browse/TIKA-4272 > Project: Tika > Issue Type: New Feature > Components: tika-pipes >Reporter: Nicholas DiPiazza >Priority: Major > > Now that the tika-grpc branch has been merged to main, we need a tika-grpc > server image. > I thought for a bit about using the same tika docker image as we already use > but that is probably not a good idea because there are vastly different jar > files needed for tika-grpc -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4251) [DISCUSS] move to cosium's git-code-format-maven-plugin with google-java-format
[ https://issues.apache.org/jira/browse/TIKA-4251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17860035#comment-17860035 ] Tim Allison commented on TIKA-4251: --- W00t! > [DISCUSS] move to cosium's git-code-format-maven-plugin with > google-java-format > --- > > Key: TIKA-4251 > URL: https://issues.apache.org/jira/browse/TIKA-4251 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > > I was recently working a bit on incubator-stormcrawler, and I noticed that > they are using cosium's git-code-format-maven-plugin: > https://github.com/Cosium/git-code-format-maven-plugin > I was initially annoyed that I couldn't quickly figure out what I had to fix > to make the linter happy, but then I realized there was a magic command: > {{mvn git-code-format:format-code}} which just fixed the code so that the > linter passed. > The one drawback I found is that it does not fix nor does it alert on > wildcard imports. We could still use checkstyle for that but only have one > rule for checkstyle. > The other drawback is that there is not a lot of room for variation from > google's style. This may actually be a benefit, too, of course. > I just ran this on {{tika-core}} here: > https://github.com/apache/tika/tree/google-java-format > What would you think about making this change for 3.x? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4251) [DISCUSS] move to cosium's git-code-format-maven-plugin with google-java-format
[ https://issues.apache.org/jira/browse/TIKA-4251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17860020#comment-17860020 ] Tim Allison commented on TIKA-4251: --- Sounds great. My personal preference would be to move away from our custom formatting to a standard, maybe google? I'd also like to forbid wildcarding, if possible. Fellow devs, any objections? Is that possible [~ndipiazza]? > [DISCUSS] move to cosium's git-code-format-maven-plugin with > google-java-format > --- > > Key: TIKA-4251 > URL: https://issues.apache.org/jira/browse/TIKA-4251 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > > I was recently working a bit on incubator-stormcrawler, and I noticed that > they are using cosium's git-code-format-maven-plugin: > https://github.com/Cosium/git-code-format-maven-plugin > I was initially annoyed that I couldn't quickly figure out what I had to fix > to make the linter happy, but then I realized there was a magic command: > {{mvn git-code-format:format-code}} which just fixed the code so that the > linter passed. > The one drawback I found is that it does not fix nor does it alert on > wildcard imports. We could still use checkstyle for that but only have one > rule for checkstyle. > The other drawback is that there is not a lot of room for variation from > google's style. This may actually be a benefit, too, of course. > I just ran this on {{tika-core}} here: > https://github.com/apache/tika/tree/google-java-format > What would you think about making this change for 3.x? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4251) [DISCUSS] move to cosium's git-code-format-maven-plugin with google-java-format
[ https://issues.apache.org/jira/browse/TIKA-4251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17860007#comment-17860007 ] Tim Allison commented on TIKA-4251: --- > we eat the 1-time-format cost That's where the vulnerability is. That one huge one-time commit that'll be too big to review. :D Y, I completely agree, and I'm eager to move forward...with fingers crossed. If I never get another checkstyle failed build, I will be happy. :D > [DISCUSS] move to cosium's git-code-format-maven-plugin with > google-java-format > --- > > Key: TIKA-4251 > URL: https://issues.apache.org/jira/browse/TIKA-4251 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > > I was recently working a bit on incubator-stormcrawler, and I noticed that > they are using cosium's git-code-format-maven-plugin: > https://github.com/Cosium/git-code-format-maven-plugin > I was initially annoyed that I couldn't quickly figure out what I had to fix > to make the linter happy, but then I realized there was a magic command: > {{mvn git-code-format:format-code}} which just fixed the code so that the > linter passed. > The one drawback I found is that it does not fix nor does it alert on > wildcard imports. We could still use checkstyle for that but only have one > rule for checkstyle. > The other drawback is that there is not a lot of room for variation from > google's style. This may actually be a benefit, too, of course. > I just ran this on {{tika-core}} here: > https://github.com/apache/tika/tree/google-java-format > What would you think about making this change for 3.x? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4251) [DISCUSS] move to cosium's git-code-format-maven-plugin with google-java-format
[ https://issues.apache.org/jira/browse/TIKA-4251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1785#comment-1785 ] Tim Allison commented on TIKA-4251: --- Makes sense. Tilman's observation is legit, and I don't see a way around it. We took that risk with my commits to get checkstyle to work, but there's now a new supply chain vuln by allowing the plugin to make more changes than we can reasonably review. Cross our fingers and hope? > [DISCUSS] move to cosium's git-code-format-maven-plugin with > google-java-format > --- > > Key: TIKA-4251 > URL: https://issues.apache.org/jira/browse/TIKA-4251 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > > I was recently working a bit on incubator-stormcrawler, and I noticed that > they are using cosium's git-code-format-maven-plugin: > https://github.com/Cosium/git-code-format-maven-plugin > I was initially annoyed that I couldn't quickly figure out what I had to fix > to make the linter happy, but then I realized there was a magic command: > {{mvn git-code-format:format-code}} which just fixed the code so that the > linter passed. > The one drawback I found is that it does not fix nor does it alert on > wildcard imports. We could still use checkstyle for that but only have one > rule for checkstyle. > The other drawback is that there is not a lot of room for variation from > google's style. This may actually be a benefit, too, of course. > I just ran this on {{tika-core}} here: > https://github.com/apache/tika/tree/google-java-format > What would you think about making this change for 3.x? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (TIKA-4251) [DISCUSS] move to cosium's git-code-format-maven-plugin with google-java-format
[ https://issues.apache.org/jira/browse/TIKA-4251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17859739#comment-17859739 ] Tim Allison edited comment on TIKA-4251 at 6/25/24 6:19 PM: Y. I agree. When I started with checkstyle, I had to modify a lot of files. Any recs for mitigating this? was (Author: talli...@mitre.org): Y. I agree. When I started with checkstyle, it modified nearly every file. Any recs for mitigating this? > [DISCUSS] move to cosium's git-code-format-maven-plugin with > google-java-format > --- > > Key: TIKA-4251 > URL: https://issues.apache.org/jira/browse/TIKA-4251 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > > I was recently working a bit on incubator-stormcrawler, and I noticed that > they are using cosium's git-code-format-maven-plugin: > https://github.com/Cosium/git-code-format-maven-plugin > I was initially annoyed that I couldn't quickly figure out what I had to fix > to make the linter happy, but then I realized there was a magic command: > {{mvn git-code-format:format-code}} which just fixed the code so that the > linter passed. > The one drawback I found is that it does not fix nor does it alert on > wildcard imports. We could still use checkstyle for that but only have one > rule for checkstyle. > The other drawback is that there is not a lot of room for variation from > google's style. This may actually be a benefit, too, of course. > I just ran this on {{tika-core}} here: > https://github.com/apache/tika/tree/google-java-format > What would you think about making this change for 3.x? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4251) [DISCUSS] move to cosium's git-code-format-maven-plugin with google-java-format
[ https://issues.apache.org/jira/browse/TIKA-4251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17859739#comment-17859739 ] Tim Allison commented on TIKA-4251: --- Y. I agree. When I started with checkstyle, it modified nearly every file. Any recs for mitigating this? > [DISCUSS] move to cosium's git-code-format-maven-plugin with > google-java-format > --- > > Key: TIKA-4251 > URL: https://issues.apache.org/jira/browse/TIKA-4251 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > > I was recently working a bit on incubator-stormcrawler, and I noticed that > they are using cosium's git-code-format-maven-plugin: > https://github.com/Cosium/git-code-format-maven-plugin > I was initially annoyed that I couldn't quickly figure out what I had to fix > to make the linter happy, but then I realized there was a magic command: > {{mvn git-code-format:format-code}} which just fixed the code so that the > linter passed. > The one drawback I found is that it does not fix nor does it alert on > wildcard imports. We could still use checkstyle for that but only have one > rule for checkstyle. > The other drawback is that there is not a lot of room for variation from > google's style. This may actually be a benefit, too, of course. > I just ran this on {{tika-core}} here: > https://github.com/apache/tika/tree/google-java-format > What would you think about making this change for 3.x? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4243) tika configuration overhaul
[ https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17853241#comment-17853241 ] Tim Allison commented on TIKA-4243: --- This is what the json currently looks like. {code:json} { "emitter": "fse", "fetchKey": "testPDFTwoTextBoxes.pdf", "fetcher": "fsf", "id": "myId", "onParseException": "emit", "parseContext": { "org.apache.tika.parser.pdf.PDFParserConfig": { "_class": "org.apache.tika.parser.pdf.PDFParserConfig", "accessChecker": { "_class": "org.apache.tika.parser.pdf.AccessChecker" }, "averageCharTolerance": 0.3, "catchIntermediateIOExceptions": true, "detectAngles": false, "dropThreshold": 2.5, "enableAutoSpace": true, "extractAcroFormContent": true, "extractActions": false, "extractAnnotationText": true, "extractBookmarksText": true, "extractFontNames": false, "extractIncrementalUpdateInfo": false, "extractInlineImages": false, "extractMarkedContent": false, "extractUniqueInlineImagesOnly": true, "ifXFAExtractOnlyXFA": false, "imageGraphicsEngineFactory": { "_class": "org.apache.tika.parser.pdf.image.ImageGraphicsEngineFactory" }, "imageStrategy": "NONE", "maxIncrementalUpdates": 10, "maxMainMemoryBytes": 536870912, "ocrDPI": 300, "ocrImageFormatName": "png", "ocrImageQuality": 1.0, "ocrImageType": "GRAY", "ocrRenderingStrategy": "ALL", "ocrStrategy": "AUTO", "parseIncrementalUpdates": false, "renderer": null, "setKCMS": false, "sortByPosition": true, "spacingTolerance": 0.5, "suppressDuplicateOverlappingText": false, "throwOnEncryptedPayload": false } } }{code} > tika configuration overhaul > --- > > Key: TIKA-4243 > URL: https://issues.apache.org/jira/browse/TIKA-4243 > Project: Tika > Issue Type: New Feature > Components: config >Affects Versions: 3.0.0 >Reporter: Nicholas DiPiazza >Priority: Major > Fix For: 3.0.0 > > > In 3.0.0 when dealing with Tika, it would greatly help to have a Typed > Configuration schema. 
> In 3.x can we remove the old way of doing configs and replace with Json > Schema? > Json Schema can be converted to Pojos using a maven plugin > [https://github.com/joelittlejohn/jsonschema2pojo] > This automatically creates a Java Pojo model we can use for the configs. > This can allow for the legacy tika-config XML to be read and converted to the > new pojos easily using an XML mapper so that users don't have to use JSON > configurations yet if they do not want. > When complete, configurations can be set as XML, JSON or YAML > tika-config.xml > tika-config.json > tika-config.yaml > Replace all instances of tika config annotations that used the old syntax, > and replace with the Pojo model serialized from the xml/json/yaml. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4243) tika configuration overhaul
[ https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17853240#comment-17853240 ] Tim Allison commented on TIKA-4243: --- I opened a PR with some cleanup, fixes and a new unit test that confirms that the PDFParserConfig actually works in the pipes endpoint in tika-server. > tika configuration overhaul > --- > > Key: TIKA-4243 > URL: https://issues.apache.org/jira/browse/TIKA-4243 > Project: Tika > Issue Type: New Feature > Components: config >Affects Versions: 3.0.0 >Reporter: Nicholas DiPiazza >Priority: Major > Fix For: 3.0.0 > > > In 3.0.0 when dealing with Tika, it would greatly help to have a Typed > Configuration schema. > In 3.x can we remove the old way of doing configs and replace with Json > Schema? > Json Schema can be converted to Pojos using a maven plugin > [https://github.com/joelittlejohn/jsonschema2pojo] > This automatically creates a Java Pojo model we can use for the configs. > This can allow for the legacy tika-config XML to be read and converted to the > new pojos easily using an XML mapper so that users don't have to use JSON > configurations yet if they do not want. > When complete, configurations can be set as XML, JSON or YAML > tika-config.xml > tika-config.json > tika-config.yaml > Replace all instances of tika config annotations that used the old syntax, > and replace with the Pojo model serialized from the xml/json/yaml. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (TIKA-4268) Use title for embedded resource path in embedded msg files
[ https://issues.apache.org/jira/browse/TIKA-4268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-4268. --- Fix Version/s: 3.0.0 Resolution: Fixed > Use title for embedded resource path in embedded msg files > -- > > Key: TIKA-4268 > URL: https://issues.apache.org/jira/browse/TIKA-4268 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Trivial > Fix For: 3.0.0 > > > If an msg file is embedded in an msg file, the embedded_resource_path > currently looks like this: {{/__substg1.0_3701000D.msg/attachment.docx}}. > A more human-friendly path would be: {{/Test Attachment.msg/attachment.docx}} > We should update our PST parser as well. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4251) [DISCUSS] move to cosium's git-code-format-maven-plugin with google-java-format
[ https://issues.apache.org/jira/browse/TIKA-4251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17853157#comment-17853157 ] Tim Allison commented on TIKA-4251: --- Unless there are any objections, I'll likely move forward with this early this coming week. If this will break any outstanding PRs or other work or if anyone thinks this is a bad idea, please let me know. This will only affect main/3.x. I am not going to back port this to 2.x. > [DISCUSS] move to cosium's git-code-format-maven-plugin with > google-java-format > --- > > Key: TIKA-4251 > URL: https://issues.apache.org/jira/browse/TIKA-4251 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > > I was recently working a bit on incubator-stormcrawler, and I noticed that > they are using cosium's git-code-format-maven-plugin: > https://github.com/Cosium/git-code-format-maven-plugin > I was initially annoyed that I couldn't quickly figure out what I had to fix > to make the linter happy, but then I realized there was a magic command: > {{mvn git-code-format:format-code}} which just fixed the code so that the > linter passed. > The one drawback I found is that it does not fix nor does it alert on > wildcard imports. We could still use checkstyle for that but only have one > rule for checkstyle. > The other drawback is that there is not a lot of room for variation from > google's style. This may actually be a benefit, too, of course. > I just ran this on {{tika-core}} here: > https://github.com/apache/tika/tree/google-java-format > What would you think about making this change for 3.x? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-4268) Use title for embedded resource path in embedded msg files
Tim Allison created TIKA-4268: - Summary: Use title for embedded resource path in embedded msg files Key: TIKA-4268 URL: https://issues.apache.org/jira/browse/TIKA-4268 Project: Tika Issue Type: Task Reporter: Tim Allison If an msg file is embedded in an msg file, the embedded_resource_path currently looks like this: {{/__substg1.0_3701000D.msg/attachment.docx}}. A more human-friendly path would be: {{/Test Attachment.msg/attachment.docx}} We should update our PST parser as well. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (TIKA-4243) tika configuration overhaul
[ https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17852876#comment-17852876 ] Tim Allison edited comment on TIKA-4243 at 6/6/24 5:39 PM: --- I think our joint recent PR on TIKA-4252 accomplishes the goals of this ticket. There's more work, but I think we can close this out. If we do want to head down the jsonschema route later, let's open a new ticket? was (Author: talli...@mitre.org): I think our joint recent PR on TIKA-4252 accomplishes the goals of this ticket. There's more work, but I think we can close this out. If we do want to head down the jsonschema root later, let's open a new ticket? > tika configuration overhaul > --- > > Key: TIKA-4243 > URL: https://issues.apache.org/jira/browse/TIKA-4243 > Project: Tika > Issue Type: New Feature > Components: config >Affects Versions: 3.0.0 >Reporter: Nicholas DiPiazza >Priority: Major > > In 3.0.0 when dealing with Tika, it would greatly help to have a Typed > Configuration schema. > In 3.x can we remove the old way of doing configs and replace with Json > Schema? > Json Schema can be converted to Pojos using a maven plugin > [https://github.com/joelittlejohn/jsonschema2pojo] > This automatically creates a Java Pojo model we can use for the configs. > This can allow for the legacy tika-config XML to be read and converted to the > new pojos easily using an XML mapper so that users don't have to use JSON > configurations yet if they do not want. > When complete, configurations can be set as XML, JSON or YAML > tika-config.xml > tika-config.json > tika-config.yaml > Replace all instances of tika config annotations that used the old syntax, > and replace with the Pojo model serialized from the xml/json/yaml. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4243) tika configuration overhaul
[ https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17852876#comment-17852876 ] Tim Allison commented on TIKA-4243: --- I think our joint recent PR on TIKA-4252 accomplishes the goals of this ticket. There's more work, but I think we can close this out. If we do want to head down the jsonschema route later, let's open a new ticket? > tika configuration overhaul > --- > > Key: TIKA-4243 > URL: https://issues.apache.org/jira/browse/TIKA-4243 > Project: Tika > Issue Type: New Feature > Components: config >Affects Versions: 3.0.0 >Reporter: Nicholas DiPiazza >Priority: Major > > In 3.0.0 when dealing with Tika, it would greatly help to have a Typed > Configuration schema. > In 3.x can we remove the old way of doing configs and replace with Json > Schema? > Json Schema can be converted to Pojos using a maven plugin > [https://github.com/joelittlejohn/jsonschema2pojo] > This automatically creates a Java Pojo model we can use for the configs. > This can allow for the legacy tika-config XML to be read and converted to the > new pojos easily using an XML mapper so that users don't have to use JSON > configurations yet if they do not want. > When complete, configurations can be set as XML, JSON or YAML > tika-config.xml > tika-config.json > tika-config.yaml > Replace all instances of tika config annotations that used the old syntax, > and replace with the Pojo model serialized from the xml/json/yaml. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?
[ https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17852874#comment-17852874 ] Tim Allison commented on TIKA-4252: --- K. I think we're at "good enough" here. [~ndipiazza], thank you and take it away! > PipesClient#process - seems to lose the Fetch input metadata? > - > > Key: TIKA-4252 > URL: https://issues.apache.org/jira/browse/TIKA-4252 > Project: Tika > Issue Type: Bug >Reporter: Nicholas DiPiazza >Priority: Major > Fix For: 3.0.0 > > > when calling: > PipesResult pipesResult = pipesClient.process(new > FetchEmitTuple(request.getFetchKey(), > new FetchKey(fetcher.getName(), request.getFetchKey()), > new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG, > FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP)); > the tikaMetadata is not present in the fetch data when the fetch method is > called. > > It's OK through this part: > UnsynchronizedByteArrayOutputStream bos = > UnsynchronizedByteArrayOutputStream.builder().get(); > try (ObjectOutputStream objectOutputStream = new > ObjectOutputStream(bos)) > { objectOutputStream.writeObject(t); } > byte[] bytes = bos.toByteArray(); > output.write(CALL.getByte()); > output.writeInt(bytes.length); > output.write(bytes); > output.flush(); > > i verified the bytes have the expected metadata from that point. > > UPDATE: found issue > > org.apache.tika.pipes.PipesServer#parseFromTuple > > is using a new Metadata when it should only use empty metadata if fetch tuple > metadata is null. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?
[ https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-4252. --- Resolution: Fixed > PipesClient#process - seems to lose the Fetch input metadata? > - > > Key: TIKA-4252 > URL: https://issues.apache.org/jira/browse/TIKA-4252 > Project: Tika > Issue Type: Bug >Reporter: Nicholas DiPiazza >Priority: Major > Fix For: 3.0.0 > > > when calling: > PipesResult pipesResult = pipesClient.process(new > FetchEmitTuple(request.getFetchKey(), > new FetchKey(fetcher.getName(), request.getFetchKey()), > new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG, > FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP)); > the tikaMetadata is not present in the fetch data when the fetch method is > called. > > It's OK through this part: > UnsynchronizedByteArrayOutputStream bos = > UnsynchronizedByteArrayOutputStream.builder().get(); > try (ObjectOutputStream objectOutputStream = new > ObjectOutputStream(bos)) > { objectOutputStream.writeObject(t); } > byte[] bytes = bos.toByteArray(); > output.write(CALL.getByte()); > output.writeInt(bytes.length); > output.write(bytes); > output.flush(); > > i verified the bytes have the expected metadata from that point. > > UPDATE: found issue > > org.apache.tika.pipes.PipesServer#parseFromTuple > > is using a new Metadata when it should only use empty metadata if fetch tuple > metadata is null. -- This message was sent by Atlassian Jira (v8.20.10#820010)
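The fix described in the TIKA-4252 update (use the tuple's metadata, and fall back to an empty one only when it is null) can be sketched as follows. This is an illustration under stated assumptions, not the actual PipesServer#parseFromTuple code: Map<String, String> stands in for Tika's Metadata class so the sketch compiles without Tika on the classpath, and the class name MetadataFallbackSketch, the method metadataForParse and the key "resourceName" are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class MetadataFallbackSketch {

    // Map<String, String> is a stand-in for Tika's Metadata (an assumption
    // made to keep the sketch self-contained).
    public static Map<String, String> metadataForParse(Map<String, String> tupleMetadata) {
        // Reuse the metadata supplied with the fetch tuple; create an empty
        // one only when the tuple carried none.
        return tupleMetadata != null ? tupleMetadata : new HashMap<>();
    }

    public static void main(String[] args) {
        Map<String, String> fromClient = new HashMap<>();
        fromClient.put("resourceName", "testPDFTwoTextBoxes.pdf"); // hypothetical key
        System.out.println(metadataForParse(fromClient)); // prints "{resourceName=testPDFTwoTextBoxes.pdf}"
        System.out.println(metadataForParse(null));       // prints "{}"
    }
}
```

The bug reported in the ticket corresponds to always taking the second branch, which silently discards whatever the client sent.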
[jira] [Commented] (TIKA-4243) tika configuration overhaul
[ https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17852808#comment-17852808 ] Tim Allison commented on TIKA-4243: --- Oh, and documentation, lots of documentation. :LOL: > tika configuration overhaul > --- > > Key: TIKA-4243 > URL: https://issues.apache.org/jira/browse/TIKA-4243 > Project: Tika > Issue Type: New Feature > Components: config >Affects Versions: 3.0.0 >Reporter: Nicholas DiPiazza >Priority: Major > > In 3.0.0 when dealing with Tika, it would greatly help to have a Typed > Configuration schema. > In 3.x can we remove the old way of doing configs and replace with Json > Schema? > Json Schema can be converted to Pojos using a maven plugin > [https://github.com/joelittlejohn/jsonschema2pojo] > This automatically creates a Java Pojo model we can use for the configs. > This can allow for the legacy tika-config XML to be read and converted to the > new pojos easily using an XML mapper so that users don't have to use JSON > configurations yet if they do not want. > When complete, configurations can be set as XML, JSON or YAML > tika-config.xml > tika-config.json > tika-config.yaml > Replace all instances of tika config annotations that used the old syntax, > and replace with the Pojo model serialized from the xml/json/yaml. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (TIKA-4243) tika configuration overhaul
[ https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17852804#comment-17852804 ] Tim Allison edited comment on TIKA-4243 at 6/6/24 2:11 PM: --- Current status on TIKA-4243 branch -- works up through and including tika-app Still need: * better job of handling lists and maps as parameters and types. * test tika-server pipes/ and async/ endpoints * more unit tests in new serialization stuff Ongoing needs: * modify config objects so that they work with the serialization methods was (Author: talli...@mitre.org): Current status on TIKA-4243 -- works up through and including tika-app Still need: * better job of handling lists and maps as parameters and types. * test tika-server pipes/ and async/ endpoints * more unit tests in new serialization stuff Ongoing needs: * modify config objects so that they work with the serialization methods > tika configuration overhaul > --- > > Key: TIKA-4243 > URL: https://issues.apache.org/jira/browse/TIKA-4243 > Project: Tika > Issue Type: New Feature > Components: config >Affects Versions: 3.0.0 >Reporter: Nicholas DiPiazza >Priority: Major > > In 3.0.0 when dealing with Tika, it would greatly help to have a Typed > Configuration schema. > In 3.x can we remove the old way of doing configs and replace with Json > Schema? > Json Schema can be converted to Pojos using a maven plugin > [https://github.com/joelittlejohn/jsonschema2pojo] > This automatically creates a Java Pojo model we can use for the configs. > This can allow for the legacy tika-config XML to be read and converted to the > new pojos easily using an XML mapper so that users don't have to use JSON > configurations yet if they do not want. 
> When complete, configurations can be set as XML, JSON or YAML > tika-config.xml > tika-config.json > tika-config.yaml > Replace all instances of tika config annotations that used the old syntax, > and replace with the Pojo model serialized from the xml/json/yaml. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4243) tika configuration overhaul
[ https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17852804#comment-17852804 ] Tim Allison commented on TIKA-4243: --- Current status on TIKA-4243 -- works up through and including tika-app Still need: * better job of handling lists and maps as parameters and types. * test tika-server pipes/ and async/ endpoints * more unit tests in new serialization stuff Ongoing needs: * modify config objects so that they work with the serialization methods > tika configuration overhaul > --- > > Key: TIKA-4243 > URL: https://issues.apache.org/jira/browse/TIKA-4243 > Project: Tika > Issue Type: New Feature > Components: config >Affects Versions: 3.0.0 >Reporter: Nicholas DiPiazza >Priority: Major > > In 3.0.0 when dealing with Tika, it would greatly help to have a Typed > Configuration schema. > In 3.x can we remove the old way of doing configs and replace with Json > Schema? > Json Schema can be converted to Pojos using a maven plugin > [https://github.com/joelittlejohn/jsonschema2pojo] > This automatically creates a Java Pojo model we can use for the configs. > This can allow for the legacy tika-config XML to be read and converted to the > new pojos easily using an XML mapper so that users don't have to use JSON > configurations yet if they do not want. > When complete, configurations can be set as XML, JSON or YAML > tika-config.xml > tika-config.json > tika-config.yaml > Replace all instances of tika config annotations that used the old syntax, > and replace with the Pojo model serialized from the xml/json/yaml. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4243) tika configuration overhaul
[ https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17852098#comment-17852098 ] Tim Allison commented on TIKA-4243: --- Let me know if there are any objections to heading in this direction.
[jira] [Commented] (TIKA-4243) tika configuration overhaul
[ https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17852097#comment-17852097 ] Tim Allison commented on TIKA-4243: --- K, I chatted briefly with [~ndipiazza] this morning. Unless there are objections, we think the simplest way forward is to build our own serialization for now. It won't be much work given the stuff I already did for xml...famous last words. This will keep Jackson annotations out of tika-core. This will get us to 3.x, and then [~ndipiazza] can refactor with the jsonschema stuff later.
[jira] [Comment Edited] (TIKA-4243) tika configuration overhaul
[ https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17851727#comment-17851727 ]

Tim Allison edited comment on TIKA-4243 at 6/3/24 5:10 PM:
-----------------------------------------------------------

I spent a bit of time trying to serialize ParseContext, and I now remember/newly appreciate what a challenge that is.

For one, everything has to be serializable, as we knew, with Jackson annotations or other Jackson-based methods. I suspect that someone who really understands Jackson could do a good job of this. I know the basics, but I am not a Jackson expert.

There are two main challenges: inheritance and embedded objects (as opposed to parameterizable primitives).

Inheritance is complicated with Jackson. If we want to support, for example, {{parseContext.set(Parser.class, new EmptyParser())}}, we have to store the base class name as the key together with the instantiated class. I think I found out how to do this with Jackson, but it is _messy_ (reference: [https://www.baeldung.com/jackson-inheritance#bd-subtype-handling-scenarios]).

We'd want to deal with embedded objects for the obvious use cases of the CompositeDetectors, etc., where we want to specify a list of detectors. And we want to be able to cover the cases of setting an object as a parameter (for example, setting some of the slightly more complex classes in the PDFParserConfig).

I'm wondering if it would be simpler to back off to a Map properties kind of thing where we identify the config class and then instantiate it for the ParseContext with the "properties". We're currently doing something like this in tika-server, where we have custom serialization classes for each config we support (PDFParserConfig and TesseractOCRParserConfig) based on the http-headers. We'd want to extend this to handle inheritance and embedded objects...

Something along these lines in json:

{code:json}
{
  "settings": {
    "org.apache.tika.parser.pdf.PDFParserConfig.class": {
      "ocrDPI": 300,
      "sortByPosition": true,
      "org.apache.tika.parser.pdf.image.ImageGraphicsEngineFactory.class": {
        "_class": "com.tika.custom.OurCompanysFactory",
        "speed": "blazing",
        "dpi": 1000
      }
    },
    "org.apache.tika.parser.Parser": {
      "_class": "org.apache.tika.parser.EmptyParser"
    }
  }
}
{code}

Then we'd have a small bit of code (I'd hope?) that would take the settings and create the config class with the map of its values:

PDFParserConfig pdfParserConfig = new PDFParserConfig(Map)

*What I don't like about this is that we're back in the game of creating our own serialization framework. :(*
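To make the "small bit of code" idea in the comment above a little more concrete, here is a minimal sketch of binding a map of properties onto a config bean through its setters. Everything in it is an assumption for illustration: DemoPdfConfig is a hypothetical stand-in rather than Tika's PDFParserConfig, SettingsBinder is not an existing Tika class, and the settings are assumed to have already been parsed into a Map by some JSON library. It does not handle the nested "_class" entries from the JSON above.

```java
import java.lang.reflect.Method;
import java.util.HashMap;
import java.util.Map;

/** Demo entry point; all class names in this sketch are hypothetical. */
class SettingsBinderDemo {
    public static void main(String[] args) throws Exception {
        // Stands in for the already-parsed "settings" entry of one config class.
        Map<String, Object> props = new HashMap<>();
        props.put("ocrDPI", 300);
        props.put("sortByPosition", true);

        DemoPdfConfig config = SettingsBinder.bind(DemoPdfConfig.class, props);
        System.out.println(config.getOcrDPI() + " " + config.isSortByPosition());
    }
}

/** Hypothetical config bean; NOT Tika's PDFParserConfig. */
class DemoPdfConfig {
    private int ocrDPI = 72;
    private boolean sortByPosition;

    public void setOcrDPI(int ocrDPI) { this.ocrDPI = ocrDPI; }
    public void setSortByPosition(boolean sortByPosition) { this.sortByPosition = sortByPosition; }
    public int getOcrDPI() { return ocrDPI; }
    public boolean isSortByPosition() { return sortByPosition; }
}

/** Hypothetical helper that maps each property name to a one-argument setter. */
final class SettingsBinder {
    private SettingsBinder() {}

    static <T> T bind(Class<T> type, Map<String, Object> props) throws ReflectiveOperationException {
        T target = type.getDeclaredConstructor().newInstance();
        for (Map.Entry<String, Object> entry : props.entrySet()) {
            String key = entry.getKey();
            String setterName = "set" + Character.toUpperCase(key.charAt(0)) + key.substring(1);
            Method setter = findSetter(type, setterName);
            if (setter == null) {
                // Fail loudly on unknown keys instead of silently ignoring typos.
                throw new IllegalArgumentException("No setter for '" + key + "' on " + type.getName());
            }
            // Reflection unboxes Integer/Boolean for primitive parameters.
            setter.invoke(target, entry.getValue());
        }
        return target;
    }

    private static Method findSetter(Class<?> type, String name) {
        for (Method m : type.getMethods()) {
            if (m.getName().equals(name) && m.getParameterCount() == 1) {
                return m;
            }
        }
        return null;
    }
}
```

Rejecting unknown keys is a design choice of this sketch; a real implementation would also need type coercion, lists and maps, and a way to resolve the embedded objects discussed in the comment.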
[jira] [Resolved] (TIKA-4260) Add parse context to the fetcher interface in 3.x
[ https://issues.apache.org/jira/browse/TIKA-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison resolved TIKA-4260.
-------------------------------
Resolution: Duplicate

Turns out this is a duplicate. Onwards to TIKA-4243!

> Add parse context to the fetcher interface in 3.x
> -------------------------------------------------
>
>                 Key: TIKA-4260
>                 URL: https://issues.apache.org/jira/browse/TIKA-4260
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Minor
[jira] [Created] (TIKA-4266) Improve multithreading and the xml parser pools in XMLUtils
Tim Allison created TIKA-4266:
---------------------------------

             Summary: Improve multithreading and the xml parser pools in XMLUtils
                 Key: TIKA-4266
                 URL: https://issues.apache.org/jira/browse/TIKA-4266
             Project: Tika
          Issue Type: Task
            Reporter: Tim Allison

I recently came across a build failure when running maven multithreaded ({{-T10}}). The cause was a race condition/thread contention in XMLUtils. I don't _think_ anyone has seen this in the wild, but we should improve the thread-safety of XMLUtils.
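As a rough illustration of what a thread-safe parser pool can look like, the sketch below keeps a fixed number of DocumentBuilder instances in a blocking queue so that no two threads ever share a builder. It is not the XMLUtils code from the issue, and BuilderPool, rootName and the pool size are hypothetical names and values chosen for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Demo entry point: 8 worker threads share a pool of 2 builders. */
class BuilderPoolDemo {
    public static void main(String[] args) throws Exception {
        BuilderPool pool = new BuilderPool(2);
        AtomicInteger parsed = new AtomicInteger();
        ExecutorService workers = Executors.newFixedThreadPool(8);
        for (int i = 0; i < 100; i++) {
            workers.submit(() -> {
                if ("doc".equals(pool.rootName("<doc><p>hi</p></doc>"))) {
                    parsed.incrementAndGet();
                }
                return null; // Callable, so checked exceptions are allowed
            });
        }
        workers.shutdown();
        workers.awaitTermination(30, TimeUnit.SECONDS);
        System.out.println(parsed.get() + " parsed, " + pool.available() + " builders idle");
    }
}

/** Hypothetical bounded pool; a builder is owned by one thread at a time. */
final class BuilderPool {
    private final BlockingQueue<DocumentBuilder> idle;

    BuilderPool(int size) throws ParserConfigurationException {
        idle = new ArrayBlockingQueue<>(size);
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        for (int i = 0; i < size; i++) {
            idle.add(factory.newDocumentBuilder());
        }
    }

    /** Parses the XML and returns the root element name. */
    String rootName(String xml) throws InterruptedException, IOException, SAXException {
        DocumentBuilder builder = idle.take(); // blocks until a builder is free
        try {
            return builder.parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement().getNodeName();
        } finally {
            release(builder); // always hand the builder back, even on failure
        }
    }

    private void release(DocumentBuilder builder) {
        try {
            builder.reset();
        } catch (UnsupportedOperationException e) {
            // Some implementations do not support reset(); reuse the builder as-is.
        }
        idle.add(builder);
    }

    int available() {
        return idle.size();
    }
}
```

The blocking take() trades throughput for safety: callers wait when the pool is empty rather than creating extra parsers. Whether that trade-off suits Tika is a separate question that the issue leaves open.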
[jira] [Resolved] (TIKA-4221) Regression in pack200 parsing in commons-compress
[ https://issues.apache.org/jira/browse/TIKA-4221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-4221. --- Fix Version/s: 3.0.0 2.9.3 Resolution: Fixed Many thanks to [~ggregory] and {{commons-compress}}! > Regression in pack200 parsing in commons-compress > - > > Key: TIKA-4221 > URL: https://issues.apache.org/jira/browse/TIKA-4221 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > Fix For: 3.0.0, 2.9.3 > > > There's a regression in pack200 that leads to the InputStream being closed > even if wrapped in a CloseShieldInputStream. > This was the original signal that something was wrong, but the real problem > is in pack200, not xz. > We noticed ~10 xz files with fewer attachments in the recent regression tests > in prep for the 2.9.2 release. This is 10 out of ~4500. So, it's a problem, > but not a blocker (IMHO). > The stacktrace from > {{https://corpora.tika.apache.org/base/docs/commoncrawl3/YE/YEPTQ2CBI7BJ26PPVBTKZIALFSUQFDZH}} > looks like this: > 3: X-TIKA:EXCEPTION:embedded_exception : > org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from > org.apache.tika.parser.DefaultParser@56a4479a > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:304) > at > org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203) > at > org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:152) > at > org.apache.tika.parser.RecursiveParserWrapper$EmbeddedParserDecorator.parse(RecursiveParserWrapper.java:259) > at > org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:71) > at > org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:109) > at > org.apache.tika.parser.pkg.CompressorParser.parse(CompressorParser.java:229) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) 
> at > org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203) > at > org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:164) > at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:446) > at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:436) > at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:424) > at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:418) > at > org.apache.tika.parser.AutoDetectParserTest.oneOff(AutoDetectParserTest.java:563) > ... > Caused by: org.tukaani.xz.XZIOException: Stream closed > at org.tukaani.xz.SingleXZInputStream.available(Unknown Source) > at > org.apache.commons.compress.compressors.xz.XZCompressorInputStream.available(XZCompressorInputStream.java:115) > at java.io.FilterInputStream.available(FilterInputStream.java:168) > at > org.apache.commons.io.input.ProxyInputStream.available(ProxyInputStream.java:84) > at java.io.BufferedInputStream.available(BufferedInputStream.java:410) > at java.io.FilterInputStream.available(FilterInputStream.java:168) > at > org.apache.commons.io.input.ProxyInputStream.available(ProxyInputStream.java:84) > at java.io.FilterInputStream.available(FilterInputStream.java:168) > at > org.apache.commons.io.input.ProxyInputStream.available(ProxyInputStream.java:84) > at > org.apache.commons.compress.archivers.tar.TarArchiveInputStream.skipRecordPadding(TarArchiveInputStream.java:800) > at > org.apache.commons.compress.archivers.tar.TarArchiveInputStream.getNextTarEntry(TarArchiveInputStream.java:412) > at > org.apache.commons.compress.archivers.tar.TarArchiveInputStream.getNextEntry(TarArchiveInputStream.java:389) > at > org.apache.commons.compress.archivers.tar.TarArchiveInputStream.getNextEntry(TarArchiveInputStream.java:49) > at > org.apache.tika.parser.pkg.PackageParser.parseEntries(PackageParser.java:389) > at > org.apache.tika.parser.pkg.PackageParser.parse(PackageParser.java:329) > at > 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) > ... 85 more -- This message was sent by Atlassian Jira (v8.20.10#820010)
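For readers unfamiliar with the contract this ticket refers to, here is a minimal sketch of what a close shield is expected to guarantee: a consumer may call close() on the wrapper, but the caller's underlying stream stays open. It uses only the JDK. `NoCloseInputStream`, `TrackingInputStream`, and `readAndClose` are hypothetical names made up for illustration; this is not the commons-io `CloseShieldInputStream` implementation and not the pack200 code path discussed above.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

public class CloseShieldSketch {

    /** Hypothetical stand-in for a close shield: forwards reads, ignores close(). */
    static final class NoCloseInputStream extends FilterInputStream {
        NoCloseInputStream(InputStream in) {
            super(in);
        }

        @Override
        public void close() {
            // Intentionally do not propagate close() to the wrapped stream.
        }
    }

    /** Records whether close() ever reached the underlying stream. */
    static final class TrackingInputStream extends ByteArrayInputStream {
        boolean closed = false;

        TrackingInputStream(byte[] data) {
            super(data);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    /** A consumer that closes whatever stream it is handed (assumed misbehavior). */
    static int readAndClose(InputStream in) throws IOException {
        try (InputStream s = in) {
            return s.read();
        }
    }

    /** Returns true if the caller's stream is still open and readable afterwards. */
    static boolean callerStreamSurvives() {
        TrackingInputStream raw = new TrackingInputStream(new byte[] {1, 2, 3});
        try {
            readAndClose(new NoCloseInputStream(raw));
            // The consumer read one byte; the next byte must still be available.
            return !raw.closed && raw.read() == 2;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("caller stream survives: " + callerStreamSurvives());
    }
}
```

The regression described in this ticket is a case where that guarantee did not hold: the stream ended up closed even though it had been wrapped.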
[jira] [Resolved] (TIKA-4220) Commons-compress too lenient on headless tar detection
[ https://issues.apache.org/jira/browse/TIKA-4220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-4220. --- Fix Version/s: 3.0.0 2.9.3 Resolution: Fixed Many thanks to [~ggregory] and {{commons-compress}}! > Commons-compress too lenient on headless tar detection > -- > > Key: TIKA-4220 > URL: https://issues.apache.org/jira/browse/TIKA-4220 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Minor > Fix For: 3.0.0, 2.9.3 > > > On recent regression tests on TIKA-4218, we noticed a fairly major change > with an increased rate of false positives on headless tar detection from > commons-compress. > I think for now we should copy/paste/fork the headless tar detection and > improve it/revert it or possibly remove it for our 2.9.2 release. > On this ticket, I'll look into what changed recently in headless tar > detection in commons-compress and experiment with fixing it. > One challenge is that our magic bytes detection happens _after_ our custom > detectors, which means that we can't put a low confidence on what comes out > of our custom detectors and let the magic detection fix it. We could > implement an x-tar special case, but I really don't like that. > Let's see what we can do... > The numbers below represent the number of files identified as A (in tika > 2.9.1) -> B (in tika-2.9.2-pre-rc1). 
> application/octet-stream -> application/x-tar 826
> multipart/appledouble -> application/x-tar 701
> image/x-tga -> application/x-tar 322
> image/vnd.microsoft.icon -> application/x-tar 312
> application/vnd.iccprofile -> application/x-tar 221
> video/mp4 -> application/x-tar 177
> audio/mpeg -> application/x-tar 59
> video/x-m4v -> application/x-tar 59
> application/x-font-printer-metric -> application/x-tar 36
> audio/mp4 -> application/x-tar 25
> application/x-tex-tfm -> application/x-tar 18
> image/x-pict -> application/x-tar 15
> image/png -> application/x-tar 8
> text/plain; charset=ISO-8859-1 -> application/x-tar 8
> application/x-endnote-style -> application/x-tar 7
> application/x-font-ttf -> application/x-tar 6
-- This message was sent by Atlassian Jira (v8.20.10#820010)
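To make "headless tar detection" concrete: a tar header carries no leading magic bytes, so a detector has to rely on internal consistency of the first 512-byte record. The sketch below checks the standard ustar header checksum (an octal number stored at offset 148, equal to the sum of all header bytes with the checksum field itself counted as eight spaces). This illustrates the general technique only; it is an assumption that a detector would use this check, and it is not the commons-compress or Tika code. All class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class TarChecksumSketch {
    static final int RECORD = 512;
    static final int CHKSUM_OFFSET = 148;
    static final int CHKSUM_LEN = 8;

    /** Sums all header bytes as unsigned values, counting the checksum field as spaces. */
    static long computeChecksum(byte[] header) {
        long sum = 0;
        for (int i = 0; i < RECORD; i++) {
            boolean inField = i >= CHKSUM_OFFSET && i < CHKSUM_OFFSET + CHKSUM_LEN;
            sum += inField ? ' ' : (header[i] & 0xFF);
        }
        return sum;
    }

    /** Parses the stored octal checksum; returns -1 if the field holds no octal digits. */
    static long storedChecksum(byte[] header) {
        int i = CHKSUM_OFFSET;
        int end = CHKSUM_OFFSET + CHKSUM_LEN;
        while (i < end && header[i] == ' ') {
            i++; // skip leading spaces
        }
        long value = 0;
        int digits = 0;
        while (i < end && header[i] >= '0' && header[i] <= '7') {
            value = value * 8 + (header[i] - '0');
            i++;
            digits++;
        }
        return digits == 0 ? -1 : value;
    }

    /** True if the first record's stored checksum matches the computed one. */
    static boolean looksLikeTarHeader(byte[] block) {
        return block.length >= RECORD && storedChecksum(block) == computeChecksum(block);
    }

    /** Builds a minimal header (name plus checksum only) for demonstration purposes. */
    static byte[] buildHeader(String name) {
        byte[] h = new byte[RECORD];
        byte[] n = name.getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(n, 0, h, 0, Math.min(n.length, 100));
        // Six octal digits, then NUL and space, as commonly written by tar tools.
        byte[] f = String.format("%06o", computeChecksum(h)).getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(f, 0, h, CHKSUM_OFFSET, f.length);
        h[CHKSUM_OFFSET + 6] = 0;
        h[CHKSUM_OFFSET + 7] = ' ';
        return h;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeTarHeader(buildHeader("a.txt"))); // valid header
        System.out.println(looksLikeTarHeader(new byte[RECORD]));     // all-zero block
    }
}
```

A check like this is a fairly weak signal on its own, which is one plausible way a lenient detector could end up labelling files of other types as application/x-tar.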
[jira] [Commented] (TIKA-4265) Consider adding maven build cache extension
[ https://issues.apache.org/jira/browse/TIKA-4265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17850776#comment-17850776 ] Tim Allison commented on TIKA-4265: --- It doesn't help at all if there's a modification in tika-core, even in a unit test, but I think this will be quite helpful when working on other modules, especially those not so early in the build tree. > Consider adding maven build cache extension > --- > > Key: TIKA-4265 > URL: https://issues.apache.org/jira/browse/TIKA-4265 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > > This would be for 3.x. It wouldn't speed up ci/cd, but it may help with > local builds. It requires maven >= 3.9 > Has anyone used it? Think it will be a good fit for Tika? > https://maven.apache.org/extensions/maven-build-cache-extension/ -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4265) Consider adding maven build cache extension
[ https://issues.apache.org/jira/browse/TIKA-4265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17850773#comment-17850773 ] Tim Allison commented on TIKA-4265: --- I just pushed a demo to {{build-cache}}. This includes documentation in the README.md on how to turn off the build cache > Consider adding maven build cache extension > --- > > Key: TIKA-4265 > URL: https://issues.apache.org/jira/browse/TIKA-4265 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > > This would be for 3.x. It wouldn't speed up ci/cd, but it may help with > local builds. It requires maven >= 3.9 > Has anyone used it? Think it will be a good fit for Tika? > https://maven.apache.org/extensions/maven-build-cache-extension/ -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-4265) Consider adding maven build cache extension
Tim Allison created TIKA-4265: - Summary: Consider adding maven build cache extension Key: TIKA-4265 URL: https://issues.apache.org/jira/browse/TIKA-4265 Project: Tika Issue Type: Task Reporter: Tim Allison This would be for 3.x. It wouldn't speed up CI/CD, but it may help with local builds. It requires Maven >= 3.9. Has anyone used it? Do you think it will be a good fit for Tika? https://maven.apache.org/extensions/maven-build-cache-extension/ -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-4261) Add attachment type metadata filter
Tim Allison created TIKA-4261: - Summary: Add attachment type metadata filter Key: TIKA-4261 URL: https://issues.apache.org/jira/browse/TIKA-4261 Project: Tika Issue Type: Task Reporter: Tim Allison Some users of the /rmeta endpoint or the -J option in tika-app inline OCR'd content, and they have no need for the separate metadata object for each inlined image. Let's add a metadata filter to remove these metadata objects. The default behavior will be as before: everything is included. Users need to configure the filter to remove these inline objects. -- This message was sent by Atlassian Jira (v8.20.10#820010)
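A rough sketch of the proposed behavior, using plain maps in place of Tika's Metadata objects: with the filter off (the default) every metadata object is kept, and with it on, objects marked as inline are dropped. The key name `embeddedResourceType`, the value `INLINE`, and the class and method names are assumptions for illustration only; they are not taken from the ticket and may not match the filter that was eventually implemented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class InlineAttachmentFilterSketch {
    // Assumed key and value, for illustration only.
    static final String TYPE_KEY = "embeddedResourceType";
    static final String INLINE = "INLINE";

    /**
     * Returns the metadata objects to emit. When dropInline is false (the default
     * behavior described in the ticket), the list is returned unchanged.
     */
    static List<Map<String, String>> filter(List<Map<String, String>> metadataList,
                                            boolean dropInline) {
        if (!dropInline) {
            return metadataList;
        }
        List<Map<String, String>> kept = new ArrayList<>();
        for (Map<String, String> metadata : metadataList) {
            // Keep the container and true attachments; skip inlined objects.
            if (!INLINE.equals(metadata.get(TYPE_KEY))) {
                kept.add(metadata);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Map<String, String>> docs = List.of(
                Map.of("name", "container.docx"),
                Map.of(TYPE_KEY, INLINE, "name", "image1.png"),
                Map.of(TYPE_KEY, "ATTACHMENT", "name", "report.pdf"));
        System.out.println(filter(docs, false).size());
        System.out.println(filter(docs, true).size());
    }
}
```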
[jira] [Resolved] (TIKA-4259) Decouple xml parser stuff from ParseContext
[ https://issues.apache.org/jira/browse/TIKA-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-4259. --- Fix Version/s: 3.0.0 Resolution: Fixed > Decouple xml parser stuff from ParseContext > --- > > Key: TIKA-4259 > URL: https://issues.apache.org/jira/browse/TIKA-4259 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Trivial > Fix For: 3.0.0 > > > ParseContext has some xmlreader convenience methods. We should move those to > XMLReaderUtils in 3.x to simplify ParseContext's api. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4260) Add parse context to the fetcher interface in 3.x
[ https://issues.apache.org/jira/browse/TIKA-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17849298#comment-17849298 ] Tim Allison commented on TIKA-4260: --- That PR currently only works on tika-core. More needs to be done before we can merge this if this is the direction we'd like to go. > Add parse context to the fetcher interface in 3.x > - > > Key: TIKA-4260 > URL: https://issues.apache.org/jira/browse/TIKA-4260 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4243) tika configuration overhaul
[ https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17849288#comment-17849288 ] Tim Allison commented on TIKA-4243: --- [~ndipiazza], I added parseContext to fetchers and emitters on the TIKA-4260 branch. That might be a good start for serializing the ParseContext? All, let me know what you think. > tika configuration overhaul > --- > > Key: TIKA-4243 > URL: https://issues.apache.org/jira/browse/TIKA-4243 > Project: Tika > Issue Type: New Feature > Components: config >Affects Versions: 3.0.0 >Reporter: Nicholas DiPiazza >Priority: Major > > In 3.0.0 when dealing with Tika, it would greatly help to have a Typed > Configuration schema. > In 3.x can we remove the old way of doing configs and replace with Json > Schema? > Json Schema can be converted to Pojos using a maven plugin > [https://github.com/joelittlejohn/jsonschema2pojo] > This automatically creates a Java Pojo model we can use for the configs. > This can allow for the legacy tika-config XML to be read and converted to the > new pojos easily using an XML mapper so that users don't have to use JSON > configurations yet if they do not want. > When complete, configurations can be set as XML, JSON or YAML > tika-config.xml > tika-config.json > tika-config.yaml > Replace all instances of tika config annotations that used the old syntax, > and replace with the Pojo model serialized from the xml/json/yaml. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (TIKA-4243) tika configuration overhaul
[ https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17849103#comment-17849103 ] Tim Allison edited comment on TIKA-4243 at 5/24/24 1:00 PM:
Proposed basic roadmap:
1) Add parseContext to fetchers and emitters (and pipesReporter?)
2) Serialize ParseContext as is...
3) Allow for serialization of current XConfigs, e.g. PDFParserConfig, etc.
4) Add creation of parsers with e.g. new PDFParser(ParseContext context).
5) Wire config stuff into tika-server, tika-pipes, tika-app
6) Merge tika-grpc-server with new config options
This would require serialization of classes that users want to be able to configure + serialization. This would allow us to get rid of all of our custom serialization stuff for Tika 4.x.
was (Author: talli...@mitre.org):
Proposed basic roadmap:
1) Serialize ParseContext as is...
2) Allow for serialization of current XConfigs, e.g. PDFParserConfig, etc.
3) Add creation of parsers with e.g. new PDFParser(ParseContext context).
4) Wire config stuff into tika-server, tika-pipes, tika-app
5) Merge tika-grpc-server with new config options
This would require serialization of classes that users want to be able to configure + serialization. This would allow us to get rid of all of our custom serialization stuff for Tika 4.x.
> tika configuration overhaul > --- > > Key: TIKA-4243 > URL: https://issues.apache.org/jira/browse/TIKA-4243 > Project: Tika > Issue Type: New Feature > Components: config >Affects Versions: 3.0.0 >Reporter: Nicholas DiPiazza >Priority: Major > > In 3.0.0 when dealing with Tika, it would greatly help to have a Typed > Configuration schema. > In 3.x can we remove the old way of doing configs and replace with Json > Schema? > Json Schema can be converted to Pojos using a maven plugin > [https://github.com/joelittlejohn/jsonschema2pojo] > This automatically creates a Java Pojo model we can use for the configs. 
> This can allow for the legacy tika-config XML to be read and converted to the > new pojos easily using an XML mapper so that users don't have to use JSON > configurations yet if they do not want. > When complete, configurations can be set as XML, JSON or YAML > tika-config.xml > tika-config.json > tika-config.yaml > Replace all instances of tika config annotations that used the old syntax, > and replace with the Pojo model serialized from the xml/json/yaml. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-4260) Add parse context to the fetcher interface in 3.x
Tim Allison created TIKA-4260: - Summary: Add parse context to the fetcher interface in 3.x Key: TIKA-4260 URL: https://issues.apache.org/jira/browse/TIKA-4260 Project: Tika Issue Type: Task Reporter: Tim Allison -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-4259) Decouple xml parser stuff from ParseContext
Tim Allison created TIKA-4259: - Summary: Decouple xml parser stuff from ParseContext Key: TIKA-4259 URL: https://issues.apache.org/jira/browse/TIKA-4259 Project: Tika Issue Type: Task Reporter: Tim Allison ParseContext has some xmlreader convenience methods. We should move those to XMLReaderUtils in 3.x to simplify ParseContext's api. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4243) tika configuration overhaul
[ https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17849114#comment-17849114 ] Tim Allison commented on TIKA-4243: --- I'm going to start working on PRs that will be generally helpful for the above, but they'll still be useful if we all choose a different direction. I'll hold off on the core work for a bit in case there are objections or better ways forward. > tika configuration overhaul > --- > > Key: TIKA-4243 > URL: https://issues.apache.org/jira/browse/TIKA-4243 > Project: Tika > Issue Type: New Feature > Components: config >Affects Versions: 3.0.0 >Reporter: Nicholas DiPiazza >Priority: Major > > In 3.0.0 when dealing with Tika, it would greatly help to have a Typed > Configuration schema. > In 3.x can we remove the old way of doing configs and replace with Json > Schema? > Json Schema can be converted to Pojos using a maven plugin > [https://github.com/joelittlejohn/jsonschema2pojo] > This automatically creates a Java Pojo model we can use for the configs. > This can allow for the legacy tika-config XML to be read and converted to the > new pojos easily using an XML mapper so that users don't have to use JSON > configurations yet if they do not want. > When complete, configurations can be set as XML, JSON or YAML > tika-config.xml > tika-config.json > tika-config.yaml > Replace all instances of tika config annotations that used the old syntax, > and replace with the Pojo model serialized from the xml/json/yaml. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4243) tika configuration overhaul
[ https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17849108#comment-17849108 ] Tim Allison commented on TIKA-4243: --- The downsides we see: a) if there's agreement to add jackson-annotations to tika-core, we add a few kb to tika-core. b) we're at risk of having jackson-annotations sprinkled throughout our codebase on the XConfig classes, but this is basically where we have our own @Field annotations now. So break even? c) Customized classes that need to be passed via the ParseContext will need to be serializable to be used in tika-server, tika-pipes, etc., i.e. anything that allows for configuration. > tika configuration overhaul > --- > > Key: TIKA-4243 > URL: https://issues.apache.org/jira/browse/TIKA-4243 > Project: Tika > Issue Type: New Feature > Components: config >Affects Versions: 3.0.0 >Reporter: Nicholas DiPiazza >Priority: Major > > In 3.0.0 when dealing with Tika, it would greatly help to have a Typed > Configuration schema. > In 3.x can we remove the old way of doing configs and replace with Json > Schema? > Json Schema can be converted to Pojos using a maven plugin > [https://github.com/joelittlejohn/jsonschema2pojo] > This automatically creates a Java Pojo model we can use for the configs. > This can allow for the legacy tika-config XML to be read and converted to the > new pojos easily using an XML mapper so that users don't have to use JSON > configurations yet if they do not want. > When complete, configurations can be set as XML, JSON or YAML > tika-config.xml > tika-config.json > tika-config.yaml > Replace all instances of tika config annotations that used the old syntax, > and replace with the Pojo model serialized from the xml/json/yaml. -- This message was sent by Atlassian Jira (v8.20.10#820010)