[jira] [Commented] (TIKA-3948) Migrate to jakarta in Tika 3.x

2023-09-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17765907#comment-17765907
 ] 

ASF GitHub Bot commented on TIKA-3948:
--

solomax commented on PR #1345:
URL: https://github.com/apache/tika/pull/1345#issuecomment-1722134821

   @tballison I believe the warnings from `shade:3.5.0:shade` have to be addressed 
before merge.
   
   There are lots of them:
   
   ```
   [WARNING] log4j-slf4j2-impl-2.20.0.jar, tika-async-cli-3.0.0-SNAPSHOT.jar define 13 overlapping classes and resources: 
   [WARNING]   - META-INF/maven/org.apache.logging.log4j/log4j-slf4j2-impl/pom.properties
   [WARNING]   - META-INF/maven/org.apache.logging.log4j/log4j-slf4j2-impl/pom.xml
   [WARNING]   - org.apache.logging.slf4j.Log4jEventBuilder
   [WARNING]   - org.apache.logging.slf4j.Log4jLogger
   [WARNING]   - org.apache.logging.slf4j.Log4jLoggerFactory
   [WARNING]   - org.apache.logging.slf4j.Log4jMDCAdapter
   [WARNING]   - org.apache.logging.slf4j.Log4jMDCAdapter$1
   [WARNING]   - org.apache.logging.slf4j.Log4jMDCAdapter$ThreadLocalMapOfStacks
   [WARNING]   - org.apache.logging.slf4j.Log4jMarker
   [WARNING]   - org.apache.logging.slf4j.Log4jMarkerFactory
   [WARNING]   - 3 more...
   
   [WARNING] jaxb-core-4.0.3.jar, tika-server-core-3.0.0-SNAPSHOT.jar define 128 overlapping classes and resources: 
   [WARNING]   - META-INF/maven/org.glassfish.jaxb/jaxb-core/pom.properties
   [WARNING]   - META-INF/maven/org.glassfish.jaxb/jaxb-core/pom.xml
   [WARNING]   - org.glassfish.jaxb.core.Locatable
   [WARNING]   - org.glassfish.jaxb.core.StackHelper
   [WARNING]   - org.glassfish.jaxb.core.Utils
   [WARNING]   - org.glassfish.jaxb.core.WhiteSpaceProcessor
   [WARNING]   - org.glassfish.jaxb.core.annotation.OverrideAnnotationOf
   [WARNING]   - org.glassfish.jaxb.core.annotation.XmlIsSet
   [WARNING]   - org.glassfish.jaxb.core.annotation.XmlLocation
   [WARNING]   - org.glassfish.jaxb.core.api.ErrorListener
   [WARNING]   - 118 more...
   ```
   
   The bundle inclusion list in `tika-bundles/tika-bundle-standard/pom.xml` has 
to be addressed as well.




> Migrate to jakarta in Tika 3.x
> --
>
> Key: TIKA-3948
> URL: https://issues.apache.org/jira/browse/TIKA-3948
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
> Labels: tika-3x
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [tika] solomax commented on pull request #1345: TIKA-3948 -- migrate from javax -> jakarta

2023-09-15 Thread via GitHub


solomax commented on PR #1345:
URL: https://github.com/apache/tika/pull/1345#issuecomment-1722134821


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (TIKA-4129) Upgrade dependencies requiring > Java 8

2023-09-15 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17765862#comment-17765862
 ] 

Hudson commented on TIKA-4129:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk11 #1250 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk11/1250/])
TIKA-4129: update javadoc, mockito, jwarc, enforcer (tilman: 
[https://github.com/apache/tika/commit/3f7a89544a5d55a1cc6ae8adb154e6d409d0bbb3])
* (edit) tika-parent/pom.xml


> Upgrade dependencies requiring > Java 8
> ---
>
> Key: TIKA-4129
> URL: https://issues.apache.org/jira/browse/TIKA-4129
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
> Fix For: 3.0.0-BETA
>
>
> On TIKA-3735, we documented several dependencies that required > Java 8.  Now 
> that we're working with Java 11 on main, let's make those upgrades.
> There was already a separate ticket open for Lucene (TIKA-3641), so let's 
> upgrade these:
>  * Apache OpenNLP 2.0.0 requires Java 11.
>  * DL4J 1.0.0-M2.1 - datavec-data-image-1.0.0-M2.1.jar requires Java 11
>  * Fakeload
>  * [checkstyle|https://mail.google.com/mail/u/0/#label/lists%2Ftika/WhctKKXXHvjnJRRdBSwLbKkDkXQtRnWGDhblVMQQZhjsDGrFpRMRQJJrZSdskrNCqcmTtjL]
>  * errorprone requires Java 11 for the build (doesn't mean we can't target 8)





[jira] [Commented] (TIKA-4129) Upgrade dependencies requiring > Java 8

2023-09-15 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17765830#comment-17765830
 ] 

Hudson commented on TIKA-4129:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk11 #1249 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk11/1249/])
TIKA-4129: update tyrus (tilman: 
[https://github.com/apache/tika/commit/7da1d6b70b370ea7fc06793b574dd2776981aeb3])
* (edit) tika-translate/pom.xml




[jira] [Commented] (TIKA-4133) Add capture group metadataFilter

2023-09-15 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17765829#comment-17765829
 ] 

Hudson commented on TIKA-4133:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk11 #1249 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk11/1249/])
TIKA-4133 -- add a capture group metadatafilter (#1346) (github: 
[https://github.com/apache/tika/commit/aeb637b5761f65514bc3c0ede3f1f893ba7f14ff])
* (add) 
tika-core/src/test/resources/org/apache/tika/config/TIKA-4133-capture-group-overwrite.xml
* (edit) 
tika-core/src/test/java/org/apache/tika/metadata/filter/TestMetadataFilter.java
* (add) 
tika-core/src/test/resources/org/apache/tika/config/TIKA-4133-capture-group.xml
* (add) 
tika-core/src/main/java/org/apache/tika/metadata/filter/CaptureGroupMetadataFilter.java


> Add capture group metadataFilter
> 
>
> Key: TIKA-4133
> URL: https://issues.apache.org/jira/browse/TIKA-4133
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Trivial
>
> There are some cases where it would be useful to run a regex to capture 
> specific values in a metadata object.
> For example, some users might not want the mime attributes (e.g. charset) as 
> in "text/html; charset=UTF-8".
> Let's start with a simple regex capture group filter.  If we need to capture 
> multiple matches etc, we can add that on a later ticket.





[jira] [Commented] (TIKA-4123) Update to 2.9.1

2023-09-15 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17765755#comment-17765755
 ] 

Tilman Hausherr commented on TIKA-4123:
---

Yes that's fine.

> Update to 2.9.1
> ---
>
> Key: TIKA-4123
> URL: https://issues.apache.org/jira/browse/TIKA-4123
> Project: Tika
> Issue Type: Task
> Components: build
> Affects Versions: 2.9.0
> Reporter: Tilman Hausherr
> Priority: Minor
>






[jira] [Commented] (TIKA-4123) Update to 2.9.1

2023-09-15 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17765754#comment-17765754
 ] 

Tim Allison commented on TIKA-4123:
---

Sorry.  My proposal was to keep doing what you were doing (if you want to?) and 
just keep merging the dependabot PRs on {{main}}.

Rather than cherry-picking all those back to 2.x one at a time, I can do a bulk 
update on {{branch_2x}} before our next 2.x release?



[jira] [Commented] (TIKA-4133) Add capture group metadataFilter

2023-09-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17765752#comment-17765752
 ] 

ASF GitHub Bot commented on TIKA-4133:
--

tballison merged PR #1346:
URL: https://github.com/apache/tika/pull/1346






[GitHub] [tika] tballison merged pull request #1346: TIKA-4133 -- add a capture group metadatafilter

2023-09-15 Thread via GitHub


tballison merged PR #1346:
URL: https://github.com/apache/tika/pull/1346





[jira] [Comment Edited] (TIKA-4123) Update to 2.9.1

2023-09-15 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17765743#comment-17765743
 ] 

Tilman Hausherr edited comment on TIKA-4123 at 9/15/23 5:48 PM:


Yes... I just set up a clone and had a look and it seems to be behind on a lot. 
I've run these checks from time to time in the past so we get any problems long 
before a release is intended instead of all at once. But I'll miss dependabot. 
It wasn't perfect but it did find updates that the versions plugin didn't find, 
and was also able to keep "boundaries", i.e. ignore major versions of some 
artefacts.


was (Author: tilman):
Yes... I just set up a clone and had a look and it seems to be behind on a lot. 
I've run these checks from time to time in the past so we get any problems long 
before a release is intended instead of all at once. But I'll miss dependabot. 
It wasn't perfect but it did find updates that the versions plugin didn't find.



[jira] [Commented] (TIKA-4123) Update to 2.9.1

2023-09-15 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17765743#comment-17765743
 ] 

Tilman Hausherr commented on TIKA-4123:
---

Yes... I just set up a clone and had a look and it seems to be behind on a lot. 
I've run these checks from time to time in the past so we get any problems long 
before a release is intended instead of all at once. But I'll miss dependabot. 
It wasn't perfect but it did find updates that the versions plugin didn't find.



[jira] [Commented] (TIKA-4123) Update to 2.9.1

2023-09-15 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17765727#comment-17765727
 ] 

Tim Allison commented on TIKA-4123:
---

Now that we have a 2.x branch, I'm wondering if we should continue, as before, 
taking in the dependabot updates on {{main}}, and then run 
{{mvn versions:display-dependency-updates}} and make the updates on 2.x shortly 
before the next 2.x release?



[jira] [Resolved] (TIKA-4120) GDAL unit test failing with recent version of gdal

2023-09-15 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-4120.
---
Fix Version/s: 2.9.1, 3.0.0-BETA
Resolution: Fixed

> GDAL unit test failing with recent version of gdal
> --
>
> Key: TIKA-4120
> URL: https://issues.apache.org/jira/browse/TIKA-4120
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Trivial
> Fix For: 2.9.1, 3.0.0-BETA
>
>
> I just brew installed: "GDAL 3.7.1, released 2023/07/06", and I'm getting a 
> unit test failure on the gdal parser.  That version apparently doesn't 
> extract the coordinate system.  All the other gdal tests pass.





[jira] [Commented] (TIKA-4123) Update to 2.9.1

2023-09-15 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17765723#comment-17765723
 ] 

Hudson commented on TIKA-4123:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk11 #1248 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk11/1248/])
TIKA-4123 -- general updates for 3.0.0-BETA -- upgrade commons-compress 
(tallison: 
[https://github.com/apache/tika/commit/3c882460838c818ab2aff310d1fba9a084fe4800])
* (edit) tika-parent/pom.xml




[jira] [Commented] (TIKA-4120) GDAL unit test failing with recent version of gdal

2023-09-15 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17765724#comment-17765724
 ] 

Hudson commented on TIKA-4120:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk11 #1248 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk11/1248/])
TIKA-4120 -- comment out test that breaks with recent version of gdalinfo 
(tallison: 
[https://github.com/apache/tika/commit/3adb2e2ad44421142e4da1c283be9cead1c1d10a])
* (edit) 
tika-parsers/tika-parsers-extended/tika-parser-scientific-module/src/test/java/org/apache/tika/parser/gdal/TestGDALParser.java




[jira] [Commented] (TIKA-4133) Add capture group metadataFilter

2023-09-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17765717#comment-17765717
 ] 

ASF GitHub Bot commented on TIKA-4133:
--

tballison opened a new pull request, #1346:
URL: https://github.com/apache/tika/pull/1346

   
   
   Thanks for your contribution to [Apache Tika](https://tika.apache.org/)! 
Your help is appreciated!
   
   Before opening the pull request, please verify that
   * there is an open issue on the [Tika issue 
tracker](https://issues.apache.org/jira/projects/TIKA) which describes the 
problem or the improvement. We cannot accept pull requests without an issue 
because the change wouldn't be listed in the release notes.
   * the issue ID (`TIKA-`)
     - is referenced in the title of the pull request
     - and placed in front of your commit messages surrounded by square 
brackets (`[TIKA-] Issue or pull request title`)
   * commits are squashed into a single one (or a few commits for larger changes)
   * Tika is successfully built and unit tests pass by running `mvn clean test`
   * there are no conflicts when merging the pull request branch into the 
*recent* `main` branch. If there are conflicts, please try to rebase the pull 
request branch on top of a freshly pulled `main` branch
   * if you add a new module that downstream users will depend upon, add it to 
the relevant group in `tika-bom/pom.xml`.
   
   We will be able to integrate your pull request faster if these conditions 
are met. If you have any questions about how to fix your problem or about using 
Tika in general, please sign up for the [Tika mailing 
list](http://tika.apache.org/mail-lists.html). Thanks!
   






[GitHub] [tika] tballison opened a new pull request, #1346: TIKA-4133 -- add a capture group metadatafilter

2023-09-15 Thread via GitHub


tballison opened a new pull request, #1346:
URL: https://github.com/apache/tika/pull/1346

   
   



[jira] [Created] (TIKA-4133) Add capture group metadataFilter

2023-09-15 Thread Tim Allison (Jira)
Tim Allison created TIKA-4133:
-

 Summary: Add capture group metadataFilter
 Key: TIKA-4133
 URL: https://issues.apache.org/jira/browse/TIKA-4133
 Project: Tika
  Issue Type: Task
Reporter: Tim Allison


There are some cases where it would be useful to run a regex to capture 
specific values in a metadata object.

For example, some users might not want the mime attributes (e.g. charset) as in 
"text/html; charset=UTF-8".

Let's start with a simple regex capture group filter.  If we need to capture 
multiple matches etc, we can add that on a later ticket.
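
The idea above can be sketched with plain `java.util.regex`. This is only an illustration of a single capture group applied to one metadata value, not the `CaptureGroupMetadataFilter` that was later committed: the class name `CaptureGroupSketch`, the method `firstGroup`, and the pattern itself are hypothetical, and no Tika API is used.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; it does not use any Tika classes.
final class CaptureGroupSketch {

    // Assumed pattern: group 1 is the media type, without any ";"-separated attributes.
    static final Pattern MIME_WITHOUT_ATTRIBUTES =
            Pattern.compile("^([^;]+?)\\s*(?:;.*)?$");

    // Returns the first capture group of the first match, or empty if there is no match.
    static Optional<String> firstGroup(Pattern pattern, String value) {
        Matcher m = pattern.matcher(value);
        if (m.find() && m.groupCount() >= 1) {
            return Optional.ofNullable(m.group(1));
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // prints "text/html"
        System.out.println(
                firstGroup(MIME_WITHOUT_ATTRIBUTES, "text/html; charset=UTF-8")
                        .orElse("<no match>"));
    }
}
```

A value that does not match (an empty string, for example) yields an empty `Optional`, so a caller could decide whether to keep the original value or drop it. Capturing multiple matches is left out here, as the description suggests.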





[jira] [Commented] (TIKA-3948) Migrate to jakarta in Tika 3.x

2023-09-15 Thread Martin Desruisseaux (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17765680#comment-17765680
 ] 

Martin Desruisseaux commented on TIKA-3948:
---

Hello Tim. Snapshots have been deployed on 
[https://repository.apache.org/content/repositories/snapshots/]

I still have to work on signing and bundling of source code and javadoc (this 
is my first deployment with Gradle instead of Maven, so I still have to learn). 
After that I can start a thread for the SIS release. It may take about 2 weeks 
for discussion, vote, etc.



[jira] [Commented] (TIKA-3948) Migrate to jakarta in Tika 3.x

2023-09-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17765635#comment-17765635
 ] 

ASF GitHub Bot commented on TIKA-3948:
--

tballison commented on PR #1345:
URL: https://github.com/apache/tika/pull/1345#issuecomment-1721284115

   Draft stage until the next SIS release.  We cannot merge until that release. 
 The current draft relies on a local build of their snapshot.






[jira] [Commented] (TIKA-3948) Migrate to jakarta in Tika 3.x

2023-09-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17765634#comment-17765634
 ] 

ASF GitHub Bot commented on TIKA-3948:
--

tballison opened a new pull request, #1345:
URL: https://github.com/apache/tika/pull/1345

   
   
   






[GitHub] [tika] tballison commented on pull request #1345: TIKA-3948 -- migrate from javax -> jakarta

2023-09-15 Thread via GitHub


tballison commented on PR #1345:
URL: https://github.com/apache/tika/pull/1345#issuecomment-1721284115

   Draft stage until the next SIS release.  We cannot merge until that release. 
 The current draft relies on a local build of their snapshot.





[GitHub] [tika] tballison opened a new pull request, #1345: TIKA-3948 -- migrate from javax -> jakarta

2023-09-15 Thread via GitHub


tballison opened a new pull request, #1345:
URL: https://github.com/apache/tika/pull/1345

   
   



[jira] [Commented] (TIKA-3948) Migrate to jakarta in Tika 3.x

2023-09-15 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17765632#comment-17765632
 ] 

Tim Allison commented on TIKA-3948:
---

Got a clean local build with Java 17 and confirmed all works running on Java 11.



[jira] [Commented] (TIKA-3948) Migrate to jakarta in Tika 3.x

2023-09-15 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17765630#comment-17765630
 ] 

Tim Allison commented on TIKA-3948:
---

Given that the changes between 2.x and 3.x are so trivial, I propose that we 
release a 3.0.0-BETA and then 3.0.0.  I don't think we should bother with an 
ALPHA release.  WDYT?



[jira] [Commented] (TIKA-3948) Migrate to jakarta in Tika 3.x

2023-09-15 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17765629#comment-17765629
 ] 

Tim Allison commented on TIKA-3948:
---

Almost. We can't use a snapshot dependency in a release, obv.

Also: https://issues.apache.org/jira/browse/TIKA-4132




[jira] [Commented] (TIKA-3948) Migrate to jakarta in Tika 3.x

2023-09-15 Thread Maxim Solodovnik (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17765627#comment-17765627
 ] 

Maxim Solodovnik commented on TIKA-3948:


Time to release? ;)



[jira] [Commented] (TIKA-3948) Migrate to jakarta in Tika 3.x

2023-09-15 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17765624#comment-17765624
 ] 

Tim Allison commented on TIKA-3948:
---

W00t!   Thank you [~desruisseaux]!

[~solomax], I'm now getting a full clean build on the TIKA-3948 branch!!!



[jira] [Commented] (TIKA-3948) Migrate to jakarta in Tika 3.x

2023-09-15 Thread Martin Desruisseaux (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17765621#comment-17765621
 ] 

Martin Desruisseaux commented on TIKA-3948:
---

Hello Tim

This is an error in the SIS {{build.gradle.kts}} file. I already have a local 
fix, not yet pushed (will push in a few hours). Sorry for the delay, and glad 
that you could work around it!

I'm working right now on getting the CI to work so that snapshots can be 
deployed again. I will post a new comment when it is ready.




[jira] [Commented] (TIKA-3948) Migrate to jakarta in Tika 3.x

2023-09-15 Thread Maxim Solodovnik (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17765619#comment-17765619
 ] 

Maxim Solodovnik commented on TIKA-3948:


[~tallison] I've also noticed that the SNAPSHOT is very old and not working.

I wrote about it here: 
https://lists.apache.org/thread/5y5wjhjy2kjjfmhyp6jz1komg4fhnjv4 



[jira] [Created] (TIKA-4132) Remove deprecated items and carry out other small breaking changes for 3.x

2023-09-15 Thread Tim Allison (Jira)
Tim Allison created TIKA-4132:
-

 Summary: Remove deprecated items and carry out other small 
breaking changes for 3.x
 Key: TIKA-4132
 URL: https://issues.apache.org/jira/browse/TIKA-4132
 Project: Tika
  Issue Type: Task
Reporter: Tim Allison


Let's use this ticket to track removing deprecated bits and making small 
breaking changes to 3.x.

Small breaking changes:
1) move the boilerpipe handler out of tika-parsers into a new boilerpipe module 
underneath tika-handlers?

Deprecated items:
1) remove digesting option from app's cli





[jira] [Comment Edited] (TIKA-3948) Migrate to jakarta in Tika 3.x

2023-09-15 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17765605#comment-17765605
 ] 

Tim Allison edited comment on TIKA-3948 at 9/15/23 12:49 PM:
-

Manually copied the jar from 
"https://mvnrepository.com/artifact/org.apache.sis.core/sis-cql/2.0-M0070" to 
incubator/build/libs and renamed the jar. Success!  Made a small change to our 
parser, and I'm getting a clean build and successful test on our GeoInfoParser. 
 

Now to see if I get a clean build on the rest of Tika!


was (Author: talli...@mitre.org):
Manually copied the jar from 
"https://mvnrepository.com/artifact/org.apache.sis.core/sis-cql/2.0-M0070" to 
incubator/build/libs and was able to publish to maven local.  Made a small 
change to our parser, and I'm getting a clean build and successful test on our 
GeoInfoParser.



[jira] [Commented] (TIKA-3948) Migrate to jakarta in Tika 3.x

2023-09-15 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17765607#comment-17765607
 ] 

Tim Allison commented on TIKA-3948:
---

[~desruisseaux] was the above (having to manually copy a jar to install to 
maven local) a user error or area for improvement in sis?



[jira] [Commented] (TIKA-3948) Migrate to jakarta in Tika 3.x

2023-09-15 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17765605#comment-17765605
 ] 

Tim Allison commented on TIKA-3948:
---

Manually copied the jar from 
"https://mvnrepository.com/artifact/org.apache.sis.core/sis-cql/2.0-M0070" to 
incubator/build/libs and was able to publish to maven local.  Made a small 
change to our parser, and I'm getting a clean build and successful test on our 
GeoInfoParser.



[jira] [Commented] (TIKA-3948) Migrate to jakarta in Tika 3.x

2023-09-15 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17765604#comment-17765604
 ] 

Tim Allison commented on TIKA-3948:
---

Upgraded gradle and can get a clean sis build.  Not having luck with the 
"publishToMavenLocal" task...



[jira] [Commented] (TIKA-3948) Migrate to jakarta in Tika 3.x

2023-09-15 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17765599#comment-17765599
 ] 

Tim Allison commented on TIKA-3948:
---

[~solomax] I turned on the unit test for the parser that uses sis, and I'm 
still getting the javax error.  The problem is that the snapshot that maven is 
pulling -- sis-storage-1.4-20221226.183817-3.jar -- is quite old.  I got build 
errors when I just tried to pull sis main and build it locally.
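
To put a number on "quite old": Maven's timestamped snapshot jars carry their build time in the file name, in the form artifactId-version-yyyyMMdd.HHmmss-buildNumber.jar. The sketch below parses the file name quoted above and reports the snapshot's age relative to a given date. The helper names are hypothetical, and the pattern assumes the version part contains no hyphen.

```python
import re
from datetime import datetime, timezone

# artifactId-version-yyyyMMdd.HHmmss-buildNumber.jar
# Assumption: the version itself contains no hyphen.
SNAPSHOT_RE = re.compile(
    r"^(?P<artifact>.+)-(?P<version>[^-]+)"
    r"-(?P<stamp>\d{8}\.\d{6})-(?P<build>\d+)\.jar$"
)


def parse_snapshot_jar(filename):
    """Split a timestamped snapshot jar name into its parts."""
    match = SNAPSHOT_RE.match(filename)
    if match is None:
        raise ValueError(f"not a timestamped snapshot jar: {filename}")
    # Snapshot timestamps are recorded in UTC.
    built = datetime.strptime(match.group("stamp"), "%Y%m%d.%H%M%S")
    built = built.replace(tzinfo=timezone.utc)
    return (
        match.group("artifact"),
        match.group("version"),
        built,
        int(match.group("build")),
    )


def snapshot_age_days(filename, now):
    """Whole days between the snapshot's build time and `now` (timezone-aware)."""
    _, _, built, _ = parse_snapshot_jar(filename)
    return (now - built).days


if __name__ == "__main__":
    jar = "sis-storage-1.4-20221226.183817-3.jar"
    artifact, version, built, build = parse_snapshot_jar(jar)
    print(artifact, version, built.isoformat(), build)
    print(snapshot_age_days(jar, datetime.now(timezone.utc)), "days old")
```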



[jira] [Commented] (TIKA-3948) Migrate to jakarta in Tika 3.x

2023-09-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17765596#comment-17765596
 ] 

ASF GitHub Bot commented on TIKA-3948:
--

tballison merged PR #1342:
URL: https://github.com/apache/tika/pull/1342






[GitHub] [tika] tballison merged pull request #1342: [TIKA-3948] switching to the SNAPSHOT version of apache.sis

2023-09-15 Thread via GitHub


tballison merged PR #1342:
URL: https://github.com/apache/tika/pull/1342


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org