[jira] [Commented] (TIKA-4280) Tasks for the 3.0.0 release
[ https://issues.apache.org/jira/browse/TIKA-4280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17868678#comment-17868678 ] Nicholas DiPiazza commented on TIKA-4280: - For tika-server we normally produced a single jar file. Now we will produce a jar file along with a directory of other jar files. You can run the server using Maven via exec:java. And when building for production, do we have to add some sort of .sh/.bat file that shows users how to launch it? > Tasks for the 3.0.0 release > --- > > Key: TIKA-4280 > URL: https://issues.apache.org/jira/browse/TIKA-4280 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > > I'm too lazy to open separate tickets. Please do so if desired. > Some items: > * Before releasing the real 3.0.0 we need to remove any "-M" dependencies > * Decide about the ffmpeg issue and the hdf5 issue > * Run the regression tests vs 2.9.x > * Convert tika-grpc to use the dependency plugin instead of the shade plugin > * Turn javadocs back on. I got errors during the deploy process because > javadoc needed the auto-generated code ("cannot find symbol > DeleteFetcherRequest"). We need to enable javadocs for the rest of the > project. > Other things? Thank you [~tilman] for the first two! -- This message was sent by Atlassian Jira (v8.20.10#820010)
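The .sh/.bat launcher question above could be answered with a script along these lines. This is a sketch only: the jar name, the /opt/tika layout, and the TikaServerCli main class are assumptions for illustration, not the confirmed 3.0.0 artifact layout.

```shell
#!/bin/sh
# Hypothetical launcher for the new tika-server layout: one main jar plus a
# lib/ directory of dependency jars. Paths and jar names are illustrative.
TIKA_HOME="${TIKA_HOME:-/opt/tika}"
# Put the main jar and every jar in lib/ on the classpath.
CP="$TIKA_HOME/tika-server-standard.jar:$TIKA_HOME/lib/*"
# Printed rather than exec'd so the sketch runs without the jars present;
# a real launcher would use:
#   exec java -cp "$CP" org.apache.tika.server.core.TikaServerCli "$@"
echo "java -cp $CP org.apache.tika.server.core.TikaServerCli $*"
```

A matching .bat file would build the same classpath with `%TIKA_HOME%` and `;` as the separator.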
[jira] [Created] (TIKA-4286) fix issues where MS graph fetcher is missing deps
Nicholas DiPiazza created TIKA-4286: --- Summary: fix issues where MS graph fetcher is missing deps Key: TIKA-4286 URL: https://issues.apache.org/jira/browse/TIKA-4286 Project: Tika Issue Type: Task Components: tika-pipes Affects Versions: 3.0.0-BETA Reporter: Nicholas DiPiazza When trying to save the MS Graph Fetcher in Tika Grpc, it errors out due to missing classes.
[jira] [Updated] (TIKA-4272) create tika docker image for tika-grpc
[ https://issues.apache.org/jira/browse/TIKA-4272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas DiPiazza updated TIKA-4272: Description: now that the tika-grpc branch has been merged to main, we need a tika-grpc server image. I thought for a bit about using the same tika docker image as we already use, but that is probably not a good idea because there are vastly different jar files needed for tika-grpc was: now that the tika-grpc branch has been merged to main, tika-docker image needs to be changed so that we can use tika-grpc... same thing as tika-server but with the grpc runner instead of the tika rest services > create tika docker image for tika-grpc > -- > > Key: TIKA-4272 > URL: https://issues.apache.org/jira/browse/TIKA-4272 > Project: Tika > Issue Type: New Feature > Components: tika-pipes >Reporter: Nicholas DiPiazza >Priority: Major > > now that the tika-grpc branch has been merged to main, we need a tika-grpc > server image. > I thought for a bit about using the same tika docker image as we already use, > but that is probably not a good idea because there are vastly different jar > files needed for tika-grpc
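A separate tika-grpc image per the issue above could start from something like the following. Everything here is an illustrative assumption, not the project's published Dockerfile: the base image, the jar layout, the TikaGrpcServer main class, and port 50051 (the conventional gRPC default).

```shell
# Sketch: generate a candidate Dockerfile for a standalone tika-grpc image.
cat > Dockerfile.tika-grpc <<'EOF'
FROM eclipse-temurin:17-jre
# Main jar plus a lib/ directory of dependency jars (hypothetical layout).
COPY tika-grpc.jar /tika/tika-grpc.jar
COPY lib/ /tika/lib/
EXPOSE 50051
ENTRYPOINT ["java", "-cp", "/tika/tika-grpc.jar:/tika/lib/*", "org.apache.tika.pipes.grpc.TikaGrpcServer"]
EOF
# Build and run (not executed here; requires the built jars):
#   docker build -f Dockerfile.tika-grpc -t tika-grpc:3.0.0 .
#   docker run -p 50051:50051 tika-grpc:3.0.0
echo "wrote Dockerfile.tika-grpc"
```

Keeping this separate from the existing tika-docker image avoids shipping two largely disjoint sets of jars in one image.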
[jira] [Updated] (TIKA-4272) make changes to tika docker image so that tika can run grpc server or rest server
[ https://issues.apache.org/jira/browse/TIKA-4272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas DiPiazza updated TIKA-4272: Description: now that the tika-grpc branch has been merged to main, tika-docker image needs to be changed so that we can use tika-grpc... same thing as tika-server but with the grpc runner instead of the tika rest services (was: now that the tika-grpc branch has been merged to main, create a new tika-docker image for tika-grpc... same thing as tika-server but with the grpc runner instead of the tika rest services) > make changes to tika docker image so that tika can run grpc server or rest > server > - > > Key: TIKA-4272 > URL: https://issues.apache.org/jira/browse/TIKA-4272 > Project: Tika > Issue Type: New Feature > Components: tika-pipes >Reporter: Nicholas DiPiazza >Priority: Major > > now that the tika-grpc branch has been merged to main, tika-docker image needs > to be changed so that we can use tika-grpc... same thing as tika-server but > with the grpc runner instead of the tika rest services
[jira] [Updated] (TIKA-4272) create tika docker image for tika-grpc
[ https://issues.apache.org/jira/browse/TIKA-4272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas DiPiazza updated TIKA-4272: Summary: create tika docker image for tika-grpc (was: make changes to tika docker image so that tika can run grpc server or rest server) > create tika docker image for tika-grpc > -- > > Key: TIKA-4272 > URL: https://issues.apache.org/jira/browse/TIKA-4272 > Project: Tika > Issue Type: New Feature > Components: tika-pipes >Reporter: Nicholas DiPiazza >Priority: Major > > now that the tika-grpc branch has been merged to main, tika-docker image needs > to be changed so that we can use tika-grpc... same thing as tika-server but > with the grpc runner instead of the tika rest services
[jira] [Updated] (TIKA-4272) make changes to tika docker image so that tika can run grpc server or rest server
[ https://issues.apache.org/jira/browse/TIKA-4272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas DiPiazza updated TIKA-4272: Summary: make changes to tika docker image so that tika can run grpc server or rest server (was: create a Docker image for tika-grpc-server) > make changes to tika docker image so that tika can run grpc server or rest > server > - > > Key: TIKA-4272 > URL: https://issues.apache.org/jira/browse/TIKA-4272 > Project: Tika > Issue Type: New Feature > Components: tika-pipes >Reporter: Nicholas DiPiazza >Priority: Major > > now that the tika-grpc branch has been merged to main, create a new > tika-docker image for tika-grpc... same thing as tika-server but with the > grpc runner instead of the tika rest services
[jira] [Updated] (TIKA-4272) create a Docker image for tika-grpc-server
[ https://issues.apache.org/jira/browse/TIKA-4272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas DiPiazza updated TIKA-4272: Summary: create a Docker image for tika-grpc-server (was: create an image for tika-grpc-server) > create a Docker image for tika-grpc-server > -- > > Key: TIKA-4272 > URL: https://issues.apache.org/jira/browse/TIKA-4272 > Project: Tika > Issue Type: New Feature > Components: tika-pipes >Reporter: Nicholas DiPiazza >Priority: Major > > now that the tika-grpc branch has been merged to main, create a new > tika-docker image for tika-grpc... same thing as tika-server but with the > grpc runner instead of the tika rest services
[jira] [Updated] (TIKA-4273) create a helm deployment for tika-grpc
[ https://issues.apache.org/jira/browse/TIKA-4273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas DiPiazza updated TIKA-4273: Description: after we have created a tika-grpc image, we need to create a deployment in the tika helm chart. > create a helm deployment for tika-grpc > -- > > Key: TIKA-4273 > URL: https://issues.apache.org/jira/browse/TIKA-4273 > Project: Tika > Issue Type: New Feature >Reporter: Nicholas DiPiazza >Priority: Major > > after we have created a tika-grpc image, we need to create a deployment in > the tika helm chart. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-4273) create a helm deployment for tika-grpc
Nicholas DiPiazza created TIKA-4273: --- Summary: create a helm deployment for tika-grpc Key: TIKA-4273 URL: https://issues.apache.org/jira/browse/TIKA-4273 Project: Tika Issue Type: New Feature Reporter: Nicholas DiPiazza -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-4272) create an image for tika-grpc-server
Nicholas DiPiazza created TIKA-4272: --- Summary: create an image for tika-grpc-server Key: TIKA-4272 URL: https://issues.apache.org/jira/browse/TIKA-4272 Project: Tika Issue Type: New Feature Components: tika-pipes Reporter: Nicholas DiPiazza Now that the tika-grpc branch has been merged to main, create a new tika-docker image for tika-grpc... same thing as tika-server but with the grpc runner instead of the tika rest services
[jira] [Commented] (TIKA-4251) [DISCUSS] move to cosium's git-code-format-maven-plugin with google-java-format
[ https://issues.apache.org/jira/browse/TIKA-4251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17860032#comment-17860032 ] Nicholas DiPiazza commented on TIKA-4251: - I agree with Google format being the new standard, given that wildcard imports are set to . > [DISCUSS] move to cosium's git-code-format-maven-plugin with > google-java-format > --- > > Key: TIKA-4251 > URL: https://issues.apache.org/jira/browse/TIKA-4251 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > > I was recently working a bit on incubator-stormcrawler, and I noticed that > they are using cosium's git-code-format-maven-plugin: > https://github.com/Cosium/git-code-format-maven-plugin > I was initially annoyed that I couldn't quickly figure out what I had to fix > to make the linter happy, but then I realized there was a magic command: > {{mvn git-code-format:format-code}} which just fixed the code so that the > linter passed. > The one drawback I found is that it does not fix nor does it alert on > wildcard imports. We could still use checkstyle for that but only have one > rule for checkstyle. > The other drawback is that there is not a lot of room for variation from > google's style. This may actually be a benefit, too, of course. > I just ran this on {{tika-core}} here: > https://github.com/apache/tika/tree/google-java-format > What would you think about making this change for 3.x?
[jira] [Commented] (TIKA-4251) [DISCUSS] move to cosium's git-code-format-maven-plugin with google-java-format
[ https://issues.apache.org/jira/browse/TIKA-4251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17860011#comment-17860011 ] Nicholas DiPiazza commented on TIKA-4251: - I volunteer to review the PR thoroughly. Here is how I will do it: 1) use IntelliJ to format the code using the checkstyle profile; 2) use Eclipse to format the code using the checkstyle profile. This is two different tools doing the same thing, so diffing the results should find minimal to no differences, which helps guarantee confidence. > [DISCUSS] move to cosium's git-code-format-maven-plugin with > google-java-format > --- > > Key: TIKA-4251 > URL: https://issues.apache.org/jira/browse/TIKA-4251 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major
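The two-formatter cross-check described in the comment above boils down to a recursive diff of two checkouts. A minimal sketch, with toy files standing in for the real IntelliJ- and Eclipse-formatted trees (paths are illustrative):

```shell
# Sketch: format two copies of the same commit with two different tools,
# then diff them; an empty diff means the formatters agree.
set -e
rm -rf /tmp/fmt-intellij /tmp/fmt-eclipse
mkdir -p /tmp/fmt-intellij /tmp/fmt-eclipse
# In practice each copy would be produced by one IDE's formatter run over
# the same commit with the same checkstyle/google-java-format profile.
printf 'class A {\n}\n' > /tmp/fmt-intellij/A.java
printf 'class A {\n}\n' > /tmp/fmt-eclipse/A.java
if diff -r /tmp/fmt-intellij /tmp/fmt-eclipse > /dev/null; then
  echo "formatters agree"
else
  echo "formatters disagree; inspect the diff"
fi
```

Any file that shows up in the diff is one where the two formatter configurations drift, which is exactly what the review is meant to catch.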
[jira] [Comment Edited] (TIKA-4251) [DISCUSS] move to cosium's git-code-format-maven-plugin with google-java-format
[ https://issues.apache.org/jira/browse/TIKA-4251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17860005#comment-17860005 ] Nicholas DiPiazza edited comment on TIKA-4251 at 6/25/24 6:42 PM: -- I guess we don't even need the maven plugin then. We can use IntelliJ to format all Java source one time, then use the "format code" option in the git commit dialog so that you always have formatted commits (given that you used IntelliJ to commit). Eclipse has this option as well, to format on save; same thing, as long as they are using Eclipse, they will never have checkstyle issues. This addresses the "stop having checkstyle back-and-forth that wastes tons of time" issue. was (Author: ndipiazza): I guess we don't even need the maven plugin then. We can use IntelliJ to format all Java source one time, then use the "format code" option in the git commit dialog so that you always have formatted commits (given that you used IntelliJ to commit). This addresses the "stop having checkstyle back-and-forth that wastes tons of time" issue. > [DISCUSS] move to cosium's git-code-format-maven-plugin with > google-java-format > --- > > Key: TIKA-4251 > URL: https://issues.apache.org/jira/browse/TIKA-4251 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major
[jira] [Comment Edited] (TIKA-4251) [DISCUSS] move to cosium's git-code-format-maven-plugin with google-java-format
[ https://issues.apache.org/jira/browse/TIKA-4251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17860005#comment-17860005 ] Nicholas DiPiazza edited comment on TIKA-4251 at 6/25/24 6:30 PM: -- I guess we don't even need the maven plugin then. We can use IntelliJ to format all Java source one time, then use the "format code" option in the git commit dialog so that you always have formatted commits (given that you used IntelliJ to commit). This addresses the "stop having checkstyle back-and-forth that wastes tons of time" issue. was (Author: ndipiazza): I guess we don't even need the maven plugin then. We can use IntelliJ to format all Java source one time, then use the "format code" option in the git commit dialog so that you always have formatted commits (given that you used IntelliJ to commit). This addresses the "stop having checkstyle back-and-forth that wastes tons of time" issue. > [DISCUSS] move to cosium's git-code-format-maven-plugin with > google-java-format > --- > > Key: TIKA-4251 > URL: https://issues.apache.org/jira/browse/TIKA-4251 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major
[jira] [Commented] (TIKA-4251) [DISCUSS] move to cosium's git-code-format-maven-plugin with google-java-format
[ https://issues.apache.org/jira/browse/TIKA-4251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17860005#comment-17860005 ] Nicholas DiPiazza commented on TIKA-4251: - I guess we don't even need the maven plugin then. We can use IntelliJ to format all Java source one time, then use the "format code" option in the git commit dialog so that you always have formatted commits (given that you used IntelliJ to commit). This addresses the "stop having checkstyle back-and-forth that wastes tons of time" issue. > [DISCUSS] move to cosium's git-code-format-maven-plugin with > google-java-format > --- > > Key: TIKA-4251 > URL: https://issues.apache.org/jira/browse/TIKA-4251 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major
[jira] [Comment Edited] (TIKA-4251) [DISCUSS] move to cosium's git-code-format-maven-plugin with google-java-format
[ https://issues.apache.org/jira/browse/TIKA-4251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17860004#comment-17860004 ] Nicholas DiPiazza edited comment on TIKA-4251 at 6/25/24 6:28 PM: -- I think as long as the plugin isn't transparently formatting code after commit, we are mitigating the risk. This becomes a tool you can plug in to a git hook locally, and it will produce PRs with formatted code that is going to be reviewed anyway. And the diffs should be very consumable, because we eat the one-time-format cost and reformatting again should incur no additional changes. was (Author: ndipiazza): I think as long as the plugin isn't transparently formatting code after commit, we are mitigating the risk. This becomes a tool you can plug in to a git hook locally, and it will produce PRs with code that is going to be reviewed anyway. And the diffs should be very consumable, because we eat the one-time-format cost and reformatting again should incur no additional changes. > [DISCUSS] move to cosium's git-code-format-maven-plugin with > google-java-format > --- > > Key: TIKA-4251 > URL: https://issues.apache.org/jira/browse/TIKA-4251 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major
[jira] [Commented] (TIKA-4251) [DISCUSS] move to cosium's git-code-format-maven-plugin with google-java-format
[ https://issues.apache.org/jira/browse/TIKA-4251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17860004#comment-17860004 ] Nicholas DiPiazza commented on TIKA-4251: - I think as long as the plugin isn't transparently formatting code after commit, we are mitigating the risk. This becomes a tool you can plug in to a git hook locally, and it will produce PRs with code that is going to be reviewed anyway. And the diffs should be very consumable, because we eat the one-time-format cost and reformatting again should incur no additional changes. > [DISCUSS] move to cosium's git-code-format-maven-plugin with > google-java-format > --- > > Key: TIKA-4251 > URL: https://issues.apache.org/jira/browse/TIKA-4251 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major
[jira] [Commented] (TIKA-4181) Tika Grpc Server using Tika Pipes
[ https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17859987#comment-17859987 ] Nicholas DiPiazza commented on TIKA-4181: - I will be merging this today. Any issues, let me know. > Tika Grpc Server using Tika Pipes > - > > Key: TIKA-4181 > URL: https://issues.apache.org/jira/browse/TIKA-4181 > Project: Tika > Issue Type: New Feature > Components: tika-pipes >Reporter: Nicholas DiPiazza >Priority: Major > Attachments: image-2024-02-06-07-54-50-116.png > > > Create a Tika Grpc server. > You should be able to create Tika Pipes fetchers, then use those fetchers. > You can then use those fetchers to FetchAndParse in 3 ways: > * synchronous fashion - you send a single request to fetch a file, and get a > single FetchAndParse response tuple. > * streaming output - you send a single request and stream back the > FetchAndParse response tuple. > * bi-directional streaming - you stream in 1 or more Fetch requests and > stream back FetchAndParse response tuples. > Requires we create a service contract that specifies the inputs we require > from each method. > Then we will need to implement the different components with a grpc client > generated using the contract. > This would enable developers to run tika-pipes as a persistently running > daemon instead of just a single batch app, because it can continue to stream > out more inputs. > !image-2024-02-06-07-54-50-116.png!
[jira] [Commented] (TIKA-4247) HttpFetcher - add ability to send request headers
[ https://issues.apache.org/jira/browse/TIKA-4247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17859985#comment-17859985 ] Nicholas DiPiazza commented on TIKA-4247: - I will be merging this today. Any follow-ups or issues, let me know. > HttpFetcher - add ability to send request headers > - > > Key: TIKA-4247 > URL: https://issues.apache.org/jira/browse/TIKA-4247 > Project: Tika > Issue Type: New Feature >Reporter: Nicholas DiPiazza >Priority: Major > > add ability to send request headers
[jira] [Commented] (TIKA-4237) Add JWT authentication ability to the http fetcher
[ https://issues.apache.org/jira/browse/TIKA-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17859984#comment-17859984 ] Nicholas DiPiazza commented on TIKA-4237: - I will be merging this shortly. Any issues, let me know. > Add JWT authentication ability to the http fetcher > -- > > Key: TIKA-4237 > URL: https://issues.apache.org/jira/browse/TIKA-4237 > Project: Tika > Issue Type: New Feature > Components: tika-pipes >Affects Versions: 3.0.0-BETA >Reporter: Nicholas DiPiazza >Priority: Major > > Add the ability to supply a JWT; > support both HS256 > and RS256
[jira] [Commented] (TIKA-4229) add microsoft graph fetcher
[ https://issues.apache.org/jira/browse/TIKA-4229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17859980#comment-17859980 ] Nicholas DiPiazza commented on TIKA-4229: - Will be merging this shortly. If anyone would like any changes, let me know and I'll work the changes in over the coming week or two. > add microsoft graph fetcher > --- > > Key: TIKA-4229 > URL: https://issues.apache.org/jira/browse/TIKA-4229 > Project: Tika > Issue Type: New Feature > Components: tika-pipes >Reporter: Nicholas DiPiazza >Priority: Major > > add a tika pipes fetcher capable of fetching files from MS graph api
[jira] [Updated] (TIKA-4181) Tika Grpc Server using Tika Pipes
[ https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas DiPiazza updated TIKA-4181: Summary: Tika Grpc Server using Tika Pipes (was: Grpc + Tika Pipes) > Tika Grpc Server using Tika Pipes > - > > Key: TIKA-4181 > URL: https://issues.apache.org/jira/browse/TIKA-4181 > Project: Tika > Issue Type: New Feature > Components: tika-pipes >Reporter: Nicholas DiPiazza >Priority: Major > Attachments: image-2024-02-06-07-54-50-116.png
[jira] [Updated] (TIKA-4181) Grpc + Tika Pipes
[ https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas DiPiazza updated TIKA-4181: Description: Create a Tika Grpc server. You should be able to create Tika Pipes fetchers, then use those fetchers. You can then use those fetchers to FetchAndParse in 3 ways: * synchronous fashion - you send a single request to fetch a file, and get a single FetchAndParse response tuple. * streaming output - you send a single request and stream back the FetchAndParse response tuple. * bi-directional streaming - you stream in 1 or more Fetch requests and stream back FetchAndParse response tuples. Requires we create a service contract that specifies the inputs we require from each method. Then we will need to implement the different components with a grpc client generated using the contract. This would enable developers to run tika-pipes as a persistently running daemon instead of just a single batch app, because it can continue to stream out more inputs. !image-2024-02-06-07-54-50-116.png! was: Add full tika-pipes support of grpc * pipe iterator * fetcher * emitter Requires we create a service contract that specifies the inputs we require from each method. Then we will need to implement the different components with a grpc client generated using the contract. This would enable developers to run tika-pipes as a persistently running daemon instead of just a single batch app, because it can continue to stream out more inputs. !image-2024-02-06-07-54-50-116.png! > Grpc + Tika Pipes > - > > Key: TIKA-4181 > URL: https://issues.apache.org/jira/browse/TIKA-4181 > Project: Tika > Issue Type: New Feature > Components: tika-pipes >Reporter: Nicholas DiPiazza >Priority: Major > Attachments: image-2024-02-06-07-54-50-116.png
[jira] [Updated] (TIKA-4181) Grpc + Tika Pipes
[ https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas DiPiazza updated TIKA-4181: Summary: Grpc + Tika Pipes (was: Grpc + Tika Pipes - pipe iterator and emitter) > Grpc + Tika Pipes > - > > Key: TIKA-4181 > URL: https://issues.apache.org/jira/browse/TIKA-4181 > Project: Tika > Issue Type: New Feature > Components: tika-pipes >Reporter: Nicholas DiPiazza >Priority: Major > Attachments: image-2024-02-06-07-54-50-116.png > > > Add full tika-pipes support of grpc > * pipe iterator > * fetcher > * emitter > Requires we create a service contract that specifies the inputs we require > from each method. > Then we will need to implement the different components with a grpc client > generated using the contract. > This would enable developers to run tika-pipes as a persistently running > daemon instead of just a single batch app, because it can continue to stream > out more inputs. > !image-2024-02-06-07-54-50-116.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (TIKA-4251) [DISCUSS] move to cosium's git-code-format-maven-plugin with google-java-format
[ https://issues.apache.org/jira/browse/TIKA-4251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17859757#comment-17859757 ] Nicholas DiPiazza edited comment on TIKA-4251 at 6/24/24 6:35 PM: -- we could keep everything as it is, but: * provide instructions on how to run the code formatter on the entire repo with google checkstyle. * run it on the entire codebase and commit the now-fully-formatted repo * advise everyone to turn on automatic code formatting in Intellij/Eclipse so that your code is formatted automatically. That way the plugin doesn't control us so much, but we still have an easy way to stay fully formatted, so we stop getting the back-and-forth with maven and CI when we forget to format something. was (Author: ndipiazza): we could keep everything how it is but: * provide instructions how to run the code formatter manually * run it on the entire codebase and commit the now-fully-formatted repo * advise everyone turn on the automatic code formatting in Intellij/Eclipse so that you automatically have your code formatted. Now that plugin doesn't control us so much, but we still have easy way to stay fully formatted so we stop getting the back-and-forth with maven and CI when we forget to format something. > [DISCUSS] move to cosium's git-code-format-maven-plugin with > google-java-format > --- > > Key: TIKA-4251 > URL: https://issues.apache.org/jira/browse/TIKA-4251 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > > I was recently working a bit on incubator-stormcrawler, and I noticed that > they are using cosium's git-code-format-maven-plugin: > https://github.com/Cosium/git-code-format-maven-plugin > I was initially annoyed that I couldn't quickly figure out what I had to fix > to make the linter happy, but then I realized there was a magic command: > {{mvn git-code-format:format-code}} which just fixed the code so that the > linter passed. 
> The one drawback I found is that it does not fix nor does it alert on > wildcard imports. We could still use checkstyle for that but only have one > rule for checkstyle. > The other drawback is that there is not a lot of room for variation from > google's style. This may actually be a benefit, too, of course. > I just ran this on {{tika-core}} here: > https://github.com/apache/tika/tree/google-java-format > What would you think about making this change for 3.x? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4251) [DISCUSS] move to cosium's git-code-format-maven-plugin with google-java-format
[ https://issues.apache.org/jira/browse/TIKA-4251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17859757#comment-17859757 ] Nicholas DiPiazza commented on TIKA-4251: - we could keep everything as it is, but: * provide instructions on how to run the code formatter manually * run it on the entire codebase and commit the now-fully-formatted repo * advise everyone to turn on automatic code formatting in Intellij/Eclipse so that your code is formatted automatically. That way the plugin doesn't control us so much, but we still have an easy way to stay fully formatted, so we stop getting the back-and-forth with maven and CI when we forget to format something. > [DISCUSS] move to cosium's git-code-format-maven-plugin with > google-java-format > --- > > Key: TIKA-4251 > URL: https://issues.apache.org/jira/browse/TIKA-4251 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > > I was recently working a bit on incubator-stormcrawler, and I noticed that > they are using cosium's git-code-format-maven-plugin: > https://github.com/Cosium/git-code-format-maven-plugin > I was initially annoyed that I couldn't quickly figure out what I had to fix > to make the linter happy, but then I realized there was a magic command: > {{mvn git-code-format:format-code}} which just fixed the code so that the > linter passed. > The one drawback I found is that it does not fix nor does it alert on > wildcard imports. We could still use checkstyle for that but only have one > rule for checkstyle. > The other drawback is that there is not a lot of room for variation from > google's style. This may actually be a benefit, too, of course. > I just ran this on {{tika-core}} here: > https://github.com/apache/tika/tree/google-java-format > What would you think about making this change for 3.x? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4243) tika configuration overhaul
[ https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17852895#comment-17852895 ] Nicholas DiPiazza commented on TIKA-4243: - new ticket. Let's close this out. > tika configuration overhaul > --- > > Key: TIKA-4243 > URL: https://issues.apache.org/jira/browse/TIKA-4243 > Project: Tika > Issue Type: New Feature > Components: config >Affects Versions: 3.0.0 >Reporter: Nicholas DiPiazza >Priority: Major > > In 3.0.0 when dealing with Tika, it would greatly help to have a Typed > Configuration schema. > In 3.x can we remove the old way of doing configs and replace with Json > Schema? > Json Schema can be converted to Pojos using a maven plugin > [https://github.com/joelittlejohn/jsonschema2pojo] > This automatically creates a Java Pojo model we can use for the configs. > This can allow for the legacy tika-config XML to be read and converted to the > new pojos easily using an XML mapper so that users don't have to use JSON > configurations yet if they do not want. > When complete, configurations can be set as XML, JSON or YAML > tika-config.xml > tika-config.json > tika-config.yaml > Replace all instances of tika config annotations that used the old syntax, > and replace with the Pojo model serialized from the xml/json/yaml. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (TIKA-4243) tika configuration overhaul
[ https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas DiPiazza resolved TIKA-4243. - Fix Version/s: 3.0.0 Resolution: Fixed > tika configuration overhaul > --- > > Key: TIKA-4243 > URL: https://issues.apache.org/jira/browse/TIKA-4243 > Project: Tika > Issue Type: New Feature > Components: config >Affects Versions: 3.0.0 >Reporter: Nicholas DiPiazza >Priority: Major > Fix For: 3.0.0 > > > In 3.0.0 when dealing with Tika, it would greatly help to have a Typed > Configuration schema. > In 3.x can we remove the old way of doing configs and replace with Json > Schema? > Json Schema can be converted to Pojos using a maven plugin > [https://github.com/joelittlejohn/jsonschema2pojo] > This automatically creates a Java Pojo model we can use for the configs. > This can allow for the legacy tika-config XML to be read and converted to the > new pojos easily using an XML mapper so that users don't have to use JSON > configurations yet if they do not want. > When complete, configurations can be set as XML, JSON or YAML > tika-config.xml > tika-config.json > tika-config.yaml > Replace all instances of tika config annotations that used the old syntax, > and replace with the Pojo model serialized from the xml/json/yaml. -- This message was sent by Atlassian Jira (v8.20.10#820010)
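To make the XML/JSON/YAML equivalence described in this ticket concrete, the same hypothetical fetcher definition could be expressed in all three syntaxes. The element and property names below are illustrative only, not the final schema (the HttpFetcher class name is the existing tika-pipes fetcher, but its configuration keys here are guesses):

```xml
<!-- tika-config.xml (illustrative) -->
<tika-config>
  <fetchers>
    <fetcher name="http-fetcher" class="org.apache.tika.pipes.fetcher.http.HttpFetcher">
      <requestTimeout>30000</requestTimeout>
    </fetcher>
  </fetchers>
</tika-config>
```

tika-config.json, the same structure after POJO round-tripping:

```json
{
  "fetchers": [
    {
      "name": "http-fetcher",
      "class": "org.apache.tika.pipes.fetcher.http.HttpFetcher",
      "requestTimeout": 30000
    }
  ]
}
```

tika-config.yaml:

```yaml
# same fetcher, YAML syntax
fetchers:
  - name: http-fetcher
    class: org.apache.tika.pipes.fetcher.http.HttpFetcher
    requestTimeout: 30000
```

Because all three deserialize into the same POJO model, the legacy XML reader becomes just another front end to a single typed configuration.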
[jira] [Created] (TIKA-4264) Tika Pipes - Structured output (XHTML) support?
Nicholas DiPiazza created TIKA-4264: --- Summary: Tika Pipes - Structured output (XHTML) support? Key: TIKA-4264 URL: https://issues.apache.org/jira/browse/TIKA-4264 Project: Tika Issue Type: Bug Components: tika-pipes Reporter: Nicholas DiPiazza So I am able to use Tika Pipes to extract the text content from a document. But is it possible to use Tika Pipes to obtain structured documents? I believe Tika does this in XHTML. The plain text extracted from the document is great for indexing into a search engine. But what if you want structured text output, like XHTML? -- This message was sent by Atlassian Jira (v8.20.10#820010)
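For reference, core Tika (outside of tika-pipes) can already emit structured XHTML by swapping the SAX content handler. A rough sketch using Tika's ToXMLContentHandler — this assumes the tika-core and tika-parsers dependencies are on the classpath, and whether tika-pipes should expose the same option is the open question in this ticket:

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.ToXMLContentHandler;

public class XhtmlDemo {
    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        // ToXMLContentHandler serializes the SAX events as XHTML
        // instead of flattening them to plain text.
        ToXMLContentHandler handler = new ToXMLContentHandler();
        Metadata metadata = new Metadata();
        try (InputStream stream = Files.newInputStream(Paths.get(args[0]))) {
            parser.parse(stream, handler, metadata);
        }
        System.out.println(handler.toString()); // <html ...> ... </html>
    }
}
```

Swapping in a BodyContentHandler instead would give back the plain-text behavior tika-pipes uses for search indexing.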
[jira] [Closed] (TIKA-4262) In pipes XML config, List serializes incorrectly, causing the parameters to be empty when read
[ https://issues.apache.org/jira/browse/TIKA-4262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas DiPiazza closed TIKA-4262. --- Assignee: Nicholas DiPiazza Resolution: Invalid never mind - this was an issue in my branch reproducing in a crazy way. > In pipes XML config, List serializes incorrectly, causing the parameters > to be empty when read > --- > > Key: TIKA-4262 > URL: https://issues.apache.org/jira/browse/TIKA-4262 > Project: Tika > Issue Type: Bug > Components: tika-pipes >Reporter: Nicholas DiPiazza >Assignee: Nicholas DiPiazza >Priority: Major > > tika configuration when saving a fetcher with a list of strings will look > like this: > [] > [Authorization: xyz123] > This is an invalid format. It's expecting them to be: > > > Authorization: xyz123 > > So the effect of this is that all List configs in fetchers are completely > ignored after being saved/re-read. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4262) In pipes XML config, List serializes incorrectly, causing the parameters to be empty when read
[ https://issues.apache.org/jira/browse/TIKA-4262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas DiPiazza updated TIKA-4262: Description: tika configuration when saving a fetcher with a list of strings will look like this: [] [Authorization: xyz123] This is an invalid format. It's expecting them to be: Authorization: xyz123 So the effect of this is that all List configs in fetchers are completely ignored after being saved/re-read. was: tika configuration when saving a fetcher with a list of strings will look like this: [] [Authorization: xyz123] These are invalid format. It's expecting them to be: Autorization: xyz123 > In pipes XML config, List serializes incorrectly, causing the parameters > to be empty when read > --- > > Key: TIKA-4262 > URL: https://issues.apache.org/jira/browse/TIKA-4262 > Project: Tika > Issue Type: Bug > Components: tika-pipes >Reporter: Nicholas DiPiazza >Priority: Major > > tika configuration when saving a fetcher with a list of strings will look > like this: > [] > [Authorization: xyz123] > This is an invalid format. It's expecting them to be: > > > Authorization: xyz123 > > So the effect of this is that all List configs in fetchers are completely > ignored after being saved/re-read. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-4262) In pipes XML config, List serializes incorrectly, causing the parameters to be empty when read
Nicholas DiPiazza created TIKA-4262: --- Summary: In pipes XML config, List serializes incorrectly, causing the parameters to be empty when read Key: TIKA-4262 URL: https://issues.apache.org/jira/browse/TIKA-4262 Project: Tika Issue Type: Bug Components: tika-pipes Reporter: Nicholas DiPiazza tika configuration when saving a fetcher with a list of strings will look like this: [] [Authorization: xyz123] This is an invalid format. It's expecting them to be: Authorization: xyz123 -- This message was sent by Atlassian Jira (v8.20.10#820010)
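Jira's renderer stripped the XML tags from the examples in this ticket, so the exact element names are lost. The following is a hypothetical reconstruction of the shape of the bug (the httpRequestHeaders/header element names are invented for illustration): the list's Java toString() representation gets written as a single text node instead of one element per entry.

```xml
<!-- what gets written (invalid): the whole List flattened into one text node -->
<httpRequestHeaders>[Authorization: xyz123]</httpRequestHeaders>

<!-- what the reader expects: one child element per list entry -->
<httpRequestHeaders>
  <header>Authorization: xyz123</header>
</httpRequestHeaders>
```

On re-read, the flattened text node parses as zero entries, which is why the saved list configs come back empty.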
[jira] [Commented] (TIKA-4243) tika configuration overhaul
[ https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848960#comment-17848960 ] Nicholas DiPiazza commented on TIKA-4243: - Sure that sounds good. When we chat later today/tomorrow let's discuss a high level plan here. I'll take my first stab at this Friday night > tika configuration overhaul > --- > > Key: TIKA-4243 > URL: https://issues.apache.org/jira/browse/TIKA-4243 > Project: Tika > Issue Type: New Feature > Components: config >Affects Versions: 3.0.0 >Reporter: Nicholas DiPiazza >Priority: Major > > In 3.0.0 when dealing with Tika, it would greatly help to have a Typed > Configuration schema. > In 3.x can we remove the old way of doing configs and replace with Json > Schema? > Json Schema can be converted to Pojos using a maven plugin > [https://github.com/joelittlejohn/jsonschema2pojo] > This automatically creates a Java Pojo model we can use for the configs. > This can allow for the legacy tika-config XML to be read and converted to the > new pojos easily using an XML mapper so that users don't have to use JSON > configurations yet if they do not want. > When complete, configurations can be set as XML, JSON or YAML > tika-config.xml > tika-config.json > tika-config.yaml > Replace all instances of tika config annotations that used the old syntax, > and replace with the Pojo model serialized from the xml/json/yaml. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?
[ https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17845083#comment-17845083 ] Nicholas DiPiazza commented on TIKA-4252: - even better > PipesClient#process - seems to lose the Fetch input metadata? > - > > Key: TIKA-4252 > URL: https://issues.apache.org/jira/browse/TIKA-4252 > Project: Tika > Issue Type: Bug >Reporter: Nicholas DiPiazza >Priority: Major > Fix For: 3.0.0 > > > when calling: > PipesResult pipesResult = pipesClient.process(new > FetchEmitTuple(request.getFetchKey(), > new FetchKey(fetcher.getName(), request.getFetchKey()), > new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG, > FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP)); > the tikaMetadata is not present in the fetch data when the fetch method is > called. > > It's OK through this part: > UnsynchronizedByteArrayOutputStream bos = > UnsynchronizedByteArrayOutputStream.builder().get(); > try (ObjectOutputStream objectOutputStream = new > ObjectOutputStream(bos)) > { objectOutputStream.writeObject(t); } > byte[] bytes = bos.toByteArray(); > output.write(CALL.getByte()); > output.writeInt(bytes.length); > output.write(bytes); > output.flush(); > > i verified the bytes have the expected metadata from that point. > > UPDATE: found issue > > org.apache.tika.pipes.PipesServer#parseFromTuple > > is using a new Metadata when it should only use empty metadata if fetch tuple > metadata is null. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?
[ https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17845080#comment-17845080 ] Nicholas DiPiazza commented on TIKA-4252: - Maybe fetchInputMetadata outputMetadata > PipesClient#process - seems to lose the Fetch input metadata? > - > > Key: TIKA-4252 > URL: https://issues.apache.org/jira/browse/TIKA-4252 > Project: Tika > Issue Type: Bug >Reporter: Nicholas DiPiazza >Priority: Major > Fix For: 3.0.0 > > > when calling: > PipesResult pipesResult = pipesClient.process(new > FetchEmitTuple(request.getFetchKey(), > new FetchKey(fetcher.getName(), request.getFetchKey()), > new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG, > FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP)); > the tikaMetadata is not present in the fetch data when the fetch method is > called. > > It's OK through this part: > UnsynchronizedByteArrayOutputStream bos = > UnsynchronizedByteArrayOutputStream.builder().get(); > try (ObjectOutputStream objectOutputStream = new > ObjectOutputStream(bos)) > { objectOutputStream.writeObject(t); } > byte[] bytes = bos.toByteArray(); > output.write(CALL.getByte()); > output.writeInt(bytes.length); > output.write(bytes); > output.flush(); > > i verified the bytes have the expected metadata from that point. > > UPDATE: found issue > > org.apache.tika.pipes.PipesServer#parseFromTuple > > is using a new Metadata when it should only use empty metadata if fetch tuple > metadata is null. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?
[ https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17845071#comment-17845071 ] Nicholas DiPiazza commented on TIKA-4252: - sure I can do that. > PipesClient#process - seems to lose the Fetch input metadata? > - > > Key: TIKA-4252 > URL: https://issues.apache.org/jira/browse/TIKA-4252 > Project: Tika > Issue Type: Bug >Reporter: Nicholas DiPiazza >Priority: Major > Fix For: 3.0.0 > > > when calling: > PipesResult pipesResult = pipesClient.process(new > FetchEmitTuple(request.getFetchKey(), > new FetchKey(fetcher.getName(), request.getFetchKey()), > new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG, > FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP)); > the tikaMetadata is not present in the fetch data when the fetch method is > called. > > It's OK through this part: > UnsynchronizedByteArrayOutputStream bos = > UnsynchronizedByteArrayOutputStream.builder().get(); > try (ObjectOutputStream objectOutputStream = new > ObjectOutputStream(bos)) > { objectOutputStream.writeObject(t); } > byte[] bytes = bos.toByteArray(); > output.write(CALL.getByte()); > output.writeInt(bytes.length); > output.write(bytes); > output.flush(); > > i verified the bytes have the expected metadata from that point. > > UPDATE: found issue > > org.apache.tika.pipes.PipesServer#parseFromTuple > > is using a new Metadata when it should only use empty metadata if fetch tuple > metadata is null. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?
[ https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17845071#comment-17845071 ] Nicholas DiPiazza edited comment on TIKA-4252 at 5/9/24 5:08 PM: - sure I can do that. if you have a moment please do otherwise will get to it later in week next week was (Author: ndipiazza): sure I can do that. > PipesClient#process - seems to lose the Fetch input metadata? > - > > Key: TIKA-4252 > URL: https://issues.apache.org/jira/browse/TIKA-4252 > Project: Tika > Issue Type: Bug >Reporter: Nicholas DiPiazza >Priority: Major > Fix For: 3.0.0 > > > when calling: > PipesResult pipesResult = pipesClient.process(new > FetchEmitTuple(request.getFetchKey(), > new FetchKey(fetcher.getName(), request.getFetchKey()), > new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG, > FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP)); > the tikaMetadata is not present in the fetch data when the fetch method is > called. > > It's OK through this part: > UnsynchronizedByteArrayOutputStream bos = > UnsynchronizedByteArrayOutputStream.builder().get(); > try (ObjectOutputStream objectOutputStream = new > ObjectOutputStream(bos)) > { objectOutputStream.writeObject(t); } > byte[] bytes = bos.toByteArray(); > output.write(CALL.getByte()); > output.writeInt(bytes.length); > output.write(bytes); > output.flush(); > > i verified the bytes have the expected metadata from that point. > > UPDATE: found issue > > org.apache.tika.pipes.PipesServer#parseFromTuple > > is using a new Metadata when it should only use empty metadata if fetch tuple > metadata is null. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?
[ https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17845061#comment-17845061 ] Nicholas DiPiazza edited comment on TIKA-4252 at 5/9/24 4:50 PM: - What I need is to be able to send "Fetch Metadata" such as a bearer token to a single request per-fetch-request variable. was (Author: ndipiazza): What I need is to be able to send "Fetch Metadata" such as a bearer token to a single request per-fetch-request varaible > PipesClient#process - seems to lose the Fetch input metadata? > - > > Key: TIKA-4252 > URL: https://issues.apache.org/jira/browse/TIKA-4252 > Project: Tika > Issue Type: Bug >Reporter: Nicholas DiPiazza >Priority: Major > Fix For: 3.0.0 > > > when calling: > PipesResult pipesResult = pipesClient.process(new > FetchEmitTuple(request.getFetchKey(), > new FetchKey(fetcher.getName(), request.getFetchKey()), > new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG, > FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP)); > the tikaMetadata is not present in the fetch data when the fetch method is > called. > > It's OK through this part: > UnsynchronizedByteArrayOutputStream bos = > UnsynchronizedByteArrayOutputStream.builder().get(); > try (ObjectOutputStream objectOutputStream = new > ObjectOutputStream(bos)) > { objectOutputStream.writeObject(t); } > byte[] bytes = bos.toByteArray(); > output.write(CALL.getByte()); > output.writeInt(bytes.length); > output.write(bytes); > output.flush(); > > i verified the bytes have the expected metadata from that point. > > UPDATE: found issue > > org.apache.tika.pipes.PipesServer#parseFromTuple > > is using a new Metadata when it should only use empty metadata if fetch tuple > metadata is null. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?
[ https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17845061#comment-17845061 ] Nicholas DiPiazza edited comment on TIKA-4252 at 5/9/24 4:50 PM: - What I need is to be able to send "Fetch Metadata" such as a bearer token to a single fetch() request per-fetch-request variable. was (Author: ndipiazza): What I need is to be able to send "Fetch Metadata" such as a bearer token to a single request per-fetch-request variable. > PipesClient#process - seems to lose the Fetch input metadata? > - > > Key: TIKA-4252 > URL: https://issues.apache.org/jira/browse/TIKA-4252 > Project: Tika > Issue Type: Bug >Reporter: Nicholas DiPiazza >Priority: Major > Fix For: 3.0.0 > > > when calling: > PipesResult pipesResult = pipesClient.process(new > FetchEmitTuple(request.getFetchKey(), > new FetchKey(fetcher.getName(), request.getFetchKey()), > new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG, > FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP)); > the tikaMetadata is not present in the fetch data when the fetch method is > called. > > It's OK through this part: > UnsynchronizedByteArrayOutputStream bos = > UnsynchronizedByteArrayOutputStream.builder().get(); > try (ObjectOutputStream objectOutputStream = new > ObjectOutputStream(bos)) > { objectOutputStream.writeObject(t); } > byte[] bytes = bos.toByteArray(); > output.write(CALL.getByte()); > output.writeInt(bytes.length); > output.write(bytes); > output.flush(); > > i verified the bytes have the expected metadata from that point. > > UPDATE: found issue > > org.apache.tika.pipes.PipesServer#parseFromTuple > > is using a new Metadata when it should only use empty metadata if fetch tuple > metadata is null. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?
[ https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17845061#comment-17845061 ] Nicholas DiPiazza commented on TIKA-4252: - What I need is to be able to send "Fetch Metadata" such as a bearer token to a single request per-fetch-request variable > PipesClient#process - seems to lose the Fetch input metadata? > - > > Key: TIKA-4252 > URL: https://issues.apache.org/jira/browse/TIKA-4252 > Project: Tika > Issue Type: Bug >Reporter: Nicholas DiPiazza >Priority: Major > Fix For: 3.0.0 > > > when calling: > PipesResult pipesResult = pipesClient.process(new > FetchEmitTuple(request.getFetchKey(), > new FetchKey(fetcher.getName(), request.getFetchKey()), > new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG, > FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP)); > the tikaMetadata is not present in the fetch data when the fetch method is > called. > > It's OK through this part: > UnsynchronizedByteArrayOutputStream bos = > UnsynchronizedByteArrayOutputStream.builder().get(); > try (ObjectOutputStream objectOutputStream = new > ObjectOutputStream(bos)) > { objectOutputStream.writeObject(t); } > byte[] bytes = bos.toByteArray(); > output.write(CALL.getByte()); > output.writeInt(bytes.length); > output.write(bytes); > output.flush(); > > i verified the bytes have the expected metadata from that point. > > UPDATE: found issue > > org.apache.tika.pipes.PipesServer#parseFromTuple > > is using a new Metadata when it should only use empty metadata if fetch tuple > metadata is null. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?
[ https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas DiPiazza closed TIKA-4252. --- Fix Version/s: 3.0.0 Resolution: Fixed > PipesClient#process - seems to lose the Fetch input metadata? > - > > Key: TIKA-4252 > URL: https://issues.apache.org/jira/browse/TIKA-4252 > Project: Tika > Issue Type: Bug >Reporter: Nicholas DiPiazza >Priority: Major > Fix For: 3.0.0 > > > when calling: > PipesResult pipesResult = pipesClient.process(new > FetchEmitTuple(request.getFetchKey(), > new FetchKey(fetcher.getName(), request.getFetchKey()), > new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG, > FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP)); > the tikaMetadata is not present in the fetch data when the fetch method is > called. > > It's OK through this part: > UnsynchronizedByteArrayOutputStream bos = > UnsynchronizedByteArrayOutputStream.builder().get(); > try (ObjectOutputStream objectOutputStream = new > ObjectOutputStream(bos)) > { objectOutputStream.writeObject(t); } > byte[] bytes = bos.toByteArray(); > output.write(CALL.getByte()); > output.writeInt(bytes.length); > output.write(bytes); > output.flush(); > > i verified the bytes have the expected metadata from that point. > > UPDATE: found issue > > org.apache.tika.pipes.PipesServer#parseFromTuple > > is using a new Metadata when it should only use empty metadata if fetch tuple > metadata is null. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?
[ https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17845010#comment-17845010 ] Nicholas DiPiazza commented on TIKA-4252: - done > PipesClient#process - seems to lose the Fetch input metadata? > - > > Key: TIKA-4252 > URL: https://issues.apache.org/jira/browse/TIKA-4252 > Project: Tika > Issue Type: Bug >Reporter: Nicholas DiPiazza >Priority: Major > Fix For: 3.0.0 > > > when calling: > PipesResult pipesResult = pipesClient.process(new > FetchEmitTuple(request.getFetchKey(), > new FetchKey(fetcher.getName(), request.getFetchKey()), > new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG, > FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP)); > the tikaMetadata is not present in the fetch data when the fetch method is > called. > > It's OK through this part: > UnsynchronizedByteArrayOutputStream bos = > UnsynchronizedByteArrayOutputStream.builder().get(); > try (ObjectOutputStream objectOutputStream = new > ObjectOutputStream(bos)) > { objectOutputStream.writeObject(t); } > byte[] bytes = bos.toByteArray(); > output.write(CALL.getByte()); > output.writeInt(bytes.length); > output.write(bytes); > output.flush(); > > i verified the bytes have the expected metadata from that point. > > UPDATE: found issue > > org.apache.tika.pipes.PipesServer#parseFromTuple > > is using a new Metadata when it should only use empty metadata if fetch tuple > metadata is null. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?
[ https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas DiPiazza updated TIKA-4252: Description: when calling: PipesResult pipesResult = pipesClient.process(new FetchEmitTuple(request.getFetchKey(), new FetchKey(fetcher.getName(), request.getFetchKey()), new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG, FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP)); the tikaMetadata is not present in the fetch data when the fetch method is called. It's OK through this part: UnsynchronizedByteArrayOutputStream bos = UnsynchronizedByteArrayOutputStream.builder().get(); try (ObjectOutputStream objectOutputStream = new ObjectOutputStream(bos)) { objectOutputStream.writeObject(t); } byte[] bytes = bos.toByteArray(); output.write(CALL.getByte()); output.writeInt(bytes.length); output.write(bytes); output.flush(); i verified the bytes have the expected metadata from that point. UPDATE: found issue org.apache.tika.pipes.PipesServer#parseFromTuple is using a new Metadata when it should only use empty metadata if fetch tuple metadata is null. was: when calling: PipesResult pipesResult = pipesClient.process(new FetchEmitTuple(request.getFetchKey(), new FetchKey(fetcher.getName(), request.getFetchKey()), new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG, FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP)); the tikaMetadata is not present in the fetch data when the fetch method is called. It's OK through this part: UnsynchronizedByteArrayOutputStream bos = UnsynchronizedByteArrayOutputStream.builder().get(); try (ObjectOutputStream objectOutputStream = new ObjectOutputStream(bos)) { objectOutputStream.writeObject(t); } byte[] bytes = bos.toByteArray(); output.write(CALL.getByte()); output.writeInt(bytes.length); output.write(bytes); output.flush(); i verified the bytes have the expected metadata from that point. > PipesClient#process - seems to lose the Fetch input metadata? 
> - > > Key: TIKA-4252 > URL: https://issues.apache.org/jira/browse/TIKA-4252 > Project: Tika > Issue Type: Bug >Reporter: Nicholas DiPiazza >Priority: Major > > when calling: > PipesResult pipesResult = pipesClient.process(new > FetchEmitTuple(request.getFetchKey(), > new FetchKey(fetcher.getName(), request.getFetchKey()), > new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG, > FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP)); > the tikaMetadata is not present in the fetch data when the fetch method is > called. > > It's OK through this part: > UnsynchronizedByteArrayOutputStream bos = > UnsynchronizedByteArrayOutputStream.builder().get(); > try (ObjectOutputStream objectOutputStream = new > ObjectOutputStream(bos)) > { objectOutputStream.writeObject(t); } > byte[] bytes = bos.toByteArray(); > output.write(CALL.getByte()); > output.writeInt(bytes.length); > output.write(bytes); > output.flush(); > > i verified the bytes have the expected metadata from that point. > > UPDATE: found issue > > org.apache.tika.pipes.PipesServer#parseFromTuple > > is using a new Metadata when it should only use empty metadata if fetch tuple > metadata is null. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?
[ https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas DiPiazza updated TIKA-4252: Description: when calling: PipesResult pipesResult = pipesClient.process(new FetchEmitTuple(request.getFetchKey(), new FetchKey(fetcher.getName(), request.getFetchKey()), new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG, FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP)); the tikaMetadata is not present in the fetch data when the fetch method is called. It's OK through this part: UnsynchronizedByteArrayOutputStream bos = UnsynchronizedByteArrayOutputStream.builder().get(); try (ObjectOutputStream objectOutputStream = new ObjectOutputStream(bos)) { objectOutputStream.writeObject(t); } byte[] bytes = bos.toByteArray(); output.write(CALL.getByte()); output.writeInt(bytes.length); output.write(bytes); output.flush(); i verified the bytes have the expected metadata from that point. was: when calling: PipesResult pipesResult = pipesClient.process(new FetchEmitTuple(request.getFetchKey(), new FetchKey(fetcher.getName(), request.getFetchKey()), new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG, FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP)); the tikaMetadata is not present in the fetch data when the fetch method is called. > PipesClient#process - seems to lose the Fetch input metadata? > - > > Key: TIKA-4252 > URL: https://issues.apache.org/jira/browse/TIKA-4252 > Project: Tika > Issue Type: Bug >Reporter: Nicholas DiPiazza >Priority: Major > > when calling: > PipesResult pipesResult = pipesClient.process(new > FetchEmitTuple(request.getFetchKey(), > new FetchKey(fetcher.getName(), request.getFetchKey()), > new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG, > FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP)); > the tikaMetadata is not present in the fetch data when the fetch method is > called. 
> > It's OK through this part: > UnsynchronizedByteArrayOutputStream bos = > UnsynchronizedByteArrayOutputStream.builder().get(); > try (ObjectOutputStream objectOutputStream = new > ObjectOutputStream(bos)) { > objectOutputStream.writeObject(t); > } > byte[] bytes = bos.toByteArray(); > output.write(CALL.getByte()); > output.writeInt(bytes.length); > output.write(bytes); > output.flush(); > > i verified the bytes have the expected metadata from that point. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?
Nicholas DiPiazza created TIKA-4252: --- Summary: PipesClient#process - seems to lose the Fetch input metadata? Key: TIKA-4252 URL: https://issues.apache.org/jira/browse/TIKA-4252 Project: Tika Issue Type: Bug Reporter: Nicholas DiPiazza when calling: PipesResult pipesResult = pipesClient.process(new FetchEmitTuple(request.getFetchKey(), new FetchKey(fetcher.getName(), request.getFetchKey()), new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG, FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP)); the tikaMetadata is not present in the fetch data when the fetch method is called. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4243) tika configuration overhaul
[ https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842622#comment-17842622 ] Nicholas DiPiazza commented on TIKA-4243: - Kinda seems like it might belong in tika-config module > tika configuration overhaul > --- > > Key: TIKA-4243 > URL: https://issues.apache.org/jira/browse/TIKA-4243 > Project: Tika > Issue Type: New Feature > Components: config >Affects Versions: 3.0.0 >Reporter: Nicholas DiPiazza >Priority: Major > > In 3.0.0 when dealing with Tika, it would greatly help to have a Typed > Configuration schema. > In 3.x can we remove the old way of doing configs and replace with Json > Schema? > Json Schema can be converted to Pojos using a maven plugin > [https://github.com/joelittlejohn/jsonschema2pojo] > This automatically creates a Java Pojo model we can use for the configs. > This can allow for the legacy tika-config XML to be read and converted to the > new pojos easily using an XML mapper so that users don't have to use JSON > configurations yet if they do not want. > When complete, configurations can be set as XML, JSON or YAML > tika-config.xml > tika-config.json > tika-config.yaml > Replace all instances of tika config annotations that used the old syntax, > and replace with the Pojo model serialized from the xml/json/yaml. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (TIKA-4243) tika configuration overhaul
[ https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842622#comment-17842622 ] Nicholas DiPiazza edited comment on TIKA-4243 at 5/1/24 12:34 PM: -- Kinda seems like it might belong in a new tika-config module was (Author: ndipiazza): Kinda seems like it might belong in tika-config module > tika configuration overhaul > --- > > Key: TIKA-4243 > URL: https://issues.apache.org/jira/browse/TIKA-4243 > Project: Tika > Issue Type: New Feature > Components: config >Affects Versions: 3.0.0 >Reporter: Nicholas DiPiazza >Priority: Major > > In 3.0.0 when dealing with Tika, it would greatly help to have a Typed > Configuration schema. > In 3.x can we remove the old way of doing configs and replace with Json > Schema? > Json Schema can be converted to Pojos using a maven plugin > [https://github.com/joelittlejohn/jsonschema2pojo] > This automatically creates a Java Pojo model we can use for the configs. > This can allow for the legacy tika-config XML to be read and converted to the > new pojos easily using an XML mapper so that users don't have to use JSON > configurations yet if they do not want. > When complete, configurations can be set as XML, JSON or YAML > tika-config.xml > tika-config.json > tika-config.yaml > Replace all instances of tika config annotations that used the old syntax, > and replace with the Pojo model serialized from the xml/json/yaml. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (TIKA-4243) tika configuration overhaul
[ https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842158#comment-17842158 ] Nicholas DiPiazza edited comment on TIKA-4243 at 4/29/24 8:56 PM: -- this seems like a major feature thing so i would recommend having it go with the tika 3.0.0 release makes sense if the tika 2.0.0 stays compatible was (Author: ndipiazza): this seems like a major feature thing so i would recommend with tika 3.x > tika configuration overhaul > --- > > Key: TIKA-4243 > URL: https://issues.apache.org/jira/browse/TIKA-4243 > Project: Tika > Issue Type: New Feature > Components: config >Affects Versions: 3.0.0 >Reporter: Nicholas DiPiazza >Priority: Major > > In 3.0.0 when dealing with Tika, it would greatly help to have a Typed > Configuration schema. > In 3.x can we remove the old way of doing configs and replace with Json > Schema? > Json Schema can be converted to Pojos using a maven plugin > [https://github.com/joelittlejohn/jsonschema2pojo] > This automatically creates a Java Pojo model we can use for the configs. > This can allow for the legacy tika-config XML to be read and converted to the > new pojos easily using an XML mapper so that users don't have to use JSON > configurations yet if they do not want. > When complete, configurations can be set as XML, JSON or YAML > tika-config.xml > tika-config.json > tika-config.yaml > Replace all instances of tika config annotations that used the old syntax, > and replace with the Pojo model serialized from the xml/json/yaml. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4243) tika configuration overhaul
[ https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842158#comment-17842158 ] Nicholas DiPiazza commented on TIKA-4243: - this seems like a major feature thing so i would recommend with tika 3.x > tika configuration overhaul > --- > > Key: TIKA-4243 > URL: https://issues.apache.org/jira/browse/TIKA-4243 > Project: Tika > Issue Type: New Feature > Components: config >Affects Versions: 3.0.0 >Reporter: Nicholas DiPiazza >Priority: Major > > In 3.0.0 when dealing with Tika, it would greatly help to have a Typed > Configuration schema. > In 3.x can we remove the old way of doing configs and replace with Json > Schema? > Json Schema can be converted to Pojos using a maven plugin > [https://github.com/joelittlejohn/jsonschema2pojo] > This automatically creates a Java Pojo model we can use for the configs. > This can allow for the legacy tika-config XML to be read and converted to the > new pojos easily using an XML mapper so that users don't have to use JSON > configurations yet if they do not want. > When complete, configurations can be set as XML, JSON or YAML > tika-config.xml > tika-config.json > tika-config.yaml > Replace all instances of tika config annotations that used the old syntax, > and replace with the Pojo model serialized from the xml/json/yaml. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4243) tika configuration overhaul
[ https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842157#comment-17842157 ] Nicholas DiPiazza commented on TIKA-4243: - [https://github.com/joelittlejohn/jsonschema2pojo |https://github.com/joelittlejohn/jsonschema2pojo] makes it so we can just author some .json schema files in *src/main/jsonschema* and it will automatically create Java files that are part of the classpath It cuts down on unnecessary plumbing code by having to maintain both a JSON Schema file and Pojo. So we get the benefits of JSON schema validation, and automatically generated pojos. > tika configuration overhaul > --- > > Key: TIKA-4243 > URL: https://issues.apache.org/jira/browse/TIKA-4243 > Project: Tika > Issue Type: New Feature > Components: config >Affects Versions: 3.0.0 >Reporter: Nicholas DiPiazza >Priority: Major > > In 3.0.0 when dealing with Tika, it would greatly help to have a Typed > Configuration schema. > In 3.x can we remove the old way of doing configs and replace with Json > Schema? > Json Schema can be converted to Pojos using a maven plugin > [https://github.com/joelittlejohn/jsonschema2pojo] > This automatically creates a Java Pojo model we can use for the configs. > This can allow for the legacy tika-config XML to be read and converted to the > new pojos easily using an XML mapper so that users don't have to use JSON > configurations yet if they do not want. > When complete, configurations can be set as XML, JSON or YAML > tika-config.xml > tika-config.json > tika-config.yaml > Replace all instances of tika config annotations that used the old syntax, > and replace with the Pojo model serialized from the xml/json/yaml. -- This message was sent by Atlassian Jira (v8.20.10#820010)
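The jsonschema2pojo wiring described above (schemas in *src/main/jsonschema*, POJOs generated onto the classpath) could look roughly like this in a module's pom.xml. The plugin coordinates and `generate` goal are the plugin's real ones; the version and target package shown are illustrative, not a decided Tika configuration.

```xml
<!-- Sketch only: version and targetPackage are illustrative. -->
<plugin>
  <groupId>org.jsonschema2pojo</groupId>
  <artifactId>jsonschema2pojo-maven-plugin</artifactId>
  <version>1.2.1</version>
  <configuration>
    <sourceDirectory>${basedir}/src/main/jsonschema</sourceDirectory>
    <targetPackage>org.apache.tika.config.generated</targetPackage>
  </configuration>
  <executions>
    <execution>
      <goals>
        <goal>generate</goal>
      </goals>
    </execution>
  </executions>
</plugin>
```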
[jira] [Created] (TIKA-4247) HttpFetcher - add ability to send request headers
Nicholas DiPiazza created TIKA-4247: --- Summary: HttpFetcher - add ability to send request headers Key: TIKA-4247 URL: https://issues.apache.org/jira/browse/TIKA-4247 Project: Tika Issue Type: New Feature Reporter: Nicholas DiPiazza add ability to send request headers -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-4243) tika configuration overhaul
Nicholas DiPiazza created TIKA-4243: --- Summary: tika configuration overhaul Key: TIKA-4243 URL: https://issues.apache.org/jira/browse/TIKA-4243 Project: Tika Issue Type: New Feature Components: config Affects Versions: 3.0.0 Reporter: Nicholas DiPiazza In 3.0.0 when dealing with Tika, it would greatly help to have a Typed Configuration schema. In 3.x can we remove the old way of doing configs and replace with Json Schema? Json Schema can be converted to Pojos using a maven plugin [https://github.com/joelittlejohn/jsonschema2pojo] This automatically creates a Java Pojo model we can use for the configs. This can allow for the legacy tika-config XML to be read and converted to the new pojos easily using an XML mapper so that users don't have to use JSON configurations yet if they do not want. When complete, configurations can be set as XML, JSON or YAML tika-config.xml tika-config.json tika-config.yaml Replace all instances of tika config annotations that used the old syntax, and replace with the Pojo model serialized from the xml/json/yaml. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4243) tika configuration overhaul
[ https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas DiPiazza updated TIKA-4243: Description: In 3.0.0 when dealing with Tika, it would greatly help to have a Typed Configuration schema. In 3.x can we remove the old way of doing configs and replace with Json Schema? Json Schema can be converted to Pojos using a maven plugin [https://github.com/joelittlejohn/jsonschema2pojo] This automatically creates a Java Pojo model we can use for the configs. This can allow for the legacy tika-config XML to be read and converted to the new pojos easily using an XML mapper so that users don't have to use JSON configurations yet if they do not want. When complete, configurations can be set as XML, JSON or YAML tika-config.xml tika-config.json tika-config.yaml Replace all instances of tika config annotations that used the old syntax, and replace with the Pojo model serialized from the xml/json/yaml. was: In 3.0.0 when dealing with Tika, it would greatly help to have a Typed Configuration schema. In 3.x can we remove the old way of doing configs and replace with Json Schema? Json Schema can be converted to Pojos using a maven plugin [https://github.com/joelittlejohn/jsonschema2pojo] This automatically creates a Java Pojo model we can use for the configs. This can allow for the legacy tika-config XML to be read and converted to the new pojos easily using an XML mapper so that users don't have to use JSON configurations yet if they do not want. When complete, configurations can be set as XML, JSON or YAML tika-config.xml tika-config.json tika-config.yaml Replace all instances of tika config annotations that used the old syntax, and replace with the Pojo model serialized from the xml/json/yaml. 
> tika configuration overhaul > --- > > Key: TIKA-4243 > URL: https://issues.apache.org/jira/browse/TIKA-4243 > Project: Tika > Issue Type: New Feature > Components: config >Affects Versions: 3.0.0 >Reporter: Nicholas DiPiazza >Priority: Major > > In 3.0.0 when dealing with Tika, it would greatly help to have a Typed > Configuration schema. > In 3.x can we remove the old way of doing configs and replace with Json > Schema? > Json Schema can be converted to Pojos using a maven plugin > [https://github.com/joelittlejohn/jsonschema2pojo] > This automatically creates a Java Pojo model we can use for the configs. > This can allow for the legacy tika-config XML to be read and converted to the > new pojos easily using an XML mapper so that users don't have to use JSON > configurations yet if they do not want. > When complete, configurations can be set as XML, JSON or YAML > tika-config.xml > tika-config.json > tika-config.yaml > Replace all instances of tika config annotations that used the old syntax, > and replace with the Pojo model serialized from the xml/json/yaml. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-4237) Add JWT authentication ability to the http fetcher
Nicholas DiPiazza created TIKA-4237: --- Summary: Add JWT authentication ability to the http fetcher Key: TIKA-4237 URL: https://issues.apache.org/jira/browse/TIKA-4237 Project: Tika Issue Type: New Feature Components: tika-pipes Affects Versions: 3.0.0-BETA Reporter: Nicholas DiPiazza Add the ability to supply JWT authentication, supporting both HS256 and RS256. -- This message was sent by Atlassian Jira (v8.20.10#820010)
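The HS256 half of the JWT support requested above can be sketched with nothing but the JDK (HMAC-SHA256 plus base64url). The claim and secret below are illustrative, and a real fetcher would more likely use a JWT library than hand-roll the encoding:

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class JwtHs256Sketch {
    // Builds a compact JWS: base64url(header).base64url(payload).base64url(hmac)
    static String sign(String payloadJson, byte[] secret) throws Exception {
        Base64.Encoder b64 = Base64.getUrlEncoder().withoutPadding();
        String header = b64.encodeToString(
                "{\"alg\":\"HS256\",\"typ\":\"JWT\"}".getBytes(StandardCharsets.UTF_8));
        String payload = b64.encodeToString(payloadJson.getBytes(StandardCharsets.UTF_8));
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(secret, "HmacSHA256"));
        byte[] sig = mac.doFinal((header + "." + payload).getBytes(StandardCharsets.UTF_8));
        return header + "." + payload + "." + b64.encodeToString(sig);
    }

    public static void main(String[] args) throws Exception {
        String token = sign("{\"sub\":\"tika-http-fetcher\"}",
                "example-secret".getBytes(StandardCharsets.UTF_8));
        // The fetcher would then send: Authorization: Bearer <token>
        System.out.println(token);
    }
}
```

RS256 would follow the same three-part shape but sign with `java.security.Signature` ("SHA256withRSA") and a private key instead of an HMAC.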
[jira] [Created] (TIKA-4229) add microsoft graph fetcher
Nicholas DiPiazza created TIKA-4229: --- Summary: add microsoft graph fetcher Key: TIKA-4229 URL: https://issues.apache.org/jira/browse/TIKA-4229 Project: Tika Issue Type: New Feature Components: tika-pipes Reporter: Nicholas DiPiazza add a tika pipes fetcher capable of fetching files from MS graph api -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4181) Grpc + Tika Pipes - pipe iterator and emitter
[ https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas DiPiazza updated TIKA-4181: Attachment: image-2024-02-06-07-54-50-116.png > Grpc + Tika Pipes - pipe iterator and emitter > - > > Key: TIKA-4181 > URL: https://issues.apache.org/jira/browse/TIKA-4181 > Project: Tika > Issue Type: New Feature > Components: tika-pipes >Reporter: Nicholas DiPiazza >Priority: Major > Attachments: image-2024-02-06-07-54-50-116.png > > > Add full tika-pipes support of grpc > * pipe iterator > * fetcher > * emitter > Requires we create a service contract that specifies the inputs we require > from each method. > Then we will need to implement the different components with a grpc client > generated using the contract. > This would enable developers to run tika-pipes as a persistently running > daemon instead of just a single batch app, because it can continue to stream > out more inputs. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4181) Grpc + Tika Pipes - pipe iterator and emitter
[ https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas DiPiazza updated TIKA-4181: Description: Add full tika-pipes support of grpc * pipe iterator * fetcher * emitter Requires we create a service contract that specifies the inputs we require from each method. Then we will need to implement the different components with a grpc client generated using the contract. This would enable developers to run tika-pipes as a persistently running daemon instead of just a single batch app, because it can continue to stream out more inputs. !image-2024-02-06-07-54-50-116.png! was: Add full tika-pipes support of grpc * pipe iterator * fetcher * emitter Requires we create a service contract that specifies the inputs we require from each method. Then we will need to implement the different components with a grpc client generated using the contract. This would enable developers to run tika-pipes as a persistently running daemon instead of just a single batch app, because it can continue to stream out more inputs. > Grpc + Tika Pipes - pipe iterator and emitter > - > > Key: TIKA-4181 > URL: https://issues.apache.org/jira/browse/TIKA-4181 > Project: Tika > Issue Type: New Feature > Components: tika-pipes >Reporter: Nicholas DiPiazza >Priority: Major > Attachments: image-2024-02-06-07-54-50-116.png > > > Add full tika-pipes support of grpc > * pipe iterator > * fetcher > * emitter > Requires we create a service contract that specifies the inputs we require > from each method. > Then we will need to implement the different components with a grpc client > generated using the contract. > This would enable developers to run tika-pipes as a persistently running > daemon instead of just a single batch app, because it can continue to stream > out more inputs. > !image-2024-02-06-07-54-50-116.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (TIKA-4181) Grpc + Tika Pipes - pipe iterator and emitter
[ https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17805762#comment-17805762 ] Nicholas DiPiazza edited comment on TIKA-4181 at 1/11/24 6:25 PM: -- Tika pipes could get a full fledged service that could be tika-server-http2 to accompany tika-server and maybe one day replace it? Not sure the best way to handle packaging the app, but we could create a secondary main method for running the tika-pipes as a grpc service. Then we would create a protobuf contract for each of the new services that we do: * pipe crud operations - create, update, delete, read, list, etc * run a pipe job - takes bidirectional streams of data - incoming=fetch metadata objects, outgoing=emitDocuments ** this will use a configured fetcher So you would then provide a Go example and Java example generated from our protobuf schema. that people could take and use was (Author: ndipiazza): Tika pipes could get a full fledged service that could be tika-server-http2 to accompany tika-server and maybe one day replace it? Not sure the best way to handle packaging the app, but we could create a secondary main method for running the tika-pipes as a grpc service. Then we would create a protobuf contract for each of the new services that we do: * pipe crud operations - create, update, delete, read, list, etc * run a pipe job - takes bidirectional streams of data - incoming=fetch metadata objects, outgoing=emitDocuments So you would then provide a Go example and Java example generated from our protobuf schema. 
that people could take and use > Grpc + Tika Pipes - pipe iterator and emitter > - > > Key: TIKA-4181 > URL: https://issues.apache.org/jira/browse/TIKA-4181 > Project: Tika > Issue Type: New Feature > Components: tika-pipes >Reporter: Nicholas DiPiazza >Priority: Major > > Add full tika-pipes support of grpc > * pipe iterator > * fetcher > * emitter > Requires we create a service contract that specifies the inputs we require > from each method. > Then we will need to implement the different components with a grpc client > generated using the contract. > This would enable developers to run tika-pipes as a persistently running > daemon instead of just a single batch app, because it can continue to stream > out more inputs. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4181) Grpc + Tika Pipes - pipe iterator and emitter
[ https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17805762#comment-17805762 ] Nicholas DiPiazza commented on TIKA-4181: - Tika pipes could get a full fledged service that could be tika-server-http2 to accompany tika-server and maybe one day replace it? Not sure the best way to handle packaging the app, but we could create a secondary main method for running the tika-pipes as a grpc service. Then we would create a protobuf contract for each of the new services that we do: * pipe crud operations - create, update, delete, read, list, etc * run a pipe job - takes bidirectional streams of data - incoming=fetch metadata objects, outgoing=emitDocuments So you would then provide a Go example and Java example generated from our protobuf schema. that people could take and use > Grpc + Tika Pipes - pipe iterator and emitter > - > > Key: TIKA-4181 > URL: https://issues.apache.org/jira/browse/TIKA-4181 > Project: Tika > Issue Type: New Feature > Components: tika-pipes >Reporter: Nicholas DiPiazza >Priority: Major > > Add full tika-pipes support of grpc > * pipe iterator > * fetcher > * emitter > Requires we create a service contract that specifies the inputs we require > from each method. > Then we will need to implement the different components with a grpc client > generated using the contract. > This would enable developers to run tika-pipes as a persistently running > daemon instead of just a single batch app, because it can continue to stream > out more inputs. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
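The service contract sketched in the comment above (CRUD on pipe components plus a bidirectional streaming job) might look something like the following protobuf. All message, field, and service names here are hypothetical stand-ins, not the contract Tika ultimately shipped:

```proto
// Illustrative sketch only - names are hypothetical.
syntax = "proto3";

package tika.pipes;

message SaveFetcherRequest {
  string fetcher_id = 1;
  string plugin_id = 2;
  map<string, string> config = 3;
}
message SaveFetcherReply { string fetcher_id = 1; }

message FetchRequest {
  string fetcher_id = 1;
  string fetch_key = 2;
  map<string, string> metadata = 3;
}
message EmitDocument {
  string fetch_key = 1;
  map<string, string> fields = 2;
}

service Pipes {
  // CRUD on pipe components (create/update/delete/read/list).
  rpc SaveFetcher(SaveFetcherRequest) returns (SaveFetcherReply);
  // Run a pipe job: fetch-metadata objects stream in, emitted documents stream out.
  rpc RunPipeJob(stream FetchRequest) returns (stream EmitDocument);
}
```

Generating the Go and Java examples mentioned above would then just be a matter of running `protoc` with the respective language plugins against this schema.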
[jira] [Updated] (TIKA-4181) Grpc + Tika Pipes - pipe iterator and emitter
[ https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas DiPiazza updated TIKA-4181: Description: Add full tika-pipes support of grpc * pipe iterator * fetcher * emitter Requires we create a service contract that specifies the inputs we require from each method. Then we will need to implement the different components with a grpc client generated using the contract. This would enable developers to run tika-pipes as a persistently running daemon instead of just a single batch app, because it can continue to stream out more inputs. was: Add full tika-pipes support of grpc * pipe iterator * fetcher * emitter Requires we create a service contract that specifies the inputs we require from each method. Then we will need to implement the different components with a grpc client generated using the contract. > Grpc + Tika Pipes - pipe iterator and emitter > - > > Key: TIKA-4181 > URL: https://issues.apache.org/jira/browse/TIKA-4181 > Project: Tika > Issue Type: New Feature > Components: tika-pipes >Reporter: Nicholas DiPiazza >Priority: Major > > Add full tika-pipes support of grpc > * pipe iterator > * fetcher > * emitter > Requires we create a service contract that specifies the inputs we require > from each method. > Then we will need to implement the different components with a grpc client > generated using the contract. > This would enable developers to run tika-pipes as a persistently running > daemon instead of just a single batch app, because it can continue to stream > out more inputs. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-4181) Grpc + Tika Pipes - pipe iterator and emitter
Nicholas DiPiazza created TIKA-4181: --- Summary: Grpc + Tika Pipes - pipe iterator and emitter Key: TIKA-4181 URL: https://issues.apache.org/jira/browse/TIKA-4181 Project: Tika Issue Type: New Feature Components: tika-pipes Reporter: Nicholas DiPiazza Add full tika-pipes support of grpc * pipe iterator * fetcher * emitter Requires we create a service contract that specifies the inputs we require from each method. Then we will need to implement the different components with a grpc client generated using the contract. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-3979) OneNoteParser - Improve performance for deserialization
[ https://issues.apache.org/jira/browse/TIKA-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas DiPiazza updated TIKA-3979: Attachment: image-2023-02-25-12-01-40-311.png > OneNoteParser - Improve performance for deserialization > --- > > Key: TIKA-3979 > URL: https://issues.apache.org/jira/browse/TIKA-3979 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 2.7.0 >Reporter: David Xie >Priority: Major > Attachments: image-2023-02-20-14-42-10-590.png, > image-2023-02-25-12-01-40-311.png > > > We noticed some performance issues specific to parsing OneNote files. Our cpu > profiler reports that the parser spends a lot of time on deserializing byte > arrays (image included below) > !image-2023-02-20-14-42-10-590.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-3979) OneNoteParser - Improve performance for deserialization
[ https://issues.apache.org/jira/browse/TIKA-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17693512#comment-17693512 ] Nicholas DiPiazza commented on TIKA-3979: - old and new appear to be the same binary equivalent so we are good here and i merged it !image-2023-02-25-12-01-40-311.png! > OneNoteParser - Improve performance for deserialization > --- > > Key: TIKA-3979 > URL: https://issues.apache.org/jira/browse/TIKA-3979 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 2.7.0 >Reporter: David Xie >Priority: Major > Attachments: image-2023-02-20-14-42-10-590.png, > image-2023-02-25-12-01-40-311.png > > > We noticed some performance issues specific to parsing OneNote files. Our cpu > profiler reports that the parser spends a lot of time on deserializing byte > arrays (image included below) > !image-2023-02-20-14-42-10-590.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-3970) Certain OneNote documents produce duplicate text
[ https://issues.apache.org/jira/browse/TIKA-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17692989#comment-17692989 ] Nicholas DiPiazza commented on TIKA-3970: - So on Windows PC I log into [https://account.microsoft.com/services/microsoft365/details#install] Then click where it says Install Office Eventually you should have a copy of office installed on your machine. Then you should be able to open all the files: tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/resources/test-documents/testOneNote.one tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/resources/test-documents/testOneNote1.one tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/resources/test-documents/testOneNote2.one tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/resources/test-documents/testOneNote3.one tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/resources/test-documents/testOneNote4.one tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/resources/test-documents/testOneNote2007OrEarlier.one tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/resources/test-documents/testOneNote2016.one tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/resources/test-documents/testOneNoteEmbeddedWordDoc.one tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/resources/test-documents/testOneNoteFromOffice365.one tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/resources/test-documents/testOneNoteFromOffice365-2.one 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/resources/test-documents/test-tika-3970-dupetext.one > Certain OneNote documents produce duplicate text > > > Key: TIKA-3970 > URL: https://issues.apache.org/jira/browse/TIKA-3970 > Project: Tika > Issue Type: Bug > Components: app >Affects Versions: 2.7.0 >Reporter: David Avant >Priority: Minor > Attachments: Screenshot 2023-02-21 at 3.43.08 PM.png, > lyrics-crawlAllFileNodesFromRoot-false.txt, lyrics.docx, lyrics.one, > lyrics.txt > > > Extracting text from certain OneNote documents produces more text than is > actually in the document. In this case, the OneNote document was created > by opening a Word document and "printing" it to the OneNote. > To reproduce the issue, open the attached "lyrics.one" using the Tika App > version 2.7.0 and view the plain text. Look for the phrase "Sunday > Morning" and observe that there are 14 occurrences. However in the actual > displayed text, it occurs only once. > The original text in this document is only about 12K characters, but the > extracted text from tika is over 300K. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-3970) Certain OneNote documents produce duplicate text
[ https://issues.apache.org/jira/browse/TIKA-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17692984#comment-17692984 ] Nicholas DiPiazza commented on TIKA-3970: - > Should we reverse the iteration order of the pages? I notice that we're > getting page2 then page1 in one of our existing tests. So this might be a > feature or something we're missing in our implementation? I couldn't find > anything in the spec about this. Related: I noticed a "page number" property > for each of the page nodes in the attached file. Maybe we could use that info > to order the pages when it exists? Yeah sure! That sounds like a really good idea. > This would require some walking the tree and caching page order. I'm happy to > give it a try. Yeah! that's what I spent a few hours doing with this PR above. I need to spend some more time on it probably i just kinda got the Jira's test case to work. > Side note: I'm still really frustrated that I can't open a bunch of these > files in OneNote even after I set up my Microsoft account and save the files > in OneDrive. Yeah so there are two types of OneNote files, the MS-ONESTORE spec, and the ones that use the alternative packaging MS-FSSHTTPD. If you open a file from onenote office 365, it will use the alternative packaging. If you open a file from onenote from local microsoft office 365, it will use the ms-onestore spec. So I think you might need to grab a copy of MS office: [https://support.microsoft.com/en-us/office/use-the-office-offline-installer-f0a85fe7-118f-41cb-a791-d59cef96ad1c] you could then work with this. 
> Certain OneNote documents produce duplicate text > > > Key: TIKA-3970 > URL: https://issues.apache.org/jira/browse/TIKA-3970 > Project: Tika > Issue Type: Bug > Components: app >Affects Versions: 2.7.0 >Reporter: David Avant >Priority: Minor > Attachments: Screenshot 2023-02-21 at 3.43.08 PM.png, > lyrics-crawlAllFileNodesFromRoot-false.txt, lyrics.docx, lyrics.one, > lyrics.txt > > > Extracting text from certain OneNote documents produces more text than is > actually in the document. In this case, the OneNote document was created > by opening a Word document and "printing" it to the OneNote. > To reproduce the issue, open the attached "lyrics.one" using the Tika App > version 2.7.0 and view the plain text. Look for the phrase "Sunday > Morning" and observe that there are 14 occurrences. However in the actual > displayed text, it occurs only once. > The original text in this document is only about 12K characters, but the > extracted text from tika is over 300K. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
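The page-ordering idea from the comment above (use the "page number" property when it exists, otherwise keep document order) could be sketched like this. `Page` is a stand-in for the parser's page node, not an actual Tika type:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class PageOrderSketch {
    // Stand-in for a OneNote page node; pageNumber is null when the property is absent.
    record Page(String title, Integer pageNumber) {}

    // Pages carrying a page-number property sort by it; because the sort is
    // stable, pages without one keep their original document order at the end.
    static List<Page> order(List<Page> docOrder) {
        List<Page> sorted = new ArrayList<>(docOrder);
        sorted.sort(Comparator.comparingInt(
                (Page p) -> p.pageNumber() == null ? Integer.MAX_VALUE : p.pageNumber()));
        return sorted;
    }

    public static void main(String[] args) {
        // Mirrors the observed case: the tree walk yields page2 before page1.
        List<Page> pages = List.of(new Page("page2", 2), new Page("page1", 1));
        System.out.println(order(pages).get(0).title()); // page1
    }
}
```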
[jira] [Created] (TIKA-3881) fix testAttachingADebuggerOnTheForkedParserShouldWork test - do not use hard coded port
Nicholas DiPiazza created TIKA-3881: --- Summary: fix testAttachingADebuggerOnTheForkedParserShouldWork test - do not use hard coded port Key: TIKA-3881 URL: https://issues.apache.org/jira/browse/TIKA-3881 Project: Tika Issue Type: Test Components: tika-app Reporter: Nicholas DiPiazza testAttachingADebuggerOnTheForkedParserShouldWork uses a hard-coded port. It should look for an available port instead and use that. -- This message was sent by Atlassian Jira (v8.20.10#820010)
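The fix this ticket asks for can be sketched as follows. The helper name findAvailablePort is hypothetical, but the underlying trick is standard java.net behavior: binding a ServerSocket to port 0 makes the OS pick an unused ephemeral port.

```java
import java.io.IOException;
import java.net.ServerSocket;

// Sketch of replacing a hard-coded debugger port in a test with one the
// OS reports as free. Note there is a small race window: another process
// could grab the port between the socket closing and the forked JVM
// binding it, but this is the common idiom for tests.
public class FreePort {
    static int findAvailablePort() throws IOException {
        try (ServerSocket socket = new ServerSocket(0)) {
            // Port 0 asks the OS for any unused ephemeral port.
            return socket.getLocalPort();
        }
    }

    public static void main(String[] args) throws IOException {
        int port = findAvailablePort();
        // The port would then be passed to the forked JVM's debug agent.
        String jvmArg = "-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=" + port;
        System.out.println(jvmArg);
    }
}
```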
[jira] [Resolved] (TIKA-3879) add test containers test for s3 fetcher, emitter and pipe iterators
[ https://issues.apache.org/jira/browse/TIKA-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas DiPiazza resolved TIKA-3879. - Resolution: Implemented > add test containers test for s3 fetcher, emitter and pipe iterators > --- > > Key: TIKA-3879 > URL: https://issues.apache.org/jira/browse/TIKA-3879 > Project: Tika > Issue Type: Test > Components: tika-pipes >Reporter: Nicholas DiPiazza >Priority: Major > > need to add a testcontainers integration test for s3. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-3879) add test containers test for s3 fetcher, emitter and pipe iterators
Nicholas DiPiazza created TIKA-3879: --- Summary: add test containers test for s3 fetcher, emitter and pipe iterators Key: TIKA-3879 URL: https://issues.apache.org/jira/browse/TIKA-3879 Project: Tika Issue Type: Test Components: tika-pipes Reporter: Nicholas DiPiazza need to add a testcontainers integration test for s3. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-3835) tika pipes parse cache - avoid re-parsing content that has not changed
[ https://issues.apache.org/jira/browse/TIKA-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601463#comment-17601463 ] Nicholas DiPiazza commented on TIKA-3835: - Yeah quickly realizing in my case, because i have solr already, it's better to just store the parsed output in solr than s3. though the s3 option is good. so an interface that supports caching where ever you want is good. > tika pipes parse cache - avoid re-parsing content that has not changed > -- > > Key: TIKA-3835 > URL: https://issues.apache.org/jira/browse/TIKA-3835 > Project: Tika > Issue Type: New Feature > Components: tika-pipes >Affects Versions: 2.2.0 >Reporter: Nicholas DiPiazza >Priority: Major > > Tika pipes should have an optional configuration to archive parsed results. > These archived results can be returned in the case that the same exact > version of a document had already been parsed previously, pull the parsed > output from a "parse cache" instead of repeating the fetch+parse. > In other words, skip the fetch+parse if you did it previously. > Benefits of this: > * When the tika pipe fetcher is using a cloud service, documents are rate > limited heavily. So if you manage to get a document and parse it, storing it > for future use is very important. > * Multi tier environments can be populated faster. Example: You are pulling > data from an app in dev, staging and production. When you run the tika pipe > job, it will parse each document 1 time. All the other environments can now > re-use the parsed output - saving days of run time (in my case). > ** In other words, "full crawls" for your initial tika index on duplicate > environments is reduced to cache lookups. > So the process would be > * pipe iterator has the next document: \{lastUpdated,docID} > ** pipe iterator documents have an optional field: *cache* _boolean -_ > default=true. If cache=false, will not cache this doc. 
> * if parse cache is enabled, *cache* field != false, and parse cache > contains \{lastUpdated,docID} > ** Get \{lastUpdated,docID} document from the cache and push to the emit > queue and return. > * Parse document > * If parse cache is enabled, and *cache* field != false, put into cache > key=\{lastUpdated,docID}, value=\{document,metadata} > ** Additional conditions can dictate what documents we store in the cache > and what ones we don't bother. Such as numBytesInBody, etc. > The cache would need to be disk or network based storage because of the > storage size. In-memory cache would not be feasible. > The parser cache should be based on an interface so that the user can use > several varieties of implementations such as: > * File cache > * S3 implementation cache > * Others.. -- This message was sent by Atlassian Jira (v8.20.10#820010)
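The per-document flow in the description above could be sketched as a minimal pluggable interface. All names here (ParseCache, CacheKey, ParsedResult, process) are hypothetical, not actual tika-pipes API; the in-memory map in main is only for demonstration, since the ticket itself notes a production cache would need disk or network storage.

```java
import java.util.*;

// Sketch of the described flow: consult the cache first when the doc's
// cache field is true, and fetch+parse (then store) only on a miss.
public class ParseCacheSketch {
    record CacheKey(String docId, long lastUpdated) {}
    record ParsedResult(String document, String metadata) {}

    // Interface so users can plug in file-, S3-, or other-backed caches.
    interface ParseCache {
        Optional<ParsedResult> get(CacheKey key);
        void put(CacheKey key, ParsedResult value);
    }

    // Stand-in for the real fetch+parse step.
    static ParsedResult fetchAndParse(CacheKey key) {
        return new ParsedResult("body of " + key.docId(), "{}");
    }

    static ParsedResult process(CacheKey key, boolean cacheField, ParseCache cache) {
        if (cacheField) {
            Optional<ParsedResult> hit = cache.get(key);
            if (hit.isPresent()) {
                return hit.get(); // cache HIT: push to the emit queue and return
            }
        }
        ParsedResult parsed = fetchAndParse(key);
        if (cacheField) {
            cache.put(key, parsed); // further conditions (e.g. body size) could gate this
        }
        return parsed;
    }

    public static void main(String[] args) {
        Map<CacheKey, ParsedResult> store = new HashMap<>(); // demo only
        ParseCache cache = new ParseCache() {
            public Optional<ParsedResult> get(CacheKey key) { return Optional.ofNullable(store.get(key)); }
            public void put(CacheKey key, ParsedResult value) { store.put(key, value); }
        };
        CacheKey key = new CacheKey("doc-1", 1660000000L);
        ParsedResult first = process(key, true, cache);
        ParsedResult second = process(key, true, cache); // served from the cache
        System.out.println(first.equals(second)); // true: records compare by value
    }
}
```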
[jira] [Comment Edited] (TIKA-3835) tika pipes parse cache - avoid re-parsing content that has not changed
[ https://issues.apache.org/jira/browse/TIKA-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17578666#comment-17578666 ] Nicholas DiPiazza edited comment on TIKA-3835 at 8/11/22 8:53 PM: -- [~tallison] i was wondering same thing. For now just taking a tuple for a key such as \{fileId, lastUpdatedOn}. we would then turn that into a UUID5 long integer, or just use some string form of it such as `\{fileId}|\{timestamp}`. some file sources actually have a checksum available (box.com has that) in those cases you could use the checksum as the parse cache key. was (Author: JIRAUSER294298): [~tallison] i was wondering same thing. For now just taking a tuple for a key such as \{fileId, lastUpdatedOn}. we would then turn that into a UUID5 long integer, or just use some string form of it such as `\{fileId}|\{timestamp}` -- This message was sent by Atlassian Jira (v8.20.10#820010)
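The key derivation discussed in this comment could look roughly like the following. One assumption worth flagging: java.util.UUID.nameUUIDFromBytes produces a version-3 (MD5) name-based UUID, not the version-5 (SHA-1) UUID mentioned in the comment; a true v5 key would need a small custom implementation or a third-party library. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Sketch: combine {fileId, lastUpdatedOn} into a cache key, either as a
// plain string or as a deterministic name-based UUID derived from it.
public class CacheKeys {
    static String stringKey(String fileId, long lastUpdatedOn) {
        return fileId + "|" + lastUpdatedOn;
    }

    static UUID uuidKey(String fileId, long lastUpdatedOn) {
        // Version-3 (MD5) name-based UUID; stable for the same inputs.
        return UUID.nameUUIDFromBytes(
                stringKey(fileId, lastUpdatedOn).getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        System.out.println(stringKey("doc-42", 1660000000L)); // doc-42|1660000000
        System.out.println(uuidKey("doc-42", 1660000000L));
    }
}
```

When a source exposes a content checksum (as box.com does), that checksum could replace the {fileId, lastUpdatedOn} tuple as the key input.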
[jira] [Comment Edited] (TIKA-3835) tika pipes parse cache - avoid re-parsing content that has not changed
[ https://issues.apache.org/jira/browse/TIKA-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17578666#comment-17578666 ] Nicholas DiPiazza edited comment on TIKA-3835 at 8/11/22 8:52 PM: -- [~tallison] i was wondering same thing. For now just taking a tuple for a key such as \{fileId, lastUpdatedOn}. we would then turn that into a UUID5 long integer, or just use some string form of it such as `\{fileId}|\{timestamp}` was (Author: JIRAUSER294298): [~tallison] i was wondering same thing. For now just taking a tuple for a key such as \{fileId, lastUpdatedOn}. we would then turn that into a UUID5 long integer. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-3835) tika pipes parse cache - avoid re-parsing content that has not changed
[ https://issues.apache.org/jira/browse/TIKA-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17578666#comment-17578666 ] Nicholas DiPiazza commented on TIKA-3835: - [~tallison] i was wondering same thing. For now just taking a tuple for a key such as \{fileId, lastUpdatedOn}. we would then turn that into a UUID5 long integer. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-3835) tika pipes parse cache - avoid re-parsing content that has not changed
[ https://issues.apache.org/jira/browse/TIKA-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas DiPiazza updated TIKA-3835: Description: Tika pipes should have an optional configuration to archive parsed results. These archived results can be returned in the case that the same exact version of a document had already been parsed previously, pull the parsed output from a "parse cache" instead of repeating the fetch+parse. In other words, skip the fetch+parse if you did it previously. Benefits of this: * When the tika pipe fetcher is using a cloud service, documents are rate limited heavily. So if you manage to get a document and parse it, storing it for future use is very important. * Multi tier environments can be populated faster. Example: You are pulling data from an app in dev, staging and production. When you run the tika pipe job, it will parse each document 1 time. All the other environments can now re-use the parsed output - saving days of run time (in my case). ** In other words, "full crawls" for your initial tika index on duplicate environments is reduced to cache lookups. So the process would be * pipe iterator has the next document: \{lastUpdated,docID} ** pipe iterator documents have an optional field: *cache* _boolean -_ default=true. If cache=false, will not cache this doc. * if parse cache is enabled, *cache* field != false, and parse cache contains \{lastUpdated,docID} ** Get \{lastUpdated,docID} document from the cache and push to the emit queue and return. * Parse document * If parse cache is enabled, and *cache* field != false, put into cache key=\{lastUpdated,docID}, value=\{document,metadata} ** Additional conditions can dictate what documents we store in the cache and what ones we don't bother. Such as numBytesInBody, etc. The cache would need to be disk or network based storage because of the storage size. In-memory cache would not be feasible. 
The parser cache should be based on an interface so that the user can use several varieties of implementations such as: * File cache * S3 implementation cache * Others.. was: Tika pipes should have an optional configuration to archive parsed results. These archived results can be returned in the case that the same exact version of a document had already been parsed previously, pull the parsed output from a "parse cache" instead of repeating the fetch+parse. In other words, skip the fetch+parse if you did it previously. Benefits of this: * When the tika pipe fetcher is using a cloud service, documents are rate limited heavily. So if you manage to get a document and parse it, storing it for future use is very important. * Multi tier environments can be populated faster. Example: You are pulling data from an app in dev, staging and production. When you run the tika pipe job, it will parse each document 1 time. All the other environments can now re-use the parsed output - saving days of run time (in my case). ** In other words, "full crawls" for your initial tika index on duplicate environments is reduced to cache lookups. So the process would be * pipe iterator has the next document: \{lastUpdated,docID} ** pipe iterator documents have an optional field: *cache* _boolean -_ default=true. If cache=false, will not cache this doc. * if parse cache is enabled and parse cache contains \{lastUpdated,docID} ** Get \{lastUpdated,docID} document from the cache and push to the emit queue and return. * Parse document * If parse cache is enabled, and *cache* field != false, put into cache key=\{lastUpdated,docID}, value=\{document,metadata} ** Additional conditions can dictate what documents we store in the cache and what ones we don't bother. Such as numBytesInBody, etc. The cache would need to be disk or network based storage because of the storage size. In-memory cache would not be feasible. 
The parser cache should be based on an interface so that the user can use several varieties of implementations such as: * File cache * S3 implementation cache * Others.. > tika pipes parse cache - avoid re-parsing content that has not changed > -- > > Key: TIKA-3835 > URL: https://issues.apache.org/jira/browse/TIKA-3835 > Project: Tika > Issue Type: New Feature > Components: tika-pipes >Affects Versions: 2.2.0 >Reporter: Nicholas DiPiazza >Priority: Major > > Tika pipes should have an optional configuration to archive parsed results. > These archived results can be returned in the case that the same exact > version of a document had already been parsed previously, pull the parsed > output from a "parse cache" instead of repeating the fetch+parse. > In other words, skip the fetch+parse if you did it previously. > Benefits of this: > * When the tika pipe
[jira] [Updated] (TIKA-3835) tika pipes parse cache - avoid re-parsing content that has not changed
[ https://issues.apache.org/jira/browse/TIKA-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas DiPiazza updated TIKA-3835: Description: Tika pipes should have an optional configuration to archive parsed results. These archived results can be returned in the case that the same exact version of a document had already been parsed previously, pull the parsed output from a "parse cache" instead of repeating the fetch+parse. In other words, skip the fetch+parse if you did it previously. Benefits of this: * When the tika pipe fetcher is using a cloud service, documents are rate limited heavily. So if you manage to get a document and parse it, storing it for future use is very important. * Multi tier environments can be populated faster. Example: You are pulling data from an app in dev, staging and production. When you run the tika pipe job, it will parse each document 1 time. All the other environments can now re-use the parsed output - saving days of run time (in my case). ** In other words, "full crawls" for your initial tika index on duplicate environments is reduced to cache lookups. So the process would be * pipe iterator has the next document: \{lastUpdated,docID} ** pipe iterator documents have an optional field: *cache* _boolean -_ default=true. If cache=false, will not cache this doc. * if parse cache is enabled and parse cache contains \{lastUpdated,docID} ** Get \{lastUpdated,docID} document from the cache and push to the emit queue and return. * Parse document * If parse cache is enabled, and *cache* field != false, put into cache key=\{lastUpdated,docID}, value=\{document,metadata} ** Additional conditions can dictate what documents we store in the cache and what ones we don't bother. Such as numBytesInBody, etc. The cache would need to be disk or network based storage because of the storage size. In-memory cache would not be feasible. 
The parser cache should be based on an interface so that the user can use several varieties of implementations such as: * File cache * S3 implementation cache * Others.. was: Tika pipes should have an optional configuration to archive parsed results. These archived results can be returned in the case that the same exact version of a document had already been parsed previously, pull the parsed output from a "parse cache" instead of repeating the fetch+parse. In other words, skip the fetch+parse if you did it previously. Benefits of this: * When the tika pipe fetcher is using a cloud service, documents are rate limited heavily. So if you manage to get a document and parse it, storing it for future use is very important. * Multi tier environments can be populated faster. Example: You are pulling data from an app in dev, staging and production. When you run the tika pipe job, it will parse each document 1 time. All the other environments can now re-use the parsed output - saving days of run time (in my case). ** In other words, "full crawls" for your initial tika index on duplicate environments is reduced to cache lookups. So the process would be * pipe iterator has the next document: \{lastUpdated,docID} ** pipe iterator documents have an optional field: *cache* _boolean -_ default=true. If cache=false, will not cache this doc. * if parse cache is enabled and parse cache contains \{lastUpdated,docID} ** Get \{lastUpdated,docID} document from the cache and push to the emit queue and return. * Parse document * If parse cache is enabled, and *cache* field != false, put into cache key=\{lastUpdated,docID}, value=\{document,metadata} ** Additional conditions can dictate what documents we store in the cache and what ones we don't bother. Such as numBytesInBody, etc. The cache would need to be disk or network based storage because of the storage size. In-memory cache would not be feasible. 
A cache lookup HIT could be pushed via a separate queue so that batching can be utilized asynchronously. The parser cache should be based on an interface so that the user can use several varieties of implementations such as: * File cache * S3 implementation cache * Others.. > tika pipes parse cache - avoid re-parsing content that has not changed > -- > > Key: TIKA-3835 > URL: https://issues.apache.org/jira/browse/TIKA-3835 > Project: Tika > Issue Type: New Feature > Components: tika-pipes >Affects Versions: 2.2.0 >Reporter: Nicholas DiPiazza >Priority: Major > > Tika pipes should have an optional configuration to archive parsed results. > These archived results can be returned in the case that the same exact > version of a document had already been parsed previously, pull the parsed > output from a "parse cache" instead of repeating the fetch+parse. > In other wo
[jira] [Updated] (TIKA-3835) tika pipes parse cache - avoid re-parsing content that has not changed
[ https://issues.apache.org/jira/browse/TIKA-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas DiPiazza updated TIKA-3835: Description: Tika pipes should have an optional configuration to archive parsed results. These archived results can be returned in the case that the same exact version of a document had already been parsed previously, pull the parsed output from a "parse cache" instead of repeating the fetch+parse. In other words, skip the fetch+parse if you did it previously. Benefits of this: * When the tika pipe fetcher is using a cloud service, documents are rate limited heavily. So if you manage to get a document and parse it, storing it for future use is very important. * Multi tier environments can be populated faster. Example: You are pulling data from an app in dev, staging and production. When you run the tika pipe job, it will parse each document 1 time. All the other environments can now re-use the parsed output - saving days of run time (in my case). ** In other words, "full crawls" for your initial tika index on duplicate environments is reduced to cache lookups. So the process would be * pipe iterator has the next document: \{lastUpdated,docID} ** pipe iterator documents have an optional field: *cache* _boolean -_ default=true. If cache=false, will not cache this doc. * if parse cache is enabled and parse cache contains \{lastUpdated,docID} ** Get \{lastUpdated,docID} document from the cache and push to the emit queue and return. * Parse document * If parse cache is enabled, and *cache* field != false, put into cache key=\{lastUpdated,docID}, value=\{document,metadata} ** Additional conditions can dictate what documents we store in the cache and what ones we don't bother. Such as numBytesInBody, etc. The cache would need to be disk or network based storage because of the storage size. In-memory cache would not be feasible. 
A cache lookup HIT could be pushed via a separate queue so that batching can be utilized asynchronously. The parser cache should be based on an interface so that the user can use several varieties of implementations such as: * File cache * S3 implementation cache * Others.. was: Tika pipes should have an optional configuration to archive parsed results. These archived results can be returned in the case that the same exact version of a document had already been parsed previously, pull the parsed output from a "parse cache" instead of repeating the fetch+parse. In other words, skip the fetch+parse if you did it previously. Benefits of this: * When the tika pipe fetcher is using a cloud service, documents are rate limited heavily. So if you manage to get a document and parse it, storing it for future use is very important. * Multi tier environments can be populated faster. Example: You are pulling data from an app in dev, staging and production. When you run the tika pipe job, it will parse each document 1 time. All the other environments can now re-use the parsed output - saving days of run time (in my case). ** In other words, "full crawls" for your initial tika index on duplicate environments is reduced to cache lookups. So the process would be * pipe iterator has the next document: \{lastUpdated,docID} ** pipe iterator documents have an optional field: *cache* _boolean -_ default=true. If cache=false, will not cache this doc. * if parse cache is enabled and parse cache contains \{lastUpdated,docID} ** Get \{lastUpdated,docID} document from the cache and push to the emit queue and return. * Parse document * If parse cache is enabled, and *cache* field != false, put into cache key=\{lastUpdated,docID}, value=\{document,metadata} ** Additional conditions can dictate what documents we store in the cache and what ones we don't bother. The cache would need to be disk or network based storage because of the storage size. 
In-memory cache would not be feasible. The parser cache should be based on an interface so that the user can use several varieties of implementations such as: * File cache * S3 implementation cache * Others.. > tika pipes parse cache - avoid re-parsing content that has not changed > -- > > Key: TIKA-3835 > URL: https://issues.apache.org/jira/browse/TIKA-3835 > Project: Tika > Issue Type: New Feature > Components: tika-pipes >Affects Versions: 2.2.0 >Reporter: Nicholas DiPiazza >Priority: Major > > Tika pipes should have an optional configuration to archive parsed results. > These archived results can be returned in the case that the same exact > version of a document had already been parsed previously, pull the parsed > output from a "parse cache" instead of repeating the fetch+parse. > In other words, skip the fetch+parse if
[jira] [Updated] (TIKA-3835) tika pipes parse cache - avoid re-parsing content that has not changed
[ https://issues.apache.org/jira/browse/TIKA-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas DiPiazza updated TIKA-3835: Description: Tika pipes should have an optional configuration to archive parsed results. These archived results can be returned in the case that the same exact version of a document had already been parsed previously, pull the parsed output from a "parse cache" instead of repeating the fetch+parse. In other words, skip the fetch+parse if you did it previously. Benefits of this: * When the tika pipe fetcher is using a cloud service, documents are rate limited heavily. So if you manage to get a document and parse it, storing it for future use is very important. * Multi tier environments can be populated faster. Example: You are pulling data from an app in dev, staging and production. When you run the tika pipe job, it will parse each document 1 time. All the other environments can now re-use the parsed output - saving days of run time (in my case). ** In other words, "full crawls" for your initial tika index on duplicate environments is reduced to cache lookups. So the process would be * pipe iterator has the next document: \{lastUpdated,docID} ** pipe iterator documents have an optional field: *cache* _boolean -_ default=true. If cache=false, will not cache this doc. * if parse cache is enabled and parse cache contains \{lastUpdated,docID} ** Get \{lastUpdated,docID} document from the cache and push to the emit queue and return. * Parse document * If parse cache is enabled, and *cache* field != false, put into cache key=\{lastUpdated,docID}, value=\{document,metadata} ** Additional conditions can dictate what documents we store in the cache and what ones we don't bother. The cache would need to be disk or network based storage because of the storage size. In-memory cache would not be feasible. 
The parser cache should be based on an interface so that the user can use several varieties of implementations such as: * File cache * S3 implementation cache * Others.. was: Tika pipes should have an optional configuration to archive parsed results. These archived results can be returned in the case that the same exact version of a document had already been parsed previously, pull the parsed output from a "parse cache" instead of repeating the fetch+parse. In other words, skip the fetch+parse if you did it previously. Benefits of this: * When the tika pipe fetcher is using a cloud service, documents are rate limited heavily. So if you manage to get a document and parse it, storing it for future use is very important. * Multi tier environments can be populated faster. Example: You are pulling data from an app in dev, staging and production. When you run the tika pipe job, it will parse each document 1 time. All the other environments can now re-use the parsed output - saving days of run time (in my case). ** In other words, "full crawls" for your initial tika index on duplicate environments is reduced to cache lookups. So the process would be * pipe iterator has the next document: \{lastUpdated,docID} ** pipe iterator documents have an optional field: *cache* _boolean -_ default=true. If cache=false, will not cache this doc. * if parse cache is enabled and parse cache contains \{lastUpdated,docID} ** Get \{lastUpdated,docID} document from the cache and push to the emit queue and return. * Parse document * If parse cache is enabled, put into cache key=\{lastUpdated,docID}, value=\{document,metadata} The cache would need to be disk or network based storage because of the storage size. In-memory cache would not be feasible. The parser cache should be based on an interface so that the user can use several varieties of implementations such as: * File cache * S3 implementation cache * Others.. 
> tika pipes parse cache - avoid re-parsing content that has not changed > -- > > Key: TIKA-3835 > URL: https://issues.apache.org/jira/browse/TIKA-3835 > Project: Tika > Issue Type: New Feature > Components: tika-pipes >Affects Versions: 2.2.0 >Reporter: Nicholas DiPiazza >Priority: Major > > Tika pipes should have an optional configuration to archive parsed results. > These archived results can be returned in the case that the same exact > version of a document had already been parsed previously, pull the parsed > output from a "parse cache" instead of repeating the fetch+parse. > In other words, skip the fetch+parse if you did it previously. > Benefits of this: > * When the tika pipe fetcher is using a cloud service, documents are rate > limited heavily. So if you manage to get a document and parse it, storing it > for future use is very important. > * Multi tier environments can be populated fas
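[Editorial note] The lookup/parse/populate flow proposed in the description above can be sketched roughly as follows. This is a minimal illustration only, not Tika's actual API: the `ParseCache` interface, `CacheKey` record, and `fetchAndEmit` method are all hypothetical names invented for the sketch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class ParseCacheSketch {
    // Hypothetical cache key: a doc is re-parsed only if docID or lastUpdated changed.
    record CacheKey(String docId, long lastUpdated) {}

    // Parsed output: extracted text plus metadata.
    record ParsedResult(String text, Map<String, String> metadata) {}

    // Pluggable interface, per the ticket, so file-based, S3, or other backends can be swapped in.
    interface ParseCache {
        Optional<ParsedResult> get(CacheKey key);
        void put(CacheKey key, ParsedResult value);
    }

    // In-memory stand-in for illustration only; the ticket explicitly calls for
    // disk or network storage, since an in-memory cache would not be feasible.
    static class MapParseCache implements ParseCache {
        private final Map<CacheKey, ParsedResult> store = new HashMap<>();
        public Optional<ParsedResult> get(CacheKey key) { return Optional.ofNullable(store.get(key)); }
        public void put(CacheKey key, ParsedResult value) { store.put(key, value); }
    }

    static int parseCount = 0;

    // The flow from the description: cache hit -> return the cached result for the
    // emit queue; miss -> fetch+parse, then cache (unless the doc opted out with cache=false).
    static ParsedResult fetchAndEmit(CacheKey key, boolean cacheable, ParseCache cache) {
        if (cache != null) {
            Optional<ParsedResult> hit = cache.get(key);
            if (hit.isPresent()) {
                return hit.get();          // push to emit queue without re-parsing
            }
        }
        parseCount++;                      // stand-in for the real fetch+parse
        ParsedResult result = new ParsedResult("parsed:" + key.docId(), Map.of());
        if (cache != null && cacheable) {
            cache.put(key, result);
        }
        return result;
    }

    public static void main(String[] args) {
        ParseCache cache = new MapParseCache();
        CacheKey key = new CacheKey("doc-1", 1000L);
        fetchAndEmit(key, true, cache);                          // miss: parses and caches
        fetchAndEmit(key, true, cache);                          // hit: no re-parse
        fetchAndEmit(new CacheKey("doc-1", 2000L), true, cache); // lastUpdated changed: re-parse
        System.out.println(parseCount);
    }
}
```

Because the key includes lastUpdated, an updated document naturally misses the cache and is re-parsed, which is how the "avoid re-parsing content that has not changed" goal is met without any explicit invalidation step.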
[jira] [Updated] (TIKA-3835) tika pipes parse cache - avoid re-parsing content that has not changed
[ https://issues.apache.org/jira/browse/TIKA-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas DiPiazza updated TIKA-3835: Description: Tika pipes should have an optional configuration to archive parsed results. These archived results can be returned in the case that the same exact version of a document had already been parsed previously, pull the parsed output from a "parse cache" instead of repeating the fetch+parse. In other words, skip the fetch+parse if you did it previously. Benefits of this: * When the tika pipe fetcher is using a cloud service, documents are rate limited heavily. So if you manage to get a document and parse it, storing it for future use is very important. * Multi tier environments can be populated faster. Example: You are pulling data from an app in dev, staging and production. When you run the tika pipe job, it will parse each document 1 time. All the other environments can now re-use the parsed output - saving days of run time (in my case). ** In other words, "full crawls" for your initial tika index on duplicate environments is reduced to cache lookups. So the process would be * pipe iterator has the next document: \{lastUpdated,docID} ** pipe iterator documents have an optional field: *cache* _boolean -_ default=true. If cache=false, will not cache this doc. * if parse cache is enabled and parse cache contains \{lastUpdated,docID} ** Get \{lastUpdated,docID} document from the cache and push to the emit queue and return. * Parse document * If parse cache is enabled, put into cache key=\{lastUpdated,docID}, value=\{document,metadata} The cache would need to be disk or network based storage because of the storage size. In-memory cache would not be feasible. The parser cache should be based on an interface so that the user can use several varieties of implementations such as: * File cache * S3 implementation cache * Others.. was: Tika pipes should have an optional configuration to archive parsed results. 
These archived results can be returned in the case that the same exact version of a document had already been parsed previously, pull the parsed output from a "parse cache" instead of repeating the fetch+parse. In other words, skip the fetch+parse if you did it previously. Benefits of this: * When the tika pipe fetcher is using a cloud service, documents are rate limited heavily. So if you manage to get a document and parse it, storing it for future use is very important. * Multi tier environments can be populated faster. Example: You are pulling data from an app in dev, staging and production. When you run the tika pipe job, it will parse each document 1 time. All the other environments can now re-use the parsed output - saving days of run time (in my case). ** In other words, "full crawls" for your initial tika index on duplicate environments is reduced to cache lookups. So the process would be * pipe iterator has the next document: \{lastUpdated,docID} * if parse cache is enabled and parse cache contains \{lastUpdated,docID} ** Get \{lastUpdated,docID} document from the cache and push to the emit queue and return. * Parse document * If parse cache is enabled, put into cache key=\{lastUpdated,docID}, value=\{document,metadata} The cache would need to be disk or network based storage because of the storage size. In-memory cache would not be feasible. The parser cache should be based on an interface so that the user can use several varieties of implementations such as: * File cache * S3 implementation cache * Others.. > tika pipes parse cache - avoid re-parsing content that has not changed > -- > > Key: TIKA-3835 > URL: https://issues.apache.org/jira/browse/TIKA-3835 > Project: Tika > Issue Type: New Feature > Components: tika-pipes >Affects Versions: 2.2.0 >Reporter: Nicholas DiPiazza >Priority: Major > > Tika pipes should have an optional configuration to archive parsed results. 
> These archived results can be returned in the case that the same exact > version of a document had already been parsed previously, pull the parsed > output from a "parse cache" instead of repeating the fetch+parse. > In other words, skip the fetch+parse if you did it previously. > Benefits of this: > * When the tika pipe fetcher is using a cloud service, documents are rate > limited heavily. So if you manage to get a document and parse it, storing it > for future use is very important. > * Multi tier environments can be populated faster. Example: You are pulling > data from an app in dev, staging and production. When you run the tika pipe > job, it will parse each document 1 time. All the other environments can now > re-use the parsed output - saving days of run time (in my case). > ** I
[jira] [Updated] (TIKA-3835) tika pipes parse cache - avoid re-parsing content that has not changed
[ https://issues.apache.org/jira/browse/TIKA-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas DiPiazza updated TIKA-3835: Description: Tika pipes should have an optional configuration to archive parsed results. These archived results can be returned in the case that the same exact version of a document had already been parsed previously, pull the parsed output from a "parse cache" instead of repeating the fetch+parse. In other words, skip the fetch+parse if you did it previously. Benefits of this: * When the tika pipe fetcher is using a cloud service, documents are rate limited heavily. So if you manage to get a document and parse it, storing it for future use is very important. * Multi tier environments can be populated faster. Example: You are pulling data from an app in dev, staging and production. When you run the tika pipe job, it will parse each document 1 time. All the other environments can now re-use the parsed output - saving days of run time (in my case). ** In other words, "full crawls" for your initial tika index on duplicate environments is reduced to cache lookups. So the process would be * pipe iterator has the next document: \{lastUpdated,docID} * if parse cache is enabled and parse cache contains \{lastUpdated,docID} ** Get \{lastUpdated,docID} document from the cache and push to the emit queue and return. * Parse document * If parse cache is enabled, put into cache key=\{lastUpdated,docID}, value=\{document,metadata} The cache would need to be disk or network based storage because of the storage size. In-memory cache would not be feasible. The parser cache should be based on an interface so that the user can use several varieties of implementations such as: * File cache * S3 implementation cache * Others.. was: Tika pipes should have an optional configuration to archive parsed results. 
These archived results can be returned in the case that the same exact version of a document had already been parsed previously, pull the parsed output from a "parse cache" instead of repeating the fetch+parse. In other words, skip the fetch+parse if you did it previously. Benefits of this: * When the tika pipe fetcher is using a cloud service, documents are rate limited heavily. So if you manage to get a document and parse it, storing it for future use is very important. * Multi tier environments can be populated faster. Example: You are pulling data from an app in dev, staging and production. When you run the tika pipe job, it will parse each document 1 time. All the other environments can now re-use the parsed output - saving days of run time (in my case). ** In other words, "full crawls" for your initial tika index on duplicate environments is reduced to cache lookups. So the process would be * pipe iterator has the next document: \{lastUpdated,docID} * if parse cache is enabled and parse cache contains \{lastUpdated,docID} ** Get \{lastUpdated,docID} document from the cache and push to the emit queue and return. * Parse document * If parse cache is enabled, put into cache key=\{lastUpdated,docID}, value=\{document,metadata} This will drastically improve full crawl times for people using services especially cloud file services with strict rate limits. The cache would need to be disk or network based storage because of the storage size. In-memory cache would not be feasible. The parser cache should be based on an interface so that the user can use several varieties of implementations such as: * File cache * S3 implementation cache * Others.. 
> tika pipes parse cache - avoid re-parsing content that has not changed > -- > > Key: TIKA-3835 > URL: https://issues.apache.org/jira/browse/TIKA-3835 > Project: Tika > Issue Type: New Feature > Components: tika-pipes >Affects Versions: 2.2.0 >Reporter: Nicholas DiPiazza >Priority: Major > > Tika pipes should have an optional configuration to archive parsed results. > These archived results can be returned in the case that the same exact > version of a document had already been parsed previously, pull the parsed > output from a "parse cache" instead of repeating the fetch+parse. > In other words, skip the fetch+parse if you did it previously. > Benefits of this: > * When the tika pipe fetcher is using a cloud service, documents are rate > limited heavily. So if you manage to get a document and parse it, storing it > for future use is very important. > * Multi tier environments can be populated faster. Example: You are pulling > data from an app in dev, staging and production. When you run the tika pipe > job, it will parse each document 1 time. All the other environments can now > re-use the parsed output - saving days of run time (in my case). > *
[jira] [Commented] (TIKA-3835) tika pipes parse cache - avoid re-parsing content that has not changed
[ https://issues.apache.org/jira/browse/TIKA-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17578591#comment-17578591 ] Nicholas DiPiazza commented on TIKA-3835: - i added a bunch more edits. done. ha sorry if that spammed your email super heavily. > tika pipes parse cache - avoid re-parsing content that has not changed > -- > > Key: TIKA-3835 > URL: https://issues.apache.org/jira/browse/TIKA-3835 > Project: Tika > Issue Type: New Feature > Components: tika-pipes >Affects Versions: 2.2.0 >Reporter: Nicholas DiPiazza >Priority: Major > > Tika pipes should have an optional configuration to archive parsed results. > These archived results can be returned in the case that the same exact > version of a document had already been parsed previously, pull the parsed > output from a "parse cache" instead of repeating the fetch+parse. > In other words, skip the fetch+parse if you did it previously. > Benefits of this: > * When the tika pipe fetcher is using a cloud service, documents are rate > limited heavily. So if you manage to get a document and parse it, storing it > for future use is very important. > * Multi tier environments can be populated faster. Example: You are pulling > data from an app in dev, staging and production. When you run the tika pipe > job, it will parse each document 1 time. All the other environments can now > re-use the parsed output - saving days of run time (in my case). > ** In other words, "full crawls" for your initial tika index on duplicate > environments is reduced to cache lookups. > So the process would be > * pipe iterator has the next document: \{lastUpdated,docID} > * if parse cache is enabled and parse cache contains \{lastUpdated,docID} > ** Get \{lastUpdated,docID} document from the cache and push to the emit > queue and return. 
> * Parse document > * If parse cache is enabled, put into cache key=\{lastUpdated,docID}, > value=\{document,metadata} > The cache would need to be disk or network based storage because of the > storage size. In-memory cache would not be feasible. > The parser cache should be based on an interface so that the user can use > several varieties of implementations such as: > * File cache > * S3 implementation cache > * Others.. -- This message was sent by Atlassian Jira (v8.20.10#820010)
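[Editorial note] As a concrete example of the pluggable-backend idea quoted above ("File cache, S3 implementation cache, Others"), a file-backed variant might look roughly like this. All names are hypothetical; this is a sketch of one possible implementation under the ticket's design, not Tika code.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

// Sketch of the "File cache" variant: one file per {lastUpdated,docID} key.
public class FileParseCacheSketch {
    private final Path root;

    public FileParseCacheSketch(Path root) throws IOException {
        this.root = Files.createDirectories(root);
    }

    // The key encodes docID + lastUpdated, so an updated document naturally
    // misses the cache and falls through to a fresh fetch+parse.
    private Path pathFor(String docId, long lastUpdated) {
        return root.resolve(docId + "_" + lastUpdated + ".json");
    }

    public Optional<String> get(String docId, long lastUpdated) throws IOException {
        Path p = pathFor(docId, lastUpdated);
        return Files.exists(p)
                ? Optional.of(Files.readString(p, StandardCharsets.UTF_8))
                : Optional.empty();
    }

    public void put(String docId, long lastUpdated, String parsedJson) throws IOException {
        // A write-then-atomic-move would be safer under concurrency; kept simple here.
        Files.writeString(pathFor(docId, lastUpdated), parsedJson, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempDirectory("parse-cache");
        FileParseCacheSketch cache = new FileParseCacheSketch(tmp);
        System.out.println(cache.get("doc-1", 1000L).isPresent()); // miss
        cache.put("doc-1", 1000L, "{\"text\":\"hello\"}");
        System.out.println(cache.get("doc-1", 1000L).isPresent()); // hit
        System.out.println(cache.get("doc-1", 2000L).isPresent()); // doc changed: miss
    }
}
```

An S3 implementation of the same interface would simply map the key to an object key instead of a file path, which is why the ticket asks for the cache to be interface-based.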
[jira] [Updated] (TIKA-3835) tika pipes parse cache - avoid re-parsing content that has not changed
[ https://issues.apache.org/jira/browse/TIKA-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas DiPiazza updated TIKA-3835: Description: Tika pipes should have an optional configuration to archive parsed results. These archived results can be returned in the case that the same exact version of a document had already been parsed previously, pull the parsed output from a "parse cache" instead of repeating the fetch+parse. In other words, skip the fetch+parse if you did it previously. Benefits of this: * When the tika pipe fetcher is using a cloud service, documents are rate limited heavily. So if you manage to get a document and parse it, storing it for future use is very important. * Multi tier environments can be populated faster. Example: You are pulling data from an app in dev, staging and production. When you run the tika pipe job, it will parse each document 1 time. All the other environments can now re-use the parsed output - saving days of run time (in my case). ** In other words, "full crawls" for your initial tika index on duplicate environments is reduced to cache lookups. So the process would be * pipe iterator has the next document: \{lastUpdated,docID} * if parse cache is enabled and parse cache contains \{lastUpdated,docID} ** Get \{lastUpdated,docID} document from the cache and push to the emit queue and return. * Parse document * If parse cache is enabled, put into cache key=\{lastUpdated,docID}, value=\{document,metadata} This will drastically improve full crawl times for people using services especially cloud file services with strict rate limits. The cache would need to be disk or network based storage because of the storage size. In-memory cache would not be feasible. The parser cache should be based on an interface so that the user can use several varieties of implementations such as: * File cache * S3 implementation cache * Others.. was: Tika pipes should have an optional configuration to archive parsed results. 
These archived results can be returned in the case that the same exact version of a document had already been parsed previously, pull the parsed output from a "parse cache" instead of repeating the fetch+parse. In other words, skip the fetch+parse if you did it previously. Benefits of this: * When the tika pipe fetcher is using a cloud service, documents are rate limited heavily. So if you manage to get a document and parse it, storing it for future use is very important. * Multi tier environments can be populated faster. Example: You are pulling data from an app in dev, staging and production. When you run the tika pipe job, it will parse each document 1 time. All the other environments can now re-use the parsed output - saving days of run time (in my case). ** In other words, "full crawls" for your initial tika index on duplicate environments is reduced to cache lookups. So the process would be * pipe iterator has the next document: \{lastUpdated,docID} * if parse cache is enabled and parse cache contains \{lastUpdated,docID} ** Get \{lastUpdated,docID} document from the cache and push to the emit queue and return. * Parse document * If parse cache is enabled, put into cache key=\{lastUpdated,docID}, value=\{document,metadata} This will drastically improve full crawl times for customers using services especially cloud file services with strict rate limits. The cache would need to be disk or network based storage because of the storage size. In-memory cache would not be feasible. The parser cache should be based on an interface so that the user can use several varieties of implementations such as: * File cache * S3 implementation cache * Others.. 
> tika pipes parse cache - avoid re-parsing content that has not changed > -- > > Key: TIKA-3835 > URL: https://issues.apache.org/jira/browse/TIKA-3835 > Project: Tika > Issue Type: New Feature > Components: tika-pipes >Affects Versions: 2.2.0 >Reporter: Nicholas DiPiazza >Priority: Major > > Tika pipes should have an optional configuration to archive parsed results. > These archived results can be returned in the case that the same exact > version of a document had already been parsed previously, pull the parsed > output from a "parse cache" instead of repeating the fetch+parse. > In other words, skip the fetch+parse if you did it previously. > Benefits of this: > * When the tika pipe fetcher is using a cloud service, documents are rate > limited heavily. So if you manage to get a document and parse it, storing it > for future use is very important. > * Multi tier environments can be populated faster. Example: You are pulling > data from an app in dev, staging and production. When you run the tika pipe > job, it will p
[jira] [Updated] (TIKA-3835) tika pipes parse cache - avoid re-parsing content that has not changed
[ https://issues.apache.org/jira/browse/TIKA-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas DiPiazza updated TIKA-3835: Description: Tika pipes should have an optional configuration to archive parsed results. These archived results can be returned in the case that the same exact version of a document had already been parsed previously, pull the parsed output from a "parse cache" instead of repeating the fetch+parse. In other words, skip the fetch+parse if you did it previously. Benefits of this: * When the tika pipe fetcher is using a cloud service, documents are rate limited heavily. So if you manage to get a document and parse it, storing it for future use is very important. * Multi tier environments can be populated faster. Example: You are pulling data from an app in dev, staging and production. When you run the tika pipe job, it will parse each document 1 time. All the other environments can now re-use the parsed output - saving days of run time (in my case). ** In other words, "full crawls" for your initial tika index on duplicate environments is reduced to cache lookups. So the process would be * pipe iterator has the next document: \{lastUpdated,docID} * if parse cache is enabled and parse cache contains \{lastUpdated,docID} ** Emit the document to the emit queue and return. * Parse document * If parse cache is enabled, put into cache key=\{lastUpdated,docID}, value=\{document,metadata} This will drastically improve full crawl times for customers using services, especially cloud file services with strict rate limits. The cache would need to be disk or network based storage because of the storage size. In-memory cache would not be feasible. The parser cache should be based on an interface so that the user can use several varieties of implementations such as: * File cache * S3 implementation cache * Others.. 
was: Tika pipes should have an optional configuration to archive parsed results. These archived results can be returned in the case that the same exact version of a document had already been parsed previously, pull the parsed output from a "parse cache" instead of repeating the fetch+parse. In other words, skip the fetch+parse if you did it previously. Benefits of this: * When the tika pipe fetcher is using a cloud service, documents are rate limited heavily. So if you manage to get a document and parse it, storing it for future use is very important. * Multi tier environments can be populated faster. Example: You are pulling data from an app in dev, staging and production. When you run the tika pipe job, it will parse each document 1 time. All the other environments can now re-use the parsed output - saving days of run time (in my case). ** In other words, "full crawls" for your initial tika index on duplicate environments is reduced to cache lookups. So the process would be * pipe iterator has the next document: \{lastUpdated,docID} * if parse cache is enabled and parse cache contains \{lastUpdated,docID} ** Emit the document to the emit queue and return. * Parse document * If parse cache is enabled, put into cache key=\{lastUpdated,docID}, value=\{document,metadata} This will drastically improve full crawl times for customers using services, especially cloud file services with strict rate limits. The cache would need to be disk or network based storage because of the storage size. In-memory cache would not be feasible. The parser cache should be based on an interface so that the user can use several varieties of implementations such as: * File cache * S3 implementation cache * Others.. 
> tika pipes parse cache - avoid re-parsing content that has not changed > -- > > Key: TIKA-3835 > URL: https://issues.apache.org/jira/browse/TIKA-3835 > Project: Tika > Issue Type: New Feature > Components: tika-pipes >Affects Versions: 2.2.0 >Reporter: Nicholas DiPiazza >Priority: Major > > Tika pipes should have an optional configuration to archive parsed results. > These archived results can be returned in the case that the same exact > version of a document had already been parsed previously, pull the parsed > output from a "parse cache" instead of repeating the fetch+parse. > In other words, skip the fetch+parse if you did it previously. > Benefits of this: > * When the tika pipe fetcher is using a cloud service, documents are rate > limited heavily. So if you manage to get a document and parse it, storing it > for future use is very important. > * Multi tier environments can be populated faster. Example: You are pulling > data from an app in dev, staging and production. When you run the tika pipe > job, it will parse each document 1 time. All the
[jira] [Updated] (TIKA-3835) tika pipes parse cache - avoid re-parsing content that has not changed
[ https://issues.apache.org/jira/browse/TIKA-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas DiPiazza updated TIKA-3835: Description: Tika pipes should have an optional configuration to archive parsed results. These archived results can be returned in the case that the same exact version of a document had already been parsed previously, pull the parsed output from a "parse cache" instead of repeating the fetch+parse. In other words, skip the fetch+parse if you did it previously. Benefits of this: * When the tika pipe fetcher is using a cloud service, documents are rate limited heavily. So if you manage to get a document and parse it, storing it for future use is very important. * Multi tier environments can be populated faster. Example: You are pulling data from an app in dev, staging and production. When you run the tika pipe job, it will parse each document 1 time. All the other environments can now re-use the parsed output - saving days of run time (in my case). ** In other words, "full crawls" for your initial tika index on duplicate environments is reduced to cache lookups. So the process would be * pipe iterator has the next document: \{lastUpdated,docID} * if parse cache is enabled and parse cache contains \{lastUpdated,docID} ** Emit the document to the emit queue and return. * Parse document * If parse cache is enabled, put into cache key=\{lastUpdated,docID}, value=\{document,metadata} This will drastically improve full crawl times for customers using services, especially cloud file services with strict rate limits. The cache would need to be disk or network based storage because of the storage size. In-memory cache would not be feasible. The parser cache should be based on an interface so that the user can use several varieties of implementations such as: * File cache * S3 implementation cache * Others.. was: Tika pipes should have an optional configuration to archive parsed results. 
These archived results can be returned in the case that the same exact version of a document had already been parsed previously, pull the parsed output from a "parse cache" instead of repeating the fetch+parse. In other words, skip the fetch+parse if you did it previously. Benefits of this: * When the tika pipe fetcher is using a cloud service, documents are rate limited heavily. So if you manage to get a document and parse it, storing it for future use is very important. * Multi tier environments can be populated faster. Example: You are pulling data from an app in dev, staging and production. When you run the tika pipe job, it will parse each document 1 time. All the other environments can now re-use the parsed output - saving days of run time (in my case). ** In other words, "full crawls" for your initial tika index on duplicate environments is reduced to cache lookups. So the process would be * pipe iterator has the next document: \{lastUpdated,docID} * if parse cache is enabled and parse cache contains \{lastUpdated,docID} ** Emit the document to the emit queue and return. * Parse document * If parse cache is enabled, put into cache key=\{lastUpdated,docID}, value=\{document,metadata} This will drastically improve full crawl times for customers using services, especially cloud file services with strict rate limits. The cache would need to be disk or network based storage because of the storage size. The parser cache should be based on an interface so that the user can use several varieties of implementations such as: * File cache * S3 implementation cache * Others.. > tika pipes parse cache - avoid re-parsing content that has not changed > -- > > Key: TIKA-3835 > URL: https://issues.apache.org/jira/browse/TIKA-3835 > Project: Tika > Issue Type: New Feature > Components: tika-pipes >Affects Versions: 2.2.0 >Reporter: Nicholas DiPiazza >Priority: Major > > Tika pipes should have an optional configuration to archive parsed results. 
> These archived results can be returned in the case that the same exact > version of a document had already been parsed previously, pull the parsed > output from a "parse cache" instead of repeating the fetch+parse. > In other words, skip the fetch+parse if you did it previously. > Benefits of this: > * When the tika pipe fetcher is using a cloud service, documents are rate > limited heavily. So if you manage to get a document and parse it, storing it > for future use is very important. > * Multi tier environments can be populated faster. Example: You are pulling > data from an app in dev, staging and production. When you run the tika pipe > job, it will parse each document 1 time. All the other environments can now > re-use the parsed output - saving days of run ti
[jira] [Updated] (TIKA-3835) tika pipes parse cache - avoid re-parsing content that has not changed
[ https://issues.apache.org/jira/browse/TIKA-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas DiPiazza updated TIKA-3835: Description: Tika pipes should have an optional configuration to archive parsed results. These archived results can be returned in the case that the same exact version of a document had already been parsed previously, pull the parsed output from a "parse cache" instead of repeating the fetch+parse. In other words, skip the fetch+parse if you did it previously. Benefits of this: * When the tika pipe fetcher is using a cloud service, documents are rate limited heavily. So if you manage to get a document and parse it, storing it for future use is very important. * Multi tier environments can be populated faster. Example: You are pulling data from an app in dev, staging and production. When you run the tika pipe job, it will parse each document 1 time. All the other environments can now re-use the parsed output - saving days of run time (in my case). ** In other words, "full crawls" for your initial tika index on duplicate environments is reduced to cache lookups. So the process would be * pipe iterator has the next document: \{lastUpdated,docID} * if parse cache is enabled and parse cache contains \{lastUpdated,docID} ** Emit the document to the emit queue and return. * Parse document * If parse cache is enabled, put into cache key=\{lastUpdated,docID}, value=\{document,metadata} This will drastically improve full crawl times for customers using services, especially cloud file services with strict rate limits. The cache would need to be disk or network based storage because of the storage size. The parser cache should be based on an interface so that the user can use several varieties of implementations such as: * File cache * S3 implementation cache * Others.. was: Tika pipes should have an optional configuration to archive parsed results. 
These archived results can be returned in the case that the same exact version of a document had already been parsed previously, pull the parsed output from a "parse cache" instead of repeating the fetch+parse. In other words, skip the fetch+parse if you did it previously. Benefits of this: * When the tika pipe fetcher is using a cloud service, documents are rate limited heavily. So if you manage to get a document and parse it, storing it for future use is very important. * Multi tier environments can be populated faster. Example: You are pulling data from an app in dev, staging and production. When you run the tika pipe job, it will parse each document 1 time. All the other environments can now re-use the parsed output - saving days of run time (in my case). ** In other words, "full crawls" for your initial tika index on duplicate environments is reduced to cache lookups. So the process would be * pipe iterator has the next document: \{lastUpdated,docID} * if parse cache is enabled and parse cache contains \{lastUpdated,docID} ** Emit the document to the emit queue and return. * Parse document * If parse cache is enabled, put into cache key=\{lastUpdated,docID}, value=\{document,metadata} This will drastically improve full crawl times for customers using services, especially cloud file services with strict rate limits. The parser cache should be based on an interface so that the user can use several varieties of implementations such as: * File cache * S3 implementation cache * Others.. > tika pipes parse cache - avoid re-parsing content that has not changed > -- > > Key: TIKA-3835 > URL: https://issues.apache.org/jira/browse/TIKA-3835 > Project: Tika > Issue Type: New Feature > Components: tika-pipes >Affects Versions: 2.2.0 >Reporter: Nicholas DiPiazza >Priority: Major > > Tika pipes should have an optional configuration to archive parsed results. 
> These archived results can be returned in the case that the same exact > version of a document had already been parsed previously, pull the parsed > output from a "parse cache" instead of repeating the fetch+parse. > In other words, skip the fetch+parse if you did it previously. > Benefits of this: > * When the tika pipe fetcher is using a cloud service, documents are rate > limited heavily. So if you manage to get a document and parse it, storing it > for future use is very important. > * Multi tier environments can be populated faster. Example: You are pulling > data from an app in dev, staging and production. When you run the tika pipe > job, it will parse each document 1 time. All the other environments can now > re-use the parsed output - saving days of run time (in my case). > ** In other words, "full crawls" for your initial tika index on duplicate > environments is reduced to cach
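The lookup/populate flow in the ticket description can be sketched as a small cache interface plus an in-memory implementation. Everything here (class and method names, the String stand-in for parsed output) is illustrative, not Tika API:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.HashMap;
import java.util.Map;

public class ParseCacheSketch {

    // Hypothetical interface, not Tika API: File- or S3-backed variants would
    // implement the same three operations.
    interface ParseCache {
        boolean contains(String docId, long lastUpdated);
        String get(String docId, long lastUpdated);
        void put(String docId, long lastUpdated, String parsedOutput);
    }

    // In-memory variant. Keying on {docID, lastUpdated} means a changed
    // document never matches a stale entry.
    static class InMemoryParseCache implements ParseCache {
        private final Map<Map.Entry<String, Long>, String> entries = new HashMap<>();

        public boolean contains(String docId, long lastUpdated) {
            return entries.containsKey(new SimpleEntry<>(docId, lastUpdated));
        }

        public String get(String docId, long lastUpdated) {
            return entries.get(new SimpleEntry<>(docId, lastUpdated));
        }

        public void put(String docId, long lastUpdated, String parsedOutput) {
            entries.put(new SimpleEntry<>(docId, lastUpdated), parsedOutput);
        }
    }

    // The loop from the ticket: emit straight from the cache on a hit,
    // otherwise fetch+parse and populate the cache for the next crawl.
    static String fetchParseOrCache(ParseCache cache, String docId, long lastUpdated) {
        if (cache != null && cache.contains(docId, lastUpdated)) {
            return cache.get(docId, lastUpdated); // cache hit: skip fetch+parse
        }
        String parsed = "parsed:" + docId; // stand-in for the real fetch+parse
        if (cache != null) {
            cache.put(docId, lastUpdated, parsed);
        }
        return parsed;
    }

    public static void main(String[] args) {
        ParseCache cache = new InMemoryParseCache();
        fetchParseOrCache(cache, "doc1", 100L);           // miss: parses and caches
        System.out.println(cache.contains("doc1", 100L)); // prints true
        System.out.println(cache.contains("doc1", 200L)); // prints false (new version)
    }
}
```

A File or S3 implementation would only swap the map for a directory layout or bucket keyed the same way, which is why the interface-based design in the ticket is attractive.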
[jira] [Comment Edited] (TIKA-3835) tika pipes parse cache - avoid re-parsing content that has not changed
[ https://issues.apache.org/jira/browse/TIKA-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17578583#comment-17578583 ] Nicholas DiPiazza edited comment on TIKA-3835 at 8/11/22 5:37 PM: -- Yes, good point. I didn't point out some important details; I attempted to add them to the bottom of the description. Did that help clarify some? was (Author: JIRAUSER294298): Yes, good point. I didn't point out some important details; I attempted to add them to the bottom of that. Did that help clarify some? > tika pipes parse cache - avoid re-parsing content that has not changed > -- > > Key: TIKA-3835 > URL: https://issues.apache.org/jira/browse/TIKA-3835 > Project: Tika > Issue Type: New Feature > Components: tika-pipes >Affects Versions: 2.2.0 >Reporter: Nicholas DiPiazza >Priority: Major > > Tika pipes should have an optional configuration to archive parsed results. > So the process would be > * pipe iterator has the next document: \{lastUpdated,docID} > * if parse cache is enabled and parse cache contains \{lastUpdated,docID} > ** Emit the document to the emit queue and return. > * Parse document > * If parse cache is enabled, put into cache key=\{lastUpdated,docID}, > value=\{document,metadata} > This will drastically improve full crawl times for customers using services, > especially cloud file services with strict rate limits. > The parse cache should be based on an interface so that the user can use > several implementations, such as: > * File cache > * S3 cache > * Others. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-3835) tika pipes parse cache - avoid re-parsing content that has not changed
[ https://issues.apache.org/jira/browse/TIKA-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas DiPiazza updated TIKA-3835: Description: Tika pipes should have an optional configuration to archive parsed results. So the process would be * pipe iterator has the next document: \{lastUpdated,docID} * if parse cache is enabled and parse cache contains \{lastUpdated,docID} ** Emit the document to the emit queue and return. * Parse document * If parse cache is enabled, put into cache key=\{lastUpdated,docID}, value=\{document,metadata} This will drastically improve full crawl times for customers using services, especially cloud file services with strict rate limits. The parse cache should be based on an interface so that the user can use several implementations, such as: * File cache * S3 cache * Others. was: Tika pipes should have an optional configuration to archive parsed results. So the process would be * pipe iterator has the next document: \{lastUpdated,docID} * if parse cache is enabled and parse cache contains \{lastUpdated,docID} ** Emit the document to the emit queue and return. * Parse document * If parse cache is enabled, put into cache key=\{lastUpdated,docID}, value=\{document,metadata} This will drastically improve full crawl times for customers using services, especially cloud file services with strict rate limits. > tika pipes parse cache - avoid re-parsing content that has not changed > -- > > Key: TIKA-3835 > URL: https://issues.apache.org/jira/browse/TIKA-3835 > Project: Tika > Issue Type: New Feature > Components: tika-pipes >Affects Versions: 2.2.0 >Reporter: Nicholas DiPiazza >Priority: Major > > Tika pipes should have an optional configuration to archive parsed results. 
> So the process would be > * pipe iterator has the next document: \{lastUpdated,docID} > * if parse cache is enabled and parse cache contains \{lastUpdated,docID} > ** Emit the document to the emit queue and return. > * Parse document > * If parse cache is enabled, put into cache key=\{lastUpdated,docID}, > value=\{document,metadata} > This will drastically improve full crawl times for customers using services, > especially cloud file services with strict rate limits. > The parse cache should be based on an interface so that the user can use > several implementations, such as: > * File cache > * S3 cache > * Others. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-3835) tika pipes parse cache - avoid re-parsing content that has not changed
[ https://issues.apache.org/jira/browse/TIKA-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas DiPiazza updated TIKA-3835: Summary: tika pipes parse cache - avoid re-parsing content that has not changed (was: parse cache - avoid re-parsing content that has not changed) > tika pipes parse cache - avoid re-parsing content that has not changed > -- > > Key: TIKA-3835 > URL: https://issues.apache.org/jira/browse/TIKA-3835 > Project: Tika > Issue Type: New Feature > Components: tika-pipes >Affects Versions: 2.2.0 >Reporter: Nicholas DiPiazza >Priority: Major > > Tika pipes should have an optional configuration to archive parsed results. > So the process would be > * pipe iterator has the next document: \{lastUpdated,docID} > * if parse cache is enabled and parse cache contains \{lastUpdated,docID} > ** Emit the document to the emit queue and return. > * Parse document > * If parse cache is enabled, put into cache key=\{lastUpdated,docID}, > value=\{document,metadata} > This will drastically improve full crawl times for customers using services, > especially cloud file services with strict rate limits. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-3835) parse cache - avoid re-parsing content that has not changed
Nicholas DiPiazza created TIKA-3835: --- Summary: parse cache - avoid re-parsing content that has not changed Key: TIKA-3835 URL: https://issues.apache.org/jira/browse/TIKA-3835 Project: Tika Issue Type: New Feature Components: tika-pipes Affects Versions: 2.2.0 Reporter: Nicholas DiPiazza Tika pipes should have an optional configuration to archive parsed results. So the process would be * pipe iterator has the next document: \{lastUpdated,docID} * if parse cache is enabled and parse cache contains \{lastUpdated,docID} ** Emit the document to the emit queue and return. * Parse document * If parse cache is enabled, put into cache key=\{lastUpdated,docID}, value=\{document,metadata} This will drastically improve full crawl times for customers using services, especially cloud file services with strict rate limits. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-3821) Pulsar Tika Pipes Support
Nicholas DiPiazza created TIKA-3821: --- Summary: Pulsar Tika Pipes Support Key: TIKA-3821 URL: https://issues.apache.org/jira/browse/TIKA-3821 Project: Tika Issue Type: New Feature Components: tika-pipes Affects Versions: 2.4.1 Reporter: Nicholas DiPiazza add kafka support to tika pipes: * kafka pipe iterator * kafka emitter -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-3821) Pulsar Tika Pipes Support
[ https://issues.apache.org/jira/browse/TIKA-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas DiPiazza updated TIKA-3821: Description: add pulsar support to tika pipes: * pulsar pipe iterator * pulsar emitter was: add kafka support to tika pipes: * kafka pipe iterator * kafka emitter > Pulsar Tika Pipes Support > - > > Key: TIKA-3821 > URL: https://issues.apache.org/jira/browse/TIKA-3821 > Project: Tika > Issue Type: New Feature > Components: tika-pipes >Affects Versions: 2.4.1 >Reporter: Nicholas DiPiazza >Priority: Major > > add pulsar support to tika pipes: > * pulsar pipe iterator > * pulsar emitter -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-3820) Kafka Tika Pipes Support
Nicholas DiPiazza created TIKA-3820: --- Summary: Kafka Tika Pipes Support Key: TIKA-3820 URL: https://issues.apache.org/jira/browse/TIKA-3820 Project: Tika Issue Type: New Feature Components: tika-pipes Affects Versions: 2.4.1 Reporter: Nicholas DiPiazza add kafka support to tika pipes: * kafka pipe iterator * kafka emitter -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (TIKA-3725) Add Authorization to Tika Server (Suggest Basic to start off with)
[ https://issues.apache.org/jira/browse/TIKA-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17526632#comment-17526632 ] Nicholas DiPiazza edited comment on TIKA-3725 at 4/22/22 7:03 PM: -- [~tallison] In my case I have a bunch of other deployments and statefulsets that are all using JWT across all inner-pod communication, so having the ability to be consistent with those would be nice. was (Author: ndipiazza_gmail): [~tallison] In my case I have a bunch of other deployments and statefulsets that are all using JWT to keep all inner-pod communication. So in my case having the ability to be consistent with those would be nice. > Add Authorization to Tika Server (Suggest Basic to start off with) > -- > > Key: TIKA-3725 > URL: https://issues.apache.org/jira/browse/TIKA-3725 > Project: Tika > Issue Type: New Feature > Components: tika-server >Affects Versions: 2.3.0 >Reporter: Dan Coldrick >Priority: Minor > > It would be good to get some Authentication/Authorization added to Tika Server > to be able to add another layer of security around the Tika Server REST > service. > This could become a rabbit hole with the number of options available around > Authentication/Authorization (OAuth, OpenID, etc.), so suggest basic auth is > added as a starter. > For storing user(s)/passwords, suggest looking at how other Apache products do > the same. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (TIKA-3725) Add Authorization to Tika Server (Suggest Basic to start off with)
[ https://issues.apache.org/jira/browse/TIKA-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17526557#comment-17526557 ] Nicholas DiPiazza commented on TIKA-3725: - I am a couple of weeks out from needing this too, and I'll need JWT auth. I can add it if someone hasn't already. > Add Authorization to Tika Server (Suggest Basic to start off with) > -- > > Key: TIKA-3725 > URL: https://issues.apache.org/jira/browse/TIKA-3725 > Project: Tika > Issue Type: New Feature > Components: tika-server >Affects Versions: 2.3.0 >Reporter: Dan Coldrick >Priority: Minor > > It would be good to get some Authentication/Authorization added to Tika Server > to be able to add another layer of security around the Tika Server REST > service. > This could become a rabbit hole with the number of options available around > Authentication/Authorization (OAuth, OpenID, etc.), so suggest basic auth is > added as a starter. > For storing user(s)/passwords, suggest looking at how other Apache products do > the same. -- This message was sent by Atlassian Jira (v8.20.7#820007)
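On the JWT option discussed in these comments: the core signature check needs nothing beyond the JDK. A minimal HS256 sketch follows; it is illustrative only (class and method names are not Tika API), and a production Tika Server filter would more likely use a JWT library and must also validate claims:

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Base64;

public class JwtHs256Sketch {

    private static byte[] hmac(String data, byte[] secret) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secret, "HmacSHA256"));
            return mac.doFinal(data.getBytes(StandardCharsets.US_ASCII));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Builds header.payload.signature with unpadded base64url, per RFC 7515.
    static String sign(String headerJson, String payloadJson, byte[] secret) {
        Base64.Encoder enc = Base64.getUrlEncoder().withoutPadding();
        String signingInput = enc.encodeToString(headerJson.getBytes(StandardCharsets.UTF_8))
                + "." + enc.encodeToString(payloadJson.getBytes(StandardCharsets.UTF_8));
        return signingInput + "." + enc.encodeToString(hmac(signingInput, secret));
    }

    // Signature check only; a real auth filter must also validate exp/iss/aud
    // claims and reject alg=none tokens.
    static boolean signatureValid(String jwt, byte[] secret) {
        String[] parts = jwt.split("\\.");
        if (parts.length != 3) {
            return false;
        }
        byte[] expected = hmac(parts[0] + "." + parts[1], secret);
        byte[] presented;
        try {
            presented = Base64.getUrlDecoder().decode(parts[2]);
        } catch (IllegalArgumentException e) {
            return false; // signature segment is not valid base64url
        }
        return MessageDigest.isEqual(expected, presented); // constant-time compare
    }

    public static void main(String[] args) {
        byte[] key = "change-me".getBytes(StandardCharsets.UTF_8);
        String jwt = sign("{\"alg\":\"HS256\",\"typ\":\"JWT\"}", "{\"sub\":\"tika\"}", key);
        System.out.println(signatureValid(jwt, key)); // prints true
    }
}
```

The appeal for the inner-pod case described above is that every service sharing the secret can mint and verify tokens without a round trip to an auth server.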
[jira] [Commented] (TIKA-3659) SMB/NFS support
[ https://issues.apache.org/jira/browse/TIKA-3659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17480514#comment-17480514 ] Nicholas DiPiazza commented on TIKA-3659: - I will need to add an `smbj` client for SMB2/3 and `jcifs` for legacy SMB1 shares soon for a project I'll be doing. I will probably have this done in the next few weeks. > SMB/NFS support > --- > > Key: TIKA-3659 > URL: https://issues.apache.org/jira/browse/TIKA-3659 > Project: Tika > Issue Type: Wish > Components: handler, parser >Affects Versions: 2.2.1 >Reporter: Michael >Priority: Minor > Labels: features > > as referenced on > [https://discuss.opendistrocommunity.dev/t/alternative-to-fscrawler-in-opensearch/7157/11] > please add tika-pipes support for SMB/NFS collections > > Thank you in advance! -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (TIKA-3446) OneNote - look into adding support for OneNote 365 documents
[ https://issues.apache.org/jira/browse/TIKA-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455997#comment-17455997 ] Nicholas DiPiazza commented on TIKA-3446: - [~tallison] Do I need to do anything to make sure this gets into both 1.x and 2.x? > OneNote - look into adding support for OneNote 365 documents > > > Key: TIKA-3446 > URL: https://issues.apache.org/jira/browse/TIKA-3446 > Project: Tika > Issue Type: New Feature > Components: parser >Affects Versions: 1.27 >Reporter: Nicholas DiPiazza >Assignee: Nicholas DiPiazza >Priority: Major > > While doing some parsing of OneNote documents, I was investigating a slew of > them that did not seem to parse very well. > When I did some digging, I found out that these documents were generated from > SharePoint Online. > I had hoped that OneNote documents generated from SharePoint Online would > just be the same as on-prem OneNote documents from 2016, 2019, etc. > But it turns out this is NOT the case. > I checked out the Microsoft specification MS-ONESTORE and found that the > documents do not match the specifications that are published. > Opened a community post: [Looking for the MS spec for OneNote 365 version - > Microsoft > Q&A|https://docs.microsoft.com/en-us/answers/questions/436943/looking-for-the-ms-spec-for-onenote-365-version-1.html] > And also opened an internal ticket with Microsoft. > They will be responding soon with an analysis of my issue, and we'll see if > there is anything we can do. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Comment Edited] (TIKA-3561) Tika throwing java.lang.OutOfMemoryError
[ https://issues.apache.org/jira/browse/TIKA-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421848#comment-17421848 ] Nicholas DiPiazza edited comment on TIKA-3561 at 9/29/21, 1:06 AM: --- Tika needs a lot of memory to parse a nested file like this, and time as well. In order to give it a chance, you need to extract the file first. Then I cranked up -Xmx20G and it took about 1.5 minutes to parse. It eventually dumped out this parsed output (out.json, compressed in the following tar ball): [^out.tar.gz] was (Author: ndipiazza_gmail): Tika needs a lot of memory to parse a nested file like this, and time as well. I cranked up -Xmx20G and it took about 1.5 minutes to parse. It eventually dumped out this parsed output (out.json, compressed in the following tar ball): [^out.tar.gz] > Tika throwing java.lang.OutOfMemoryError > > > Key: TIKA-3561 > URL: https://issues.apache.org/jira/browse/TIKA-3561 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 2.1.0 >Reporter: Abha >Priority: Major > Attachments: Item.zip, out.tar.gz > > > Getting a fatal exception when processing the attached document \{item.content; > sub doc name is item.xlsx}. 
> Below is the exception log:
> Caused by: java.lang.OutOfMemoryError: Java heap space
> at java.io.ByteArrayOutputStream.<init>(ByteArrayOutputStream.java:77)
> at org.apache.poi.util.IOUtils.toByteArray(IOUtils.java:177)
> at org.apache.poi.util.IOUtils.toByteArray(IOUtils.java:149)
> at org.apache.poi.openxml4j.util.ZipArchiveFakeEntry.<init>(ZipArchiveFakeEntry.java:47)
> at org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:53)
> at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:106)
> at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:307)
> at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
> at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:113)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
-- This message was sent by Atlassian Jira (v8.3.4#803005)
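The frames above show each zip entry being copied into a single byte[] (ZipInputStreamZipEntrySource → ZipArchiveFakeEntry → IOUtils.toByteArray) because the package was opened from an InputStream, which is why heap usage tracks the size of the largest embedded part; extracting the file first, as the comment suggests, sidesteps that path. As a stdlib-only illustration (hypothetical class, not POI or Tika code) of the bounded-buffer streaming alternative:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipStreamSketch {

    // Reads every entry of a zip through a fixed 8 KB buffer, so memory use is
    // bounded no matter how large an embedded part is. Returns total bytes read.
    static long streamedSize(byte[] zipBytes) {
        long total = 0;
        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            byte[] buf = new byte[8192];
            while (zin.getNextEntry() != null) {
                int n;
                while ((n = zin.read(buf)) != -1) {
                    total += n; // process incrementally instead of one huge byte[]
                }
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return total;
    }

    // Builds a small in-memory zip with one 100,000-byte entry as a stand-in
    // for a large embedded document part.
    static byte[] sampleZip() {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            zos.putNextEntry(new ZipEntry("item.xml"));
            zos.write(new byte[100_000]);
            zos.closeEntry();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(streamedSize(sampleZip())); // prints 100000
    }
}
```

Random-access formats like OOXML cannot always be consumed this way (parts may be referenced out of order), which is why buffering to a temp file, rather than pure streaming, is the usual workaround.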