[jira] [Created] (TIKA-4260) Add parse context to the fetcher interface in 3.x

2024-05-23 Thread Tim Allison (Jira)
Tim Allison created TIKA-4260:
-

 Summary: Add parse context to the fetcher interface in 3.x
 Key: TIKA-4260
 URL: https://issues.apache.org/jira/browse/TIKA-4260
 Project: Tika
  Issue Type: Task
Reporter: Tim Allison






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (TIKA-4259) Decouple xml parser stuff from ParseContext

2024-05-23 Thread Tim Allison (Jira)
Tim Allison created TIKA-4259:
-

 Summary: Decouple xml parser stuff from ParseContext
 Key: TIKA-4259
 URL: https://issues.apache.org/jira/browse/TIKA-4259
 Project: Tika
  Issue Type: Task
Reporter: Tim Allison


ParseContext has some XMLReader convenience methods. We should move those to 
XMLReaderUtils in 3.x to simplify ParseContext's API.
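As a rough illustration of the kind of convenience method that could move out of a context object into a static utility class, here is a sketch of a secure SAX-parser factory helper. The class and method names are illustrative, not Tika's actual XMLReaderUtils API:

```java
// Hypothetical sketch: a static XML utility of the sort that could absorb
// ParseContext's XML convenience methods. Names are illustrative only.
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

public class XmlUtilsSketch {

    // As a static utility, this no longer drags XML concerns into a
    // context object's API surface.
    public static SAXParser newSecureSAXParser() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        // Harden against XXE: enable secure processing and disallow DTDs.
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        return factory.newSAXParser();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(newSecureSAXParser() != null);
    }
}
```

Callers would then invoke the static helper directly instead of going through a context instance.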





[jira] [Commented] (TIKA-4243) tika configuration overhaul

2024-05-23 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849114#comment-17849114
 ] 

Tim Allison commented on TIKA-4243:
---

I'm going to start working on PRs that will be generally helpful for the above, 
and they'll still be useful even if we choose a different direction. I'll hold 
off on the core work for a bit in case there are objections or better ways 
forward.

> tika configuration overhaul
> ---
>
> Key: TIKA-4243
> URL: https://issues.apache.org/jira/browse/TIKA-4243
> Project: Tika
>  Issue Type: New Feature
>  Components: config
>Affects Versions: 3.0.0
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> In 3.0.0, when dealing with Tika, it would greatly help to have a typed 
> configuration schema. 
> In 3.x can we remove the old way of doing configs and replace it with JSON 
> Schema?
> JSON Schema can be converted to POJOs using a Maven plugin: 
> [https://github.com/joelittlejohn/jsonschema2pojo]
> This automatically creates a Java POJO model we can use for the configs. 
> It would also allow the legacy tika-config XML to be read and converted to 
> the new POJOs easily using an XML mapper, so that users don't have to use 
> JSON configurations if they do not want to.
> When complete, configurations could be set as XML, JSON or YAML:
> tika-config.xml
> tika-config.json
> tika-config.yaml
> Replace all instances of tika config annotations that used the old syntax 
> with the POJO model deserialized from the XML/JSON/YAML.
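The typed-configuration idea quoted above can be sketched as a plain POJO of the sort a generator like jsonschema2pojo would emit, which an XML/JSON/YAML mapper could then populate. The class and field names below are hypothetical, not Tika's real config model:

```java
// Illustrative typed-config POJO. A tool like jsonschema2pojo could generate
// classes of this shape from a JSON Schema; a mapper (e.g. Jackson) could then
// bind tika-config.xml/.json/.yaml onto it. Names are hypothetical.
public class ParserConfigPojo {
    private int maxStringLength = 100_000;   // illustrative default
    private boolean extractEmbedded = true;

    public int getMaxStringLength() { return maxStringLength; }
    public void setMaxStringLength(int v) { maxStringLength = v; }
    public boolean isExtractEmbedded() { return extractEmbedded; }
    public void setExtractEmbedded(boolean v) { extractEmbedded = v; }

    public static void main(String[] args) {
        // A mapper would do this binding from a config file; shown by hand here.
        ParserConfigPojo cfg = new ParserConfigPojo();
        cfg.setMaxStringLength(50_000);
        System.out.println(cfg.getMaxStringLength() + " " + cfg.isExtractEmbedded());
    }
}
```

The point of the POJO layer is that XML, JSON, and YAML front ends all converge on the same typed object, so downstream code never sees the serialization format.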





[jira] [Commented] (TIKA-4243) tika configuration overhaul

2024-05-23 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849108#comment-17849108
 ] 

Tim Allison commented on TIKA-4243:
---

The downsides we see:
a) if there's agreement to add jackson-annotations to tika-core, we add a few 
kb to tika-core
b) we're at risk of having jackson annotations sprinkled throughout our 
codebase on the XConfig classes, but this is basically where we have our own 
@Field annotations now. So break even?
c) customized classes that need to be passed via the ParseContext will need to 
be serializable to be used in tika-server, tika-pipes... etc., anything that 
allows for configuration.
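Point (c) can be sketched with plain Java serialization standing in for whatever wire format ends up being chosen; the {{AuthToken}} class below is hypothetical, not a Tika class:

```java
// Sketch of downside (c): anything placed in a ParseContext must survive a
// serialization round trip to reach tika-server/tika-pipes. Java serialization
// stands in here for the real mechanism (e.g. Jackson); AuthToken is made up.
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class AuthToken implements Serializable {
    private static final long serialVersionUID = 1L;
    private final String token;

    public AuthToken(String token) { this.token = token; }
    public String getToken() { return token; }

    public static void main(String[] args) throws Exception {
        // Round-trip the object the way a server/pipes boundary would.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(bos);
        oos.writeObject(new AuthToken("abc123"));
        oos.flush();
        AuthToken back = (AuthToken) new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray())).readObject();
        System.out.println(back.getToken());
    }
}
```

Any user class that cannot make this round trip would be usable in-process but not through tika-server or tika-pipes.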






[jira] [Commented] (TIKA-4243) tika configuration overhaul

2024-05-23 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849103#comment-17849103
 ] 

Tim Allison commented on TIKA-4243:
---

Proposed basic roadmap:

Serialize ParseContext as is...
Allow for serialization of the current XConfigs, e.g. PDFParserConfig, etc.
Add creation of parsers with, e.g., new PDFParser(ParseContext context).
Wire the config work into tika-server, tika-pipes and tika-app.
Merge tika-grpc-server with the new config options.

This would require serialization support for any classes that users want to be 
able to configure.

This would allow us to get rid of all of our custom serialization code for 
Tika 4.x.
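The "create parsers from a ParseContext" step might look something like the following sketch. The {{Context}}, {{PdfConfig}}, and {{PdfParserSketch}} classes are stand-ins for illustration, not Tika's real types:

```java
// Hedged sketch of constructing a parser from a context, as proposed above.
// All names here are hypothetical stand-ins for Tika classes.
import java.util.HashMap;
import java.util.Map;

public class ParserFromContext {

    // Minimal stand-in for ParseContext: a class-keyed bag of config objects.
    static class Context {
        private final Map<Class<?>, Object> bag = new HashMap<>();
        <T> void set(Class<T> clazz, T value) { bag.put(clazz, value); }
        <T> T get(Class<T> clazz, T defaultValue) {
            Object v = bag.get(clazz);
            return v == null ? defaultValue : clazz.cast(v);
        }
    }

    record PdfConfig(boolean extractInlineImages) {}

    static class PdfParserSketch {
        private final PdfConfig config;
        // The parser pulls its config from the context at construction time,
        // instead of being configured through setters after the fact.
        PdfParserSketch(Context context) {
            this.config = context.get(PdfConfig.class, new PdfConfig(false));
        }
        boolean extractInlineImages() { return config.extractInlineImages(); }
    }

    public static void main(String[] args) {
        Context ctx = new Context();
        ctx.set(PdfConfig.class, new PdfConfig(true));
        System.out.println(new PdfParserSketch(ctx).extractInlineImages());
    }
}
```

A deserialized ParseContext could then drive both initialization and per-parse behavior through the same mechanism.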







[jira] [Commented] (TIKA-4243) tika configuration overhaul

2024-05-23 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849101#comment-17849101
 ] 

Tim Allison commented on TIKA-4243:
---

Fellow devs, in chatting with Nicholas, we're thinking that it would be useful 
for a number of use cases to overhaul the configuration in Tika 3.x. We'd leave 
the legacy behavior in, obviously!

To move forward, we're thinking about using ParseContext for both 
initialization and per-parse control in tika-server, tika-pipes and probably 
tika-app.

To do this, serializing ParseContext is really important. Are we ok with adding 
jackson-annotations to tika-core? We wouldn't add any other Jackson dependency 
to tika-core!

Alternatively, we could probably write wrappers needed for tika-core objects 
and put those wrappers in tika-serialization.







[jira] [Resolved] (TIKA-4258) Multi-arch support for docker images

2024-05-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-4258.
---
Resolution: Fixed

Just pushed 2.9.2.1/*-latest

Thank you, all!

> Multi-arch support for docker images
> 
>
> Key: TIKA-4258
> URL: https://issues.apache.org/jira/browse/TIKA-4258
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> This is a post-PR ticket to cover the work on: 
> https://github.com/apache/tika-docker/pull/19
> Related: https://issues.apache.org/jira/browse/INFRA-25803 





[jira] [Commented] (TIKA-4255) TextAndCSVParser ignores Metadata.CONTENT_ENCODING

2024-05-20 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17847980#comment-17847980
 ] 

Tim Allison commented on TIKA-4255:
---

Thank you for opening this PR. Are you able to add a small unit test to confirm 
the behavior?

I can't tell from the above whether you're setting 
{{CONTENT_TYPE_USER_OVERRIDE}} or {{CONTENT_TYPE}} and {{CONTENT_ENCODING}}.

It looks like the code is trying to pull the encoding from the 
{{CONTENT_TYPE_USER_OVERRIDE}}. 
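The distinction in question can be sketched with a plain map standing in for Tika's Metadata object: a user content-type override short-circuits charset detection, while a mere content-type/encoding hint may not. Key names and logic below are illustrative, not Tika's actual implementation:

```java
// Stand-in sketch (plain Map instead of Tika's Metadata) of the
// override-vs-hint distinction discussed above. Keys are illustrative.
import java.util.HashMap;
import java.util.Map;

public class CharsetPick {

    static String pickCharset(Map<String, String> metadata, String detected) {
        String override = metadata.get("content-type-user-override");
        if (override != null && override.contains("charset=")) {
            // A user override wins outright: take its charset parameter.
            return override.substring(override.indexOf("charset=") + 8);
        }
        // Without the override, detection wins (the behavior the issue reports).
        return detected;
    }

    public static void main(String[] args) {
        Map<String, String> withOverride = new HashMap<>();
        withOverride.put("content-type-user-override", "text/plain; charset=UTF-8");
        System.out.println(pickCharset(withOverride, "IBM424_rtl"));
        System.out.println(pickCharset(new HashMap<>(), "IBM424_rtl"));
    }
}
```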

> TextAndCSVParser ignores Metadata.CONTENT_ENCODING
> --
>
> Key: TIKA-4255
> URL: https://issues.apache.org/jira/browse/TIKA-4255
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.6.0, 3.0.0-BETA, 2.9.2
>Reporter: Axel Dörfler
>Priority: Major
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> I pass a text to the auto-detect parser that just contains the text "ETL". I 
> pass content-type and content-encoding information via Metadata.
> However, TextAndCSVParser ignores the provided encoding (since CSVParams was 
> not provided via TikaCoreProperties.CONTENT_TYPE_USER_OVERRIDE) and chooses 
> to detect it by itself. It turns out it detects an IBM424 Hebrew charset and 
> uses it, which results in rather surprising output.
> Tested with the versions mentioned above, though the bug is likely much older.





[jira] [Resolved] (TIKA-4256) Allow inlining of ocr'd text in container document

2024-05-20 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-4256.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

> Allow inlining of ocr'd text in container document
> --
>
> Key: TIKA-4256
> URL: https://issues.apache.org/jira/browse/TIKA-4256
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
> Fix For: 3.0.0
>
>
> For legacy tika, we're inlining all content from embedded files including ocr 
> content of embedded images.
> However, for the RecursiveParserWrapper, /rmeta , -J option, users have to 
> stitch inlined image ocr text back into the container file's content.
> For example, if a docx has an image in it and tesseract is invoked, the 
> structure will notionally be:
> [
>   { "type":"docx", "content": "main content of the file"}
>   { "type":"jpeg", "content": "ocr'd content", "embeddedType":"INLINE"}
> ]
> It would be useful to allow an option to inline the extracted text in the 
> parent document. I think we want to keep the embedded inline object so that 
> we don't lose metadata from it. So I propose this kind of output:
> [
>   { "type":"docx", "content": "main content of the file <div type=\"ocr\">ocr'd content</div>"}
>   { "type":"jpeg", "content": "ocr'd content", "embeddedType":"INLINE"}
> ]
> This proposal includes the ocr'd content marked by <div type=\"ocr\"> in the 
> container file, and it includes the ocr'd text in the embedded image.
> For now this proposal does not include inlining ocr'd text from thumbnails. 
> We can do that on a later ticket if desired.
> This will allow a more intuitive search for non-file forensics users and will 
> be more similar to what we're doing with rendering a page -> ocr in PDFs when 
> that is configured.
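The stitching that users currently have to do with /rmeta output can be sketched as follows: append each INLINE embedded document's OCR text to the container's content. The record fields mirror the JSON keys shown above; the classes themselves are stand-ins, not Tika's API:

```java
// Hedged sketch of stitching inline-image OCR text back into the container's
// content, as users of /rmeta / -J currently must. Doc is a stand-in record
// mirroring the JSON keys above ("type", "content", "embeddedType").
import java.util.List;

public class StitchOcr {

    record Doc(String type, String content, String embeddedType) {}

    static String stitch(List<Doc> docs) {
        // docs.get(0) is the container; the rest are embedded documents.
        StringBuilder sb = new StringBuilder(docs.get(0).content());
        for (Doc d : docs.subList(1, docs.size())) {
            if ("INLINE".equals(d.embeddedType())) {
                sb.append(' ').append(d.content());
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Doc> docs = List.of(
                new Doc("docx", "main content of the file", null),
                new Doc("jpeg", "ocr'd content", "INLINE"));
        System.out.println(stitch(docs));
    }
}
```

The proposal above would make this client-side stitching unnecessary for the common case.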





[jira] [Commented] (TIKA-4258) Multi-arch support for docker images

2024-05-20 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17847950#comment-17847950
 ] 

Tim Allison commented on TIKA-4258:
---

I'm sure I'll need to modify the PR when I actually go to run it, but it 
shouldn't be much different. I'll also update the "how to release" notes.






[jira] [Commented] (TIKA-4258) Multi-arch support for docker images

2024-05-20 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17847949#comment-17847949
 ] 

Tim Allison commented on TIKA-4258:
---

Let's give it a day for fellow devs to weigh in. If there are no objections, 
I'll make the multi-arch release of 2.9.2.1 and 'latest' in ~24 hours.






[jira] [Commented] (TIKA-4258) Multi-arch support for docker images

2024-05-20 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17847943#comment-17847943
 ] 

Tim Allison commented on TIKA-4258:
---

And here's the full version: 
https://hub.docker.com/layers/apache/tika/2.9.2-alpha-multi-arch-full/images/sha256-70ca1efb4686145feb88033ec9db441dd89f6ee126271d872354767d33af7ff9?context=explore







[jira] [Commented] (TIKA-4243) tika configuration overhaul

2024-05-20 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17847931#comment-17847931
 ] 

Tim Allison commented on TIKA-4243:
---

Separately, but related to this and also to TIKA-4252 -- should we allow for 
the serialization of ParseContext?

That would be the more natural way to set per-parse settings. That would also 
allow us to pass in an object that a fetcher could use for authentication, and 
we'd keep the Metadata object for, well, Metadata... in TIKA-4252. :D






[jira] [Commented] (TIKA-4258) Multi-arch support for docker images

2024-05-20 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17847883#comment-17847883
 ] 

Tim Allison commented on TIKA-4258:
---

Helpful links from #infra:

https://infra.apache.org/docker-hub-policy.html
https://infra.apache.org/github-actions-secrets.html






[jira] [Commented] (TIKA-4258) Multi-arch support for docker images

2024-05-20 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17847882#comment-17847882
 ] 

Tim Allison commented on TIKA-4258:
---

If fellow devs with better knowledge of github actions and docker hub want to 
jump in, please do! [~lewismc] [~davemeikle]?






[jira] [Created] (TIKA-4258) Multi-arch support for docker images

2024-05-20 Thread Tim Allison (Jira)
Tim Allison created TIKA-4258:
-

 Summary: Multi-arch support for docker images
 Key: TIKA-4258
 URL: https://issues.apache.org/jira/browse/TIKA-4258
 Project: Tika
  Issue Type: Task
Reporter: Tim Allison


This is a post-PR ticket to cover the work on: 
https://github.com/apache/tika-docker/pull/19

Related: https://issues.apache.org/jira/browse/INFRA-25803 





[jira] [Updated] (TIKA-4256) Allow inlining of ocr'd text in container document

2024-05-16 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-4256:
--
Description: 
For legacy tika, we're inlining all content from embedded files including ocr 
content of embedded images.

However, for the RecursiveParserWrapper, /rmeta , -J option, users have to 
stitch inlined image ocr text back into the container file's content.

For example, if a docx has an image in it and tesseract is invoked, the 
structure will notionally be:
[
  { "type":"docx", "content": "main content of the file"}
  { "type":"jpeg", "content": "ocr'd content", "embeddedType":"INLINE"}
]

It would be useful to allow an option to inline the extracted text in the 
parent document. I think we want to keep the embedded inline object so that we 
don't lose metadata from it. So I propose this kind of output:

[
  { "type":"docx", "content": "main content of the file <div type=\"ocr\">ocr'd content</div>"}
  { "type":"jpeg", "content": "ocr'd content", "embeddedType":"INLINE"}
]

This proposal includes the ocr'd content marked by <div type=\"ocr\"> in the 
container file, and it includes the ocr'd text in the embedded image.

For now this proposal does not include inlining ocr'd text from thumbnails. We 
can do that on a later ticket if desired.

This will allow a more intuitive search for non-file forensics users and will 
be more similar to what we're doing with rendering a page -> ocr in PDFs when 
that is configured.



  was:
For legacy tika, we're inlining all content from embedded files including ocr 
content of embedded images.

However, for the RecursiveParserWrapper, /rmeta , -J option, users have to 
stitch inlined image ocr text back into the container file's content.

For example, if a docx has an image in it and tesseract is invoked, the 
structure will notionally be:
[
  { "type":"docx", "content": "main content of the file"}
  { "type":"jpeg", "content": "ocr'd content", "embeddedType":"INLINE"}
]

It would be useful to allow an option to inline the extracted text in the 
parent document. I think we want to keep the embedded inline object so that we 
don't lose metadata from it. So I propose this kind of output:

[
  { "type":"docx", "content": "main content of the file <div type=\"ocr\">ocr'd content</div>"}
  { "type":"jpeg", "content": "ocr'd content", "embeddedType":"INLINE"}
]

This proposal includes the ocr'd content marked by <div type=\"ocr\"> in the 
container file, and it includes the ocr'd text in the embedded image.

This will allow a more intuitive search for non-file forensics users and will 
be more similar to what we're doing with rendering a page -> ocr in PDFs when 
that is configured.









[jira] [Updated] (TIKA-4256) Allow inlining of ocr'd text in container document

2024-05-16 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-4256:
--
Description: 
For legacy tika, we're inlining all content from embedded files including ocr 
content of embedded images.

However, for the RecursiveParserWrapper, /rmeta , -J option, users have to 
stitch inlined image ocr text back into the container file's content.

For example, if a docx has an image in it and tesseract is invoked, the 
structure will notionally be:
[
  { "type":"docx", "content": "main content of the file"}
  { "type":"jpeg", "content": "ocr'd content", "embeddedType":"INLINE"}
]

It would be useful to allow an option to inline the extracted text in the 
parent document. I think we want to keep the embedded inline object so that we 
don't lose metadata from it. So I propose this kind of output:

[
  { "type":"docx", "content": "main content of the file <div type=\"ocr\">ocr'd content</div>"}
  { "type":"jpeg", "content": "ocr'd content", "embeddedType":"INLINE"}
]

This proposal includes the ocr'd content marked by <div type=\"ocr\"> in the 
container file, and it includes the ocr'd text in the embedded image.

This will allow a more intuitive search for non-file forensics users and will 
be more similar to what we're doing with rendering a page -> ocr in PDFs when 
that is configured.



  was:
For legacy tika, we're inlining all content from embedded files including ocr 
content of embedded images.

However, for the RecursiveParserWrapper, /rmeta , -J option, users have to 
stitch inlined image ocr text back into the container file's content.

For example, if a docx has an image in it and tesseract is invoked, the 
structure will notionally be:
[
  { "type":"docx", "content": "main content of the file"}
  { "type":"jpeg", "content": "ocr'd content", "embeddedType":"INLINE"}
]

It would be useful to allow an option to inline the extracted text in the 
parent document. I think we want to keep the embedded inline object so that we 
don't lose metadata from it. So I propose this kind of output:

[
  { "type":"docx", "content": "main content of the file <div type=\"ocr\">ocr'd content</div>"}
  { "type":"jpeg", "content": "ocr'd content", "embeddedType":"INLINE"}
]

This will allow a more intuitive search for non-file forensics users and will 
be more similar to what we're doing with rendering a page -> ocr in PDFs when 
that is configured.







[jira] [Created] (TIKA-4256) Allow inlining of ocr'd text in container document

2024-05-16 Thread Tim Allison (Jira)
Tim Allison created TIKA-4256:
-

 Summary: Allow inlining of ocr'd text in container document
 Key: TIKA-4256
 URL: https://issues.apache.org/jira/browse/TIKA-4256
 Project: Tika
  Issue Type: Task
Reporter: Tim Allison


For legacy tika, we're inlining all content from embedded files including ocr 
content of embedded images.

However, for the RecursiveParserWrapper, /rmeta , -J option, users have to 
stitch inlined image ocr text back into the container file's content.

For example, if a docx has an image in it and tesseract is invoked, the 
structure will notionally be:
[
  { "type":"docx", "content": "main content of the file"}
  { "type":"jpeg", "content": "ocr'd content", "embeddedType":"INLINE"}
]

It would be useful to allow an option to inline the extracted text in the 
parent document. I think we want to keep the embedded inline object so that we 
don't lose metadata from it. So I propose this kind of output:

[
  { "type":"docx", "content": "main content of the file <div type=\"ocr\">ocr'd content</div>"}
  { "type":"jpeg", "content": "ocr'd content", "embeddedType":"INLINE"}
]

This will allow a more intuitive search for non-file forensics users and will 
be more similar to what we're doing with rendering a page -> ocr in PDFs when 
that is configured.





[jira] [Commented] (TIKA-4137) Building current Tika main branch fails under Java 20/21

2024-05-15 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846697#comment-17846697
 ] 

Tim Allison commented on TIKA-4137:
---

Y, done just now.

> Building current Tika main branch fails under Java 20/21
> 
>
> Key: TIKA-4137
> URL: https://issues.apache.org/jira/browse/TIKA-4137
> Project: Tika
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.0.0-BETA
>Reporter: Thorsten Heit
>Priority: Major
> Fix For: 3.0.0-BETA, 2.9.3
>
> Attachments: org.apache.tika.server.core.StackTraceOffTest.txt, 
> org.apache.tika.server.core.StackTraceTest.txt, 
> org.apache.tika.server.core.TikaResourceFetcherTest.txt, 
> org.apache.tika.server.core.TikaResourceTest.txt
>
>
> When I execute "mvn verify" on the current main branch using Java 11 or Java 
> 17, the build completes. With Java 20 and 21, the same command fails because 
> a couple of JUnit tests in tika-server-core now fail:
> {noformat}
> (...)
> [INFO] Running org.apache.tika.server.core.StackTraceTest
> [ERROR] Tests run: 4, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.034 
> s <<< FAILURE! -- in org.apache.tika.server.core.StackTraceTest
> [ERROR] org.apache.tika.server.core.StackTraceTest.testEmptyParser -- Time 
> elapsed: 0.007 s <<< FAILURE!
> org.opentest4j.AssertionFailedError: bad type: /tika ==> expected: <200> but 
> was: <500>
>   at 
> org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
>   at 
> org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
>   at 
> org.junit.jupiter.api.AssertEquals.failNotEqual(AssertEquals.java:197)
>   at 
> org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:150)
>   at org.junit.jupiter.api.Assertions.assertEquals(Assertions.java:559)
>   at 
> org.apache.tika.server.core.StackTraceTest.testEmptyParser(StackTraceTest.java:132)
>   at java.base/java.lang.reflect.Method.invoke(Method.java:580)
>   at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
>   at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
> WARN  [main] 21:28:26,651 org.apache.tika.pipes.PipesServer received -1 from 
> client; shutting down
> ERROR [main] 21:28:26,652 org.apache.tika.pipes.PipesServer exiting: 1
> [INFO] 
> [INFO] Results:
> [INFO] 
> [ERROR] Failures: 
> [ERROR]   StackTraceOffTest.testEmptyParser:137 bad type: /tika ==> expected: 
> <200> but was: <500>
> [ERROR]   StackTraceTest.testEmptyParser:132 bad type: /tika ==> expected: 
> <200> but was: <500>
> [ERROR]   
> TikaResourceFetcherTest.testHeader:101->CXFTestBase.assertContains:66 hello 
> world not found in:
>  xmlns="http://www.w3.org/1999/xhtml;>
> 
> 
> 
> 
> 
>  content="org.apache.tika.parser.DefaultParser"/>
> 
>  content="org.apache.tika.parser.mock.MockParser"/>
> 
> 
> 
> 
> ==> expected:  but was: 
> [ERROR]   
> TikaResourceFetcherTest.testQueryPart:109->CXFTestBase.assertContains:66 
> hello world not found in:
>  xmlns="http://www.w3.org/1999/xhtml;>
> 
> 
> 
> 
> 
>  content="org.apache.tika.parser.DefaultParser"/>
> 
>  content="org.apache.tika.parser.mock.MockParser"/>
> 
> 
> 
> 
> ==> expected:  but was: 
> [ERROR]   TikaResourceTest.testHeaders:91->CXFTestBase.assertContains:66 
>  not found in:
>  xmlns="http://www.w3.org/1999/xhtml;>
> 
> 
> 
> 
> 
>  content="org.apache.tika.parser.DefaultParser"/>
> 
>  content="org.apache.tika.parser.mock.MockParser"/>
> 
> 
> 
>  content="R5FG5V2U44YXOZTMKGVNTTSPGLF2JH ==> expected:  but was: 
> [ERROR]   
> TikaResourceTest.testNoWriteLimitOnStreamingWrite:187->CXFTestBase.assertContains:66
>  separation. not found in:
> http://www.w3.org/1999/xhtml;>
> 
> 
>  content="org.apache.tika.parser.DefaultParser"/>
>  content="org.apache.tika.parser.mock.MockParser"/>
> 
>  content="AQWEMUMSJVFZWYGM4TKXRTQ5Q436X4DN"/>
> 
> 

[jira] [Updated] (TIKA-4137) Building current Tika main branch fails under Java 20/21

2024-05-15 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-4137:
--
Fix Version/s: 2.9.3

> Building current Tika main branch fails under Java 20/21
> 
>
> Key: TIKA-4137
> URL: https://issues.apache.org/jira/browse/TIKA-4137
> Project: Tika
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.0.0-BETA
>Reporter: Thorsten Heit
>Priority: Major
> Fix For: 3.0.0-BETA, 2.9.3
>
> Attachments: org.apache.tika.server.core.StackTraceOffTest.txt, 
> org.apache.tika.server.core.StackTraceTest.txt, 
> org.apache.tika.server.core.TikaResourceFetcherTest.txt, 
> org.apache.tika.server.core.TikaResourceTest.txt
>
>
> When I execute "mvn verify" on the current main branch using Java 11 or Java 
> 17, the build completes. With Java 20 and 21 the same command fails because 
> now a couple of JUnit tests in tika-server-core fail:
> {noformat}
> (...)
> [INFO] Running org.apache.tika.server.core.StackTraceTest
> [ERROR] Tests run: 4, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.034 
> s <<< FAILURE! -- in org.apache.tika.server.core.StackTraceTest
> [ERROR] org.apache.tika.server.core.StackTraceTest.testEmptyParser -- Time 
> elapsed: 0.007 s <<< FAILURE!
> org.opentest4j.AssertionFailedError: bad type: /tika ==> expected: <200> but 
> was: <500>
>   at 
> org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
>   at 
> org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
>   at 
> org.junit.jupiter.api.AssertEquals.failNotEqual(AssertEquals.java:197)
>   at 
> org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:150)
>   at org.junit.jupiter.api.Assertions.assertEquals(Assertions.java:559)
>   at 
> org.apache.tika.server.core.StackTraceTest.testEmptyParser(StackTraceTest.java:132)
>   at java.base/java.lang.reflect.Method.invoke(Method.java:580)
>   at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
>   at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
> WARN  [main] 21:28:26,651 org.apache.tika.pipes.PipesServer received -1 from 
> client; shutting down
> ERROR [main] 21:28:26,652 org.apache.tika.pipes.PipesServer exiting: 1
> [INFO] 
> [INFO] Results:
> [INFO] 
> [ERROR] Failures: 
> [ERROR]   StackTraceOffTest.testEmptyParser:137 bad type: /tika ==> expected: 
> <200> but was: <500>
> [ERROR]   StackTraceTest.testEmptyParser:132 bad type: /tika ==> expected: 
> <200> but was: <500>
> [ERROR]   
> TikaResourceFetcherTest.testHeader:101->CXFTestBase.assertContains:66 hello 
> world not found in:
> <html xmlns="http://www.w3.org/1999/xhtml">
> 
> 
> 
> 
> 
>  content="org.apache.tika.parser.DefaultParser"/>
> 
>  content="org.apache.tika.parser.mock.MockParser"/>
> 
> 
> 
> 
> ==> expected:  but was: 
> [ERROR]   
> TikaResourceFetcherTest.testQueryPart:109->CXFTestBase.assertContains:66 
> hello world not found in:
> <html xmlns="http://www.w3.org/1999/xhtml">
> 
> 
> 
> 
> 
>  content="org.apache.tika.parser.DefaultParser"/>
> 
>  content="org.apache.tika.parser.mock.MockParser"/>
> 
> 
> 
> 
> ==> expected:  but was: 
> [ERROR]   TikaResourceTest.testHeaders:91->CXFTestBase.assertContains:66 
>  not found in:
> <html xmlns="http://www.w3.org/1999/xhtml">
> 
> 
> 
> 
> 
>  content="org.apache.tika.parser.DefaultParser"/>
> 
>  content="org.apache.tika.parser.mock.MockParser"/>
> 
> 
> 
>  content="R5FG5V2U44YXOZTMKGVNTTSPGLF2JH ==> expected:  but was: 
> [ERROR]   
> TikaResourceTest.testNoWriteLimitOnStreamingWrite:187->CXFTestBase.assertContains:66
>  separation. not found in:
> <html xmlns="http://www.w3.org/1999/xhtml">
> 
> 
>  content="org.apache.tika.parser.DefaultParser"/>
>  content="org.apache.tika.parser.mock.MockParser"/>
> 
>  content="AQWEMUMSJVFZWYGM4TKXRTQ5Q436X4DN"/>
> 
> 

[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-09 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845081#comment-17845081
 ] 

Tim Allison commented on TIKA-4252:
---

fetchRequestMetadata, fetchResponseMetadata?

> PipesClient#process - seems to lose the Fetch input metadata?
> -
>
> Key: TIKA-4252
> URL: https://issues.apache.org/jira/browse/TIKA-4252
> Project: Tika
>  Issue Type: Bug
>Reporter: Nicholas DiPiazza
>Priority: Major
> Fix For: 3.0.0
>
>
> when calling:
> PipesResult pipesResult = pipesClient.process(new 
> FetchEmitTuple(request.getFetchKey(),
>                     new FetchKey(fetcher.getName(), request.getFetchKey()), 
> new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG, 
> FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));
> the tikaMetadata is not present in the fetch data when the fetch method is 
> called.
>  
> It's OK through this part: 
>             UnsynchronizedByteArrayOutputStream bos = 
> UnsynchronizedByteArrayOutputStream.builder().get();
>             try (ObjectOutputStream objectOutputStream = new 
> ObjectOutputStream(bos)) {
>                 objectOutputStream.writeObject(t);
>             }
>             byte[] bytes = bos.toByteArray();
>             output.write(CALL.getByte());
>             output.writeInt(bytes.length);
>             output.write(bytes);
>             output.flush();
>  
> i verified the bytes have the expected metadata from that point.
>  
> UPDATE: found issue
>  
> org.apache.tika.pipes.PipesServer#parseFromTuple
>  
> is using a new Metadata when it should only use empty metadata if fetch tuple 
> metadata is null.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-09 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845072#comment-17845072
 ] 

Tim Allison edited comment on TIKA-4252 at 5/9/24 5:14 PM:
---

fetcher.fetch(String key, Metadata writeMetadata, Metadata readMetadata) ?

where writeMetadata is what you want to send to the fetcher and readMetadata is 
the metadata as it currently is, e.g. metadata gathered from the fetcher and 
propagated through to the results?

Better names?

toMetadata, fromMetadata?


was (Author: talli...@mitre.org):
fetcher.fetch(String key, Metadata writeMetadata, Metadata readMetadata) ?

where writeMetadata is what you want to send to the fetcher and readMetadata is 
the metadata as it currently is, e.g. metadata gathered from the fetcher and 
propagated through to the results?

Better names?

> PipesClient#process - seems to lose the Fetch input metadata?
> -
>
> Key: TIKA-4252
> URL: https://issues.apache.org/jira/browse/TIKA-4252
> Project: Tika
>  Issue Type: Bug
>Reporter: Nicholas DiPiazza
>Priority: Major
> Fix For: 3.0.0
>
>
> when calling:
> PipesResult pipesResult = pipesClient.process(new 
> FetchEmitTuple(request.getFetchKey(),
>                     new FetchKey(fetcher.getName(), request.getFetchKey()), 
> new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG, 
> FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));
> the tikaMetadata is not present in the fetch data when the fetch method is 
> called.
>  
> It's OK through this part: 
>             UnsynchronizedByteArrayOutputStream bos = 
> UnsynchronizedByteArrayOutputStream.builder().get();
>             try (ObjectOutputStream objectOutputStream = new 
> ObjectOutputStream(bos)) {
>                 objectOutputStream.writeObject(t);
>             }
>             byte[] bytes = bos.toByteArray();
>             output.write(CALL.getByte());
>             output.writeInt(bytes.length);
>             output.write(bytes);
>             output.flush();
>  
> i verified the bytes have the expected metadata from that point.
>  
> UPDATE: found issue
>  
> org.apache.tika.pipes.PipesServer#parseFromTuple
>  
> is using a new Metadata when it should only use empty metadata if fetch tuple 
> metadata is null.





[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-09 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845072#comment-17845072
 ] 

Tim Allison commented on TIKA-4252:
---

fetcher.fetch(String key, Metadata writeMetadata, Metadata readMetadata) ?

where writeMetadata is what you want to send to the fetcher and readMetadata is 
the metadata as it currently is, e.g. metadata gathered from the fetcher and 
propagated through to the results?

Better names?
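
To make the two-metadata signature concrete, here is a minimal, self-contained sketch. SimpleMetadata and InMemoryFetcher are invented stand-ins for this example, not Tika classes; the writeMetadata/readMetadata names come from the suggestion above:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;

public class FetcherSketch {

    // minimal stand-in for org.apache.tika.metadata.Metadata
    static class SimpleMetadata {
        private final Map<String, String> map = new HashMap<>();
        void set(String name, String value) { map.put(name, value); }
        String get(String name) { return map.get(name); }
    }

    interface Fetcher {
        // writeMetadata: caller-supplied hints sent *to* the fetcher;
        // readMetadata: populated *by* the fetcher and propagated to results
        InputStream fetch(String key, SimpleMetadata writeMetadata,
                          SimpleMetadata readMetadata) throws IOException;
    }

    static class InMemoryFetcher implements Fetcher {
        private final Map<String, byte[]> store = new HashMap<>();

        void put(String key, byte[] bytes) { store.put(key, bytes); }

        @Override
        public InputStream fetch(String key, SimpleMetadata writeMetadata,
                                 SimpleMetadata readMetadata) {
            byte[] bytes = store.get(key);
            // report what was fetched without mutating the caller's hints
            readMetadata.set("fetch.length", String.valueOf(bytes.length));
            return new ByteArrayInputStream(bytes);
        }
    }

    static String demo() throws IOException {
        InMemoryFetcher fetcher = new InMemoryFetcher();
        fetcher.put("doc1", "hello world".getBytes());
        SimpleMetadata write = new SimpleMetadata();
        write.set("range", "bytes=0-10");   // hypothetical hint to the fetcher
        SimpleMetadata read = new SimpleMetadata();
        try (InputStream is = fetcher.fetch("doc1", write, read)) {
            return read.get("fetch.length");
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(demo()); // prints 11
    }
}
```

The point of the split is that the fetcher only writes into readMetadata, so caller-supplied hints can never be clobbered by values the fetcher discovers.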

> PipesClient#process - seems to lose the Fetch input metadata?
> -
>
> Key: TIKA-4252
> URL: https://issues.apache.org/jira/browse/TIKA-4252
> Project: Tika
>  Issue Type: Bug
>Reporter: Nicholas DiPiazza
>Priority: Major
> Fix For: 3.0.0
>
>
> when calling:
> PipesResult pipesResult = pipesClient.process(new 
> FetchEmitTuple(request.getFetchKey(),
>                     new FetchKey(fetcher.getName(), request.getFetchKey()), 
> new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG, 
> FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));
> the tikaMetadata is not present in the fetch data when the fetch method is 
> called.
>  
> It's OK through this part: 
>             UnsynchronizedByteArrayOutputStream bos = 
> UnsynchronizedByteArrayOutputStream.builder().get();
>             try (ObjectOutputStream objectOutputStream = new 
> ObjectOutputStream(bos)) {
>                 objectOutputStream.writeObject(t);
>             }
>             byte[] bytes = bos.toByteArray();
>             output.write(CALL.getByte());
>             output.writeInt(bytes.length);
>             output.write(bytes);
>             output.flush();
>  
> i verified the bytes have the expected metadata from that point.
>  
> UPDATE: found issue
>  
> org.apache.tika.pipes.PipesServer#parseFromTuple
>  
> is using a new Metadata when it should only use empty metadata if fetch tuple 
> metadata is null.





[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-09 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845068#comment-17845068
 ] 

Tim Allison commented on TIKA-4252:
---

Should we add an optional Metadata object to the FetchKey? We could have this 
propagate through to the fetcher but never be confused with provenance data nor 
extracted content.
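
The idea above could be sketched roughly as follows; this is an illustrative stand-in, not the real org.apache.tika.pipes.fetcher.FetchKey, and the field and method names are invented for the example:

```java
import java.util.Collections;
import java.util.Map;

// Sketch of a FetchKey that optionally carries metadata for the fetcher,
// kept apart from provenance data and extracted content.
public class FetchKeySketch {
    private final String fetcherName;
    private final String fetchKey;
    private final Map<String, String> fetchMetadata; // optional, never null

    FetchKeySketch(String fetcherName, String fetchKey) {
        this(fetcherName, fetchKey, null);
    }

    FetchKeySketch(String fetcherName, String fetchKey,
                   Map<String, String> fetchMetadata) {
        this.fetcherName = fetcherName;
        this.fetchKey = fetchKey;
        // absent metadata becomes an empty map so callers never see null
        this.fetchMetadata =
                fetchMetadata == null ? Collections.emptyMap() : fetchMetadata;
    }

    String getFetcherName() { return fetcherName; }

    String getFetchKey() { return fetchKey; }

    Map<String, String> getFetchMetadata() { return fetchMetadata; }

    public static void main(String[] args) {
        FetchKeySketch key = new FetchKeySketch("fs", "doc1",
                Map.of("auth.token", "abc")); // hypothetical fetcher-only hint
        System.out.println(key.getFetchMetadata().get("auth.token")); // prints abc
    }
}
```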

> PipesClient#process - seems to lose the Fetch input metadata?
> -
>
> Key: TIKA-4252
> URL: https://issues.apache.org/jira/browse/TIKA-4252
> Project: Tika
>  Issue Type: Bug
>Reporter: Nicholas DiPiazza
>Priority: Major
> Fix For: 3.0.0
>
>
> when calling:
> PipesResult pipesResult = pipesClient.process(new 
> FetchEmitTuple(request.getFetchKey(),
>                     new FetchKey(fetcher.getName(), request.getFetchKey()), 
> new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG, 
> FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));
> the tikaMetadata is not present in the fetch data when the fetch method is 
> called.
>  
> It's OK through this part: 
>             UnsynchronizedByteArrayOutputStream bos = 
> UnsynchronizedByteArrayOutputStream.builder().get();
>             try (ObjectOutputStream objectOutputStream = new 
> ObjectOutputStream(bos)) {
>                 objectOutputStream.writeObject(t);
>             }
>             byte[] bytes = bos.toByteArray();
>             output.write(CALL.getByte());
>             output.writeInt(bytes.length);
>             output.write(bytes);
>             output.flush();
>  
> i verified the bytes have the expected metadata from that point.
>  
> UPDATE: found issue
>  
> org.apache.tika.pipes.PipesServer#parseFromTuple
>  
> is using a new Metadata when it should only use empty metadata if fetch tuple 
> metadata is null.





[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-09 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845062#comment-17845062
 ] 

Tim Allison commented on TIKA-4252:
---

K, but you don't want that coming back and being populated in the results, 
right?

> PipesClient#process - seems to lose the Fetch input metadata?
> -
>
> Key: TIKA-4252
> URL: https://issues.apache.org/jira/browse/TIKA-4252
> Project: Tika
>  Issue Type: Bug
>Reporter: Nicholas DiPiazza
>Priority: Major
> Fix For: 3.0.0
>
>
> when calling:
> PipesResult pipesResult = pipesClient.process(new 
> FetchEmitTuple(request.getFetchKey(),
>                     new FetchKey(fetcher.getName(), request.getFetchKey()), 
> new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG, 
> FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));
> the tikaMetadata is not present in the fetch data when the fetch method is 
> called.
>  
> It's OK through this part: 
>             UnsynchronizedByteArrayOutputStream bos = 
> UnsynchronizedByteArrayOutputStream.builder().get();
>             try (ObjectOutputStream objectOutputStream = new 
> ObjectOutputStream(bos)) {
>                 objectOutputStream.writeObject(t);
>             }
>             byte[] bytes = bos.toByteArray();
>             output.write(CALL.getByte());
>             output.writeInt(bytes.length);
>             output.write(bytes);
>             output.flush();
>  
> i verified the bytes have the expected metadata from that point.
>  
> UPDATE: found issue
>  
> org.apache.tika.pipes.PipesServer#parseFromTuple
>  
> is using a new Metadata when it should only use empty metadata if fetch tuple 
> metadata is null.





[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-09 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845051#comment-17845051
 ] 

Tim Allison commented on TIKA-4252:
---

Or, if you mean that metadata gathered from the fetcher isn't making it through 
into the results, I just added a few tests for that.

> PipesClient#process - seems to lose the Fetch input metadata?
> -
>
> Key: TIKA-4252
> URL: https://issues.apache.org/jira/browse/TIKA-4252
> Project: Tika
>  Issue Type: Bug
>Reporter: Nicholas DiPiazza
>Priority: Major
> Fix For: 3.0.0
>
>
> when calling:
> PipesResult pipesResult = pipesClient.process(new 
> FetchEmitTuple(request.getFetchKey(),
>                     new FetchKey(fetcher.getName(), request.getFetchKey()), 
> new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG, 
> FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));
> the tikaMetadata is not present in the fetch data when the fetch method is 
> called.
>  
> It's OK through this part: 
>             UnsynchronizedByteArrayOutputStream bos = 
> UnsynchronizedByteArrayOutputStream.builder().get();
>             try (ObjectOutputStream objectOutputStream = new 
> ObjectOutputStream(bos)) {
>                 objectOutputStream.writeObject(t);
>             }
>             byte[] bytes = bos.toByteArray();
>             output.write(CALL.getByte());
>             output.writeInt(bytes.length);
>             output.write(bytes);
>             output.flush();
>  
> i verified the bytes have the expected metadata from that point.
>  
> UPDATE: found issue
>  
> org.apache.tika.pipes.PipesServer#parseFromTuple
>  
> is using a new Metadata when it should only use empty metadata if fetch tuple 
> metadata is null.





[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-09 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845048#comment-17845048
 ] 

Tim Allison commented on TIKA-4252:
---

My initial thought for injecting user metadata was to pass through provenance 
information etc into the final document/output.

I wanted to make sure that metadata extracted during the parse didn't overwrite 
user injected data so... I injected the user metadata _after_ the parse and 
after the metadata filters were applied.

[~ndipiazza], to confirm, you want to inject user metadata so that it is 
available for the fetchers?

> PipesClient#process - seems to lose the Fetch input metadata?
> -
>
> Key: TIKA-4252
> URL: https://issues.apache.org/jira/browse/TIKA-4252
> Project: Tika
>  Issue Type: Bug
>Reporter: Nicholas DiPiazza
>Priority: Major
> Fix For: 3.0.0
>
>
> when calling:
> PipesResult pipesResult = pipesClient.process(new 
> FetchEmitTuple(request.getFetchKey(),
>                     new FetchKey(fetcher.getName(), request.getFetchKey()), 
> new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG, 
> FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));
> the tikaMetadata is not present in the fetch data when the fetch method is 
> called.
>  
> It's OK through this part: 
>             UnsynchronizedByteArrayOutputStream bos = 
> UnsynchronizedByteArrayOutputStream.builder().get();
>             try (ObjectOutputStream objectOutputStream = new 
> ObjectOutputStream(bos)) {
>                 objectOutputStream.writeObject(t);
>             }
>             byte[] bytes = bos.toByteArray();
>             output.write(CALL.getByte());
>             output.writeInt(bytes.length);
>             output.write(bytes);
>             output.flush();
>  
> i verified the bytes have the expected metadata from that point.
>  
> UPDATE: found issue
>  
> org.apache.tika.pipes.PipesServer#parseFromTuple
>  
> is using a new Metadata when it should only use empty metadata if fetch tuple 
> metadata is null.





[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-09 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845047#comment-17845047
 ] 

Tim Allison commented on TIKA-4252:
---

I opened this branch: https://github.com/apache/tika/tree/TIKA-4252

This reverts the change I suggested above and adds a unit test to confirm 
behavior that I incorrectly thought was reported as broken.

Now that I actually read this issue more carefully -- sorry -- it looks like 
the issue is that you want to pass user-injected metadata through to the 
fetcher. 

The problem is _NOT_ that you are not getting user-injected metadata back 
through the results.

> PipesClient#process - seems to lose the Fetch input metadata?
> -
>
> Key: TIKA-4252
> URL: https://issues.apache.org/jira/browse/TIKA-4252
> Project: Tika
>  Issue Type: Bug
>Reporter: Nicholas DiPiazza
>Priority: Major
> Fix For: 3.0.0
>
>
> when calling:
> PipesResult pipesResult = pipesClient.process(new 
> FetchEmitTuple(request.getFetchKey(),
>                     new FetchKey(fetcher.getName(), request.getFetchKey()), 
> new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG, 
> FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));
> the tikaMetadata is not present in the fetch data when the fetch method is 
> called.
>  
> It's OK through this part: 
>             UnsynchronizedByteArrayOutputStream bos = 
> UnsynchronizedByteArrayOutputStream.builder().get();
>             try (ObjectOutputStream objectOutputStream = new 
> ObjectOutputStream(bos)) {
>                 objectOutputStream.writeObject(t);
>             }
>             byte[] bytes = bos.toByteArray();
>             output.write(CALL.getByte());
>             output.writeInt(bytes.length);
>             output.write(bytes);
>             output.flush();
>  
> i verified the bytes have the expected metadata from that point.
>  
> UPDATE: found issue
>  
> org.apache.tika.pipes.PipesServer#parseFromTuple
>  
> is using a new Metadata when it should only use empty metadata if fetch tuple 
> metadata is null.





[jira] [Reopened] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-09 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison reopened TIKA-4252:
---

I pointed you to the wrong part of the code ... sorry. The design goal was to 
overwrite the extracted metadata with user metadata after the parse and before 
the emit.

This is what's leading to the new failing unit test in tika-server's 
testConcatenated();

I'm taking a look now.

> PipesClient#process - seems to lose the Fetch input metadata?
> -
>
> Key: TIKA-4252
> URL: https://issues.apache.org/jira/browse/TIKA-4252
> Project: Tika
>  Issue Type: Bug
>Reporter: Nicholas DiPiazza
>Priority: Major
> Fix For: 3.0.0
>
>
> when calling:
> PipesResult pipesResult = pipesClient.process(new 
> FetchEmitTuple(request.getFetchKey(),
>                     new FetchKey(fetcher.getName(), request.getFetchKey()), 
> new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG, 
> FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));
> the tikaMetadata is not present in the fetch data when the fetch method is 
> called.
>  
> It's OK through this part: 
>             UnsynchronizedByteArrayOutputStream bos = 
> UnsynchronizedByteArrayOutputStream.builder().get();
>             try (ObjectOutputStream objectOutputStream = new 
> ObjectOutputStream(bos)) {
>                 objectOutputStream.writeObject(t);
>             }
>             byte[] bytes = bos.toByteArray();
>             output.write(CALL.getByte());
>             output.writeInt(bytes.length);
>             output.write(bytes);
>             output.flush();
>  
> i verified the bytes have the expected metadata from that point.
>  
> UPDATE: found issue
>  
> org.apache.tika.pipes.PipesServer#parseFromTuple
>  
> is using a new Metadata when it should only use empty metadata if fetch tuple 
> metadata is null.





[jira] [Commented] (TIKA-4253) Duplicate parsers loaded in AutoDetectParser in 3.x at least in some unit tests

2024-05-09 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845022#comment-17845022
 ] 

Tim Allison commented on TIKA-4253:
---

This is happening in the unit tests because there are multiple service loading 
files on the classpath in tika-parsers-standard from the different modules.

We could change the list to a set in 
ServiceLoader#identifyStaticServiceProviders.
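
The dedup idea could look roughly like this; identifyStaticServiceProviders below is a simplified stand-in operating on class names, not Tika's actual ServiceLoader code:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

public class ServiceDedupSketch {

    // Collect providers into a LinkedHashSet so that the same parser listed in
    // multiple META-INF/services files on the classpath is loaded only once,
    // while first-seen order is preserved.
    static List<String> identifyStaticServiceProviders(List<String> namesFromAllFiles) {
        return new ArrayList<>(new LinkedHashSet<>(namesFromAllFiles));
    }

    public static void main(String[] args) {
        List<String> loaded = identifyStaticServiceProviders(List.of(
                "org.apache.tika.parser.mock.MockParser",
                "org.apache.tika.parser.DefaultParser",
                "org.apache.tika.parser.mock.MockParser")); // duplicate entry
        System.out.println(loaded.size()); // prints 2
    }
}
```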

> Duplicate parsers loaded in AutoDetectParser in 3.x at least in some unit 
> tests
> ---
>
> Key: TIKA-4253
> URL: https://issues.apache.org/jira/browse/TIKA-4253
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> I haven't checked 2.x yet, but it looks like the AutoDetectParser with and 
> without a custom TikaConfig is loading parsers twice at least in 
> tika-parsers-standard unit tests.
> We should figure out if this is happening elsewhere in tika-app and 
> tika-server and fix it where we find it. 





[jira] [Created] (TIKA-4253) Duplicate parsers loaded in AutoDetectParser in 3.x at least in some unit tests

2024-05-09 Thread Tim Allison (Jira)
Tim Allison created TIKA-4253:
-

 Summary: Duplicate parsers loaded in AutoDetectParser in 3.x at 
least in some unit tests
 Key: TIKA-4253
 URL: https://issues.apache.org/jira/browse/TIKA-4253
 Project: Tika
  Issue Type: Task
Reporter: Tim Allison


I haven't checked 2.x yet, but it looks like the AutoDetectParser with and 
without a custom TikaConfig is loading parsers twice at least in 
tika-parsers-standard unit tests.

We should figure out if this is happening elsewhere in tika-app and tika-server 
and fix it where we find it. 





[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-09 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17844998#comment-17844998
 ] 

Tim Allison commented on TIKA-4252:
---

Good catch: 
https://github.com/apache/tika/blob/main/tika-core/src/main/java/org/apache/tika/pipes/PipesServer.java#L465

Shall I fix it or are you in progress?
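
A hedged sketch of the fix being discussed: reuse the metadata carried on the FetchEmitTuple and fall back to a fresh object only when none was supplied. The class and method names below are illustrative stand-ins, not the exact PipesServer code:

```java
import java.util.HashMap;

public class ParseFromTupleSketch {

    // stand-in for org.apache.tika.metadata.Metadata
    static class Metadata extends HashMap<String, String> {}

    // stand-in for the relevant slice of FetchEmitTuple
    static class FetchEmitTuple {
        private final Metadata metadata;
        FetchEmitTuple(Metadata metadata) { this.metadata = metadata; }
        Metadata getMetadata() { return metadata; }
    }

    // before: Metadata metadata = new Metadata();  // drops caller-supplied values
    // after:  only use an empty Metadata when the tuple carries none
    static Metadata metadataFor(FetchEmitTuple t) {
        return t.getMetadata() == null ? new Metadata() : t.getMetadata();
    }

    public static void main(String[] args) {
        Metadata user = new Metadata();
        user.put("provenance", "crawl-42"); // hypothetical user-injected value
        Metadata m = metadataFor(new FetchEmitTuple(user));
        System.out.println(m.get("provenance")); // prints crawl-42
    }
}
```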

> PipesClient#process - seems to lose the Fetch input metadata?
> -
>
> Key: TIKA-4252
> URL: https://issues.apache.org/jira/browse/TIKA-4252
> Project: Tika
>  Issue Type: Bug
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> when calling:
> PipesResult pipesResult = pipesClient.process(new 
> FetchEmitTuple(request.getFetchKey(),
>                     new FetchKey(fetcher.getName(), request.getFetchKey()), 
> new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG, 
> FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));
> the tikaMetadata is not present in the fetch data when the fetch method is 
> called.
>  
> It's OK through this part: 
>             UnsynchronizedByteArrayOutputStream bos = 
> UnsynchronizedByteArrayOutputStream.builder().get();
>             try (ObjectOutputStream objectOutputStream = new 
> ObjectOutputStream(bos)) {
>                 objectOutputStream.writeObject(t);
>             }
>             byte[] bytes = bos.toByteArray();
>             output.write(CALL.getByte());
>             output.writeInt(bytes.length);
>             output.write(bytes);
>             output.flush();
>  
> i verified the bytes have the expected metadata from that point.





[jira] [Comment Edited] (TIKA-4250) Add a libpst-based parser

2024-05-09 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17844976#comment-17844976
 ] 

Tim Allison edited comment on TIKA-4250 at 5/9/24 12:59 PM:


libpst issue opened: https://github.com/pst-format/libpst/issues/14



was (Author: talli...@mitre.org):
libpff issue opened: https://github.com/libyal/libpff/issues/128

Note that I found non-deterministic behavior even without debug on -- sometimes 
I got 7 extracted files, sometimes 8. I noted that in the issue. 

> Add a libpst-based parser
> -
>
> Key: TIKA-4250
> URL: https://issues.apache.org/jira/browse/TIKA-4250
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
> Attachments: 8.eml, 8.msg
>
>
> We currently use the com.pff Java-based PST parser for PST files. It would be 
> useful to add a wrapper for libpst as an optional parser. 
> One of the benefits of libpst is that it creates .eml or .msg files from the 
> PST records. This is critical for those who want the original bytes from 
> embedded files. Obv, PST doesn't store eml or msg, but some users want the 
> "original" emails even if they are constructed from PST records.





[jira] [Commented] (TIKA-4250) Add a libpst-based parser

2024-05-09 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17844976#comment-17844976
 ] 

Tim Allison commented on TIKA-4250:
---

libpff issue opened: https://github.com/libyal/libpff/issues/128

Note that I found non-deterministic behavior even without debug on -- sometimes 
I got 7 extracted files, sometimes 8. I noted that in the issue. 

> Add a libpst-based parser
> -
>
> Key: TIKA-4250
> URL: https://issues.apache.org/jira/browse/TIKA-4250
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
> Attachments: 8.eml, 8.msg
>
>
> We currently use the com.pff Java-based PST parser for PST files. It would be 
> useful to add a wrapper for libpst as an optional parser. 
> One of the benefits of libpst is that it creates .eml or .msg files from the 
> PST records. This is critical for those who want the original bytes from 
> embedded files. Obv, PST doesn't store eml or msg, but some users want the 
> "original" emails even if they are constructed from PST records.





[jira] [Updated] (TIKA-4251) [DISCUSS] move to cosium's git-code-format-maven-plugin with google-java-format

2024-05-06 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-4251:
--
Description: 
I was recently working a bit on incubator-stormcrawler, and I noticed that they 
are using cosium's git-code-format-maven-plugin: 
https://github.com/Cosium/git-code-format-maven-plugin

I was initially annoyed that I couldn't quickly figure out what I had to fix to 
make the linter happy, but then I realized there was a magic command: {{mvn 
git-code-format:format-code}} which just fixed the code so that the linter 
passed. 

The one drawback I found is that it does not fix nor does it alert on wildcard 
imports.  We could still use checkstyle for that but only have one rule for 
checkstyle.

The other drawback is that there is not a lot of room for variation from 
google's style. This may actually be a benefit, too, of course.

I just ran this on {{tika-core}} here: 
https://github.com/apache/tika/tree/google-java-format

What would you think about making this change for 3.x?

  was:
I was recently working a bit on incubator-stormcrawler, and I noticed that they 
are using cosium's git-code-format-maven-plugin: 
https://github.com/Cosium/git-code-format-maven-plugin

I was initially annoyed that I couldn't quickly figure out how my code changes 
were causing the build to fail, but then I realized there was a magic command: 
{{mvn git-code-format:format-code}} which just fixed the code so that the 
linter passed. 

The one drawback I found is that it does not fix nor does it alert on wildcard 
imports.  We could still use checkstyle for that but only have one rule for 
checkstyle.

The other drawback is that there is not a lot of room for variation from 
google's style. This may actually be a benefit, too, of course.

I just ran this on {{tika-core}} here: 
https://github.com/apache/tika/tree/google-java-format

What would you think about making this change for 3.x?


> [DISCUSS] move to cosium's git-code-format-maven-plugin with 
> google-java-format
> ---
>
> Key: TIKA-4251
> URL: https://issues.apache.org/jira/browse/TIKA-4251
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> I was recently working a bit on incubator-stormcrawler, and I noticed that 
> they are using cosium's git-code-format-maven-plugin: 
> https://github.com/Cosium/git-code-format-maven-plugin
> I was initially annoyed that I couldn't quickly figure out what I had to fix 
> to make the linter happy, but then I realized there was a magic command: 
> {{mvn git-code-format:format-code}} which just fixed the code so that the 
> linter passed. 
> The one drawback I found is that it neither fixes nor alerts on wildcard 
> imports. We could still use checkstyle for that, but with only one rule.
> The other drawback is that there is not a lot of room for variation from 
> Google's style. This may actually be a benefit, too, of course.
> I just ran this on {{tika-core}} here: 
> https://github.com/apache/tika/tree/google-java-format
> What would you think about making this change for 3.x?
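[Editor's note] For anyone evaluating the proposal, a minimal, untested sketch of wiring the plugin into a pom.xml follows. The coordinates and goal names come from the plugin's upstream documentation; the version is a placeholder, and recent plugin versions also expect a {{com.cosium.code:google-java-format}} dependency declared inside the plugin block.

```xml
<!-- Hedged sketch, not a tested Tika configuration; version is a placeholder. -->
<plugin>
  <groupId>com.cosium.code</groupId>
  <artifactId>git-code-format-maven-plugin</artifactId>
  <version>5.3</version>
  <executions>
    <!-- Installs git pre-commit hooks that format staged files -->
    <execution>
      <id>install-formatter-hook</id>
      <goals>
        <goal>install-hooks</goal>
      </goals>
    </execution>
    <!-- Fails the build when committed code is not google-java-format clean -->
    <execution>
      <id>validate-code-format</id>
      <goals>
        <goal>validate-code-format</goal>
      </goals>
    </execution>
  </executions>
</plugin>
```

With something like that in place, {{mvn git-code-format:format-code}} rewrites the sources in place, while the validate goal only checks them.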





[jira] [Updated] (TIKA-4251) [DISCUSS] move to cosium's git-code-format-maven-plugin with google-java-format

2024-05-06 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-4251:
--
Summary: [DISCUSS] move to cosium's git-code-format-maven-plugin with 
google-java-format  (was: [DISCUSS] move to cosium's 
git-code-format-maven-plugin)

> [DISCUSS] move to cosium's git-code-format-maven-plugin with 
> google-java-format
> ---
>
> Key: TIKA-4251
> URL: https://issues.apache.org/jira/browse/TIKA-4251
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> I was recently working a bit on incubator-stormcrawler, and I noticed that 
> they are using cosium's git-code-format-maven-plugin: 
> https://github.com/Cosium/git-code-format-maven-plugin
> I was initially annoyed that I couldn't quickly figure out how my code 
> changes were causing the build to fail, but then I realized there was a magic 
> command: {{mvn git-code-format:format-code}} which just fixed the code so 
> that the linter passed. 
> The one drawback I found is that it neither fixes nor alerts on wildcard 
> imports. We could still use checkstyle for that, but with only one rule.
> The other drawback is that there is not a lot of room for variation from 
> Google's style. This may actually be a benefit, too, of course.
> I just ran this on {{tika-core}} here: 
> https://github.com/apache/tika/tree/google-java-format
> What would you think about making this change for 3.x?





[jira] [Created] (TIKA-4251) [DISCUSS] move to cosium's git-code-format-maven-plugin

2024-05-06 Thread Tim Allison (Jira)
Tim Allison created TIKA-4251:
-

 Summary: [DISCUSS] move to cosium's git-code-format-maven-plugin
 Key: TIKA-4251
 URL: https://issues.apache.org/jira/browse/TIKA-4251
 Project: Tika
  Issue Type: Task
Reporter: Tim Allison


I was recently working a bit on incubator-stormcrawler, and I noticed that they 
are using cosium's git-code-format-maven-plugin: 
https://github.com/Cosium/git-code-format-maven-plugin

I was initially annoyed that I couldn't quickly figure out how my code changes 
were causing the build to fail, but then I realized there was a magic command: 
{{mvn git-code-format:format-code}} which just fixed the code so that the 
linter passed. 

The one drawback I found is that it neither fixes nor alerts on wildcard 
imports. We could still use checkstyle for that, but with only one rule.

The other drawback is that there is not a lot of room for variation from 
Google's style. This may actually be a benefit, too, of course.

I just ran this on {{tika-core}} here: 
https://github.com/apache/tika/tree/google-java-format

What would you think about making this change for 3.x?





[jira] [Comment Edited] (TIKA-4250) Add a libpst-based parser

2024-05-06 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17843746#comment-17843746
 ] 

Tim Allison edited comment on TIKA-4250 at 5/6/24 5:03 PM:
---

Wait, so, on licensing: can we include a wrapper for either libpst or libpff, 
given that libpst is GPL 2 and libpff is GPL 3 
(https://www.apache.org/licenses/GPL-compatibility.html)?

[~nick] is the answer obvious or should I open a ticket on LEGAL?


was (Author: talli...@mitre.org):
Wait, so, on licensing can we include a wrapper for either libpst or libpff 
because libpst is GPL 2 and libpff is GPL 3 
(https://www.apache.org/licenses/GPL-compatibility.html)?

> Add a libpst-based parser
> -
>
> Key: TIKA-4250
> URL: https://issues.apache.org/jira/browse/TIKA-4250
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
> Attachments: 8.eml, 8.msg
>
>
> We currently use the com.pff Java-based PST parser for PST files. It would be 
> useful to add a wrapper for libpst as an optional parser. 
> One of the benefits of libpst is that it creates .eml or .msg files from the 
> PST records. This is critical for those who want the original bytes from 
> embedded files. Obv, PST doesn't store eml or msg, but some users want the 
> "original" emails even if they are constructed from PST records.





[jira] [Comment Edited] (TIKA-4250) Add a libpst-based parser

2024-05-06 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17843798#comment-17843798
 ] 

Tim Allison edited comment on TIKA-4250 at 5/6/24 5:02 PM:
---

So, I caught an example of libpst not exporting an attachment in an msg file 
via our unit test file (testPST.pst). The attached msg should contain an 
embedded msg that includes a docx. Via a hex editor, I can see that there is no 
embedded msg in 8.msg, whereas the structure is correctly maintained in 8.eml.


was (Author: talli...@mitre.org):
So, I caught an example of libpst not reading an attachment in our unit test 
file (testPST.pst). The attached msg should contain an embedded msg that 
includes a docx. Via a hex editor, I can see that there is no embedded msg in 
8.msg, whereas the structure is correctly maintained in 8.eml.

> Add a libpst-based parser
> -
>
> Key: TIKA-4250
> URL: https://issues.apache.org/jira/browse/TIKA-4250
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
> Attachments: 8.eml, 8.msg
>
>
> We currently use the com.pff Java-based PST parser for PST files. It would be 
> useful to add a wrapper for libpst as an optional parser. 
> One of the benefits of libpst is that it creates .eml or .msg files from the 
> PST records. This is critical for those who want the original bytes from 
> embedded files. Obv, PST doesn't store eml or msg, but some users want the 
> "original" emails even if they are constructed from PST records.





[jira] [Commented] (TIKA-4250) Add a libpst-based parser

2024-05-06 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17843798#comment-17843798
 ] 

Tim Allison commented on TIKA-4250:
---

So, I caught an example of libpst not reading an attachment in our unit test 
file (testPST.pst). The attached msg should contain an embedded msg that 
includes a docx. Via a hex editor, I can see that there is no embedded msg in 
8.msg, whereas the structure is correctly maintained in 8.eml.

> Add a libpst-based parser
> -
>
> Key: TIKA-4250
> URL: https://issues.apache.org/jira/browse/TIKA-4250
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
> Attachments: 8.eml, 8.msg
>
>
> We currently use the com.pff Java-based PST parser for PST files. It would be 
> useful to add a wrapper for libpst as an optional parser. 
> One of the benefits of libpst is that it creates .eml or .msg files from the 
> PST records. This is critical for those who want the original bytes from 
> embedded files. Obv, PST doesn't store eml or msg, but some users want the 
> "original" emails even if they are constructed from PST records.





[jira] [Updated] (TIKA-4250) Add a libpst-based parser

2024-05-06 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-4250:
--
Attachment: 8.eml

> Add a libpst-based parser
> -
>
> Key: TIKA-4250
> URL: https://issues.apache.org/jira/browse/TIKA-4250
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
> Attachments: 8.eml, 8.msg
>
>
> We currently use the com.pff Java-based PST parser for PST files. It would be 
> useful to add a wrapper for libpst as an optional parser. 
> One of the benefits of libpst is that it creates .eml or .msg files from the 
> PST records. This is critical for those who want the original bytes from 
> embedded files. Obv, PST doesn't store eml or msg, but some users want the 
> "original" emails even if they are constructed from PST records.





[jira] [Updated] (TIKA-4250) Add a libpst-based parser

2024-05-06 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-4250:
--
Attachment: 8.msg

> Add a libpst-based parser
> -
>
> Key: TIKA-4250
> URL: https://issues.apache.org/jira/browse/TIKA-4250
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
> Attachments: 8.eml, 8.msg
>
>
> We currently use the com.pff Java-based PST parser for PST files. It would be 
> useful to add a wrapper for libpst as an optional parser. 
> One of the benefits of libpst is that it creates .eml or .msg files from the 
> PST records. This is critical for those who want the original bytes from 
> embedded files. Obv, PST doesn't store eml or msg, but some users want the 
> "original" emails even if they are constructed from PST records.





[jira] [Comment Edited] (TIKA-4250) Add a libpst-based parser

2024-05-06 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17843740#comment-17843740
 ] 

Tim Allison edited comment on TIKA-4250 at 5/6/24 1:02 PM:
---

Wow. This is super helpful. I guess the answer is to run all three? 

But seriously, should we fork java-libpst and add your extra fixes? Or, better, 
try to push them into the actual java-libpst? Longer term, we could see about 
adding meetings, Documents, Notes, vCalendars, and vJournals into that fork?

This gives some confidence that we were doing well with java-libpst.

In my own, much more modest testing (one large pst), I noticed that libpst had 
fewer emails and fewer attachments. What was weird, though, was that the number 
of emails was equal or closer to equal when I turned on debug mode in libpst. 
It was much, much slower, but it got the same number of emails as java-libpst.

Again, thank you!


was (Author: talli...@mitre.org):
Wow. This is super helpful. I guess the answer is to run all three? 

But seriously, should we fork java-libpst and add your extra fixes? Or, better, 
try to push them into the actual java-libpst? Longer term, we could see about 
adding meetings, Documents, Notes and Vjournals into that fork?

This gives some confidence that we were doing will with java-libpst.

In my own, much more modest testing (one large pst), I noticed that libpst had 
fewer emails and fewer attachments. What was weird, though, was that the number 
of emails was equal or closer to equal when I turned debug-mode on on libpst. 
It was much, much slower, but it got the same number of emails as java-libpst.

Again, thank you!

> Add a libpst-based parser
> -
>
> Key: TIKA-4250
> URL: https://issues.apache.org/jira/browse/TIKA-4250
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> We currently use the com.pff Java-based PST parser for PST files. It would be 
> useful to add a wrapper for libpst as an optional parser. 
> One of the benefits of libpst is that it creates .eml or .msg files from the 
> PST records. This is critical for those who want the original bytes from 
> embedded files. Obv, PST doesn't store eml or msg, but some users want the 
> "original" emails even if they are constructed from PST records.





[jira] [Commented] (TIKA-4250) Add a libpst-based parser

2024-05-04 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17843428#comment-17843428
 ] 

Tim Allison commented on TIKA-4250:
---

Given your experience, I think it would be valuable to add libpff as an 
optional PST parser to Tika. 

Advanced users can use libpff for content+metadata and then libpst to generate 
msg files -- with the understanding that some msg files can't be generated 
(e.g. when libpst fails).

> Add a libpst-based parser
> -
>
> Key: TIKA-4250
> URL: https://issues.apache.org/jira/browse/TIKA-4250
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> We currently use the com.pff Java-based PST parser for PST files. It would be 
> useful to add a wrapper for libpst as an optional parser. 
> One of the benefits of libpst is that it creates .eml or .msg files from the 
> PST records. This is critical for those who want the original bytes from 
> embedded files. Obv, PST doesn't store eml or msg, but some users want the 
> "original" emails even if they are constructed from PST records.





[jira] [Commented] (TIKA-4250) Add a libpst-based parser

2024-05-03 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17843361#comment-17843361
 ] 

Tim Allison commented on TIKA-4250:
---

Hahahahaha. I figured you'd have input on this [~lfcnassif]! 

Y, libpst is aging but it is slightly fresher than java-libpst. :/

I'll take a look at your wrapper. Thank you!

> Add a libpst-based parser
> -
>
> Key: TIKA-4250
> URL: https://issues.apache.org/jira/browse/TIKA-4250
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> We currently use the com.pff Java-based PST parser for PST files. It would be 
> useful to add a wrapper for libpst as an optional parser. 
> One of the benefits of libpst is that it creates .eml or .msg files from the 
> PST records. This is critical for those who want the original bytes from 
> embedded files. Obv, PST doesn't store eml or msg, but some users want the 
> "original" emails even if they are constructed from PST records.





[jira] [Commented] (TIKA-4249) EML file is treating it as text file in 2.9.2 version

2024-05-03 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17843217#comment-17843217
 ] 

Tim Allison commented on TIKA-4249:
---

> Crystal ball is murky on the timing of the next 2.x and 3.x releases.

I don't know.

> EML file is treating it as text file in 2.9.2 version
> -
>
> Key: TIKA-4249
> URL: https://issues.apache.org/jira/browse/TIKA-4249
> Project: Tika
>  Issue Type: Bug
>Reporter: Tika User
>Priority: Blocker
> Fix For: 3.0.0, 2.9.3
>
>
> We recently upgraded from 2.9.0 to 2.9.2 and found that the attached file is 
> treated as a text file instead of an email file. Please look into this issue.





[jira] [Created] (TIKA-4250) Add a libpst-based parser

2024-05-02 Thread Tim Allison (Jira)
Tim Allison created TIKA-4250:
-

 Summary: Add a libpst-based parser
 Key: TIKA-4250
 URL: https://issues.apache.org/jira/browse/TIKA-4250
 Project: Tika
  Issue Type: Task
Reporter: Tim Allison


We currently use the com.pff Java-based PST parser for PST files. It would be 
useful to add a wrapper for libpst as an optional parser. 

One of the benefits of libpst is that it creates .eml or .msg files from the 
PST records. This is critical for those who want the original bytes from 
embedded files. Obv, PST doesn't store eml or msg, but some users want the 
"original" emails even if they are constructed from PST records.





[jira] [Commented] (TIKA-4249) EML file is treating it as text file in 2.9.2 version

2024-05-01 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842745#comment-17842745
 ] 

Tim Allison commented on TIKA-4249:
---

Version numbers for the fix are noted above: 2.9.3 and 3.0.0 (probably 
3.0.0-BETA2 first?). We recently released 2.9.2. Crystal ball is murky on the 
timing of the next 2.x and 3.x releases.

> EML file is treating it as text file in 2.9.2 version
> -
>
> Key: TIKA-4249
> URL: https://issues.apache.org/jira/browse/TIKA-4249
> Project: Tika
>  Issue Type: Bug
>Reporter: Tika User
>Priority: Blocker
> Fix For: 3.0.0, 2.9.3
>
>
> We recently upgraded from 2.9.0 to 2.9.2 and found that the attached file is 
> treated as a text file instead of an email file. Please look into this issue.





[jira] [Commented] (TIKA-4243) tika configuration overhaul

2024-05-01 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842605#comment-17842605
 ] 

Tim Allison commented on TIKA-4243:
---

Do we put it in tika-serialization or a new module?

> tika configuration overhaul
> ---
>
> Key: TIKA-4243
> URL: https://issues.apache.org/jira/browse/TIKA-4243
> Project: Tika
>  Issue Type: New Feature
>  Components: config
>Affects Versions: 3.0.0
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> In 3.0.0 when dealing with Tika, it would greatly help to have a Typed 
> Configuration schema. 
> In 3.x can we remove the old way of doing configs and replace with Json 
> Schema?
> Json Schema can be converted to Pojos using a maven plugin 
> [https://github.com/joelittlejohn/jsonschema2pojo]
> This automatically creates a Java Pojo model we can use for the configs. 
> This can allow for the legacy tika-config XML to be read and converted to the 
> new pojos easily using an XML mapper so that users don't have to use JSON 
> configurations yet if they do not want.
> When complete, configurations can be set as XML, JSON or YAML
> tika-config.xml
> tika-config.json
> tika-config.yaml
> Replace all instances of tika config annotations that used the old syntax, 
> and replace with the Pojo model serialized from the xml/json/yaml.





[jira] [Commented] (TIKA-4249) EML file is treating it as text file in 3.9.2 version

2024-05-01 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842604#comment-17842604
 ] 

Tim Allison commented on TIKA-4249:
---

The example file shared was actually kind of weird. It looked like an mbox file 
but didn't have the "From " headers. It was just a concatenation of regular 
rfc822 messages with new lines between them.

This is now fixed in 2.x and 3.x. Thank you for opening this issue [~Vamsi452]!

> EML file is treating it as text file in 3.9.2 version
> -
>
> Key: TIKA-4249
> URL: https://issues.apache.org/jira/browse/TIKA-4249
> Project: Tika
>  Issue Type: Bug
>Reporter: Tika User
>Priority: Blocker
> Fix For: 3.0.0, 2.9.3
>
>
> We recently upgraded from 2.9.0 to 2.9.2 and found that the attached file is 
> treated as a text file instead of an email file. Please look into this issue.





[jira] [Updated] (TIKA-4249) EML file is treating it as text file in 2.9.2 version

2024-05-01 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-4249:
--
Summary: EML file is treating it as text file in 2.9.2 version  (was: EML 
file is treating it as text file in 3.9.2 version)

> EML file is treating it as text file in 2.9.2 version
> -
>
> Key: TIKA-4249
> URL: https://issues.apache.org/jira/browse/TIKA-4249
> Project: Tika
>  Issue Type: Bug
>Reporter: Tika User
>Priority: Blocker
> Fix For: 3.0.0, 2.9.3
>
>
> We recently upgraded from 2.9.0 to 2.9.2 and found that the attached file is 
> treated as a text file instead of an email file. Please look into this issue.





[jira] [Resolved] (TIKA-4249) EML file is treating it as text file in 3.9.2 version

2024-05-01 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-4249.
---
Fix Version/s: 3.0.0
   2.9.3
   Resolution: Fixed

> EML file is treating it as text file in 3.9.2 version
> -
>
> Key: TIKA-4249
> URL: https://issues.apache.org/jira/browse/TIKA-4249
> Project: Tika
>  Issue Type: Bug
>Reporter: Tika User
>Priority: Blocker
> Fix For: 3.0.0, 2.9.3
>
>
> We recently upgraded from 2.9.0 to 2.9.2 and found that the attached file is 
> treated as a text file instead of an email file. Please look into this issue.





[jira] [Commented] (TIKA-4249) EML file is treating it as text file in 3.9.2 version

2024-04-30 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842405#comment-17842405
 ] 

Tim Allison commented on TIKA-4249:
---

Files never cease to amaze!

Thank you. Onwards!

> EML file is treating it as text file in 3.9.2 version
> -
>
> Key: TIKA-4249
> URL: https://issues.apache.org/jira/browse/TIKA-4249
> Project: Tika
>  Issue Type: Bug
>Reporter: Tika User
>Priority: Blocker
> Attachments: Email_Received.txt
>
>
> We recently upgrade from 3.9.0 to 3.9.2. In that we found that the attached 
> file is treating it as text file instead of email file. please look into this 
> issue.





[jira] [Commented] (TIKA-4249) EML file is treating it as text file in 3.9.2 version

2024-04-30 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842402#comment-17842402
 ] 

Tim Allison commented on TIKA-4249:
---

Modifying the first hit from {{offset="0"}} to {{offset="0:3"}} works.
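[Editor's note] For readers following along, here is a hypothetical fragment in the style of tika-mimetypes.xml showing the shape of that change (the actual rfc822 entry in Tika has more match clauses; only the offset idea is illustrated here): an offset range of 0:3 lets the match start anywhere from byte 0 through byte 3, which covers a leading 3-byte UTF-8 BOM.

```xml
<!-- Hypothetical sketch in the style of tika-mimetypes.xml, not the actual entry -->
<mime-type type="message/rfc822">
  <magic priority="50">
    <!-- offset="0:3": the match may begin at byte 0, 1, 2, or 3,
         so a 3-byte UTF-8 BOM (EF BB BF) no longer defeats it -->
    <match value="Received:" type="string" offset="0:3"/>
  </magic>
</mime-type>
```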

> EML file is treating it as text file in 3.9.2 version
> -
>
> Key: TIKA-4249
> URL: https://issues.apache.org/jira/browse/TIKA-4249
> Project: Tika
>  Issue Type: Bug
>Reporter: Tika User
>Priority: Blocker
> Attachments: Email_Received.txt
>
>
> We recently upgrade from 3.9.0 to 3.9.2. In that we found that the attached 
> file is treating it as text file instead of email file. please look into this 
> issue.





[jira] [Commented] (TIKA-4249) EML file is treating it as text file in 3.9.2 version

2024-04-30 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842401#comment-17842401
 ] 

Tim Allison commented on TIKA-4249:
---

I'm guessing you mean 2.9.0->2.9.2.

The challenge with this file is that there's a UTF-8 BOM at the beginning of 
the file, so our matching on, e.g., "From:" at offset 0 does not work.

[~nick], any recommendations?
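[Editor's note] The failure mode can be sketched in a few lines of plain Java. This is not Tika's detection code; the class and method names below are made up for illustration. A magic string anchored strictly at offset 0 misses a file that starts with the 3-byte UTF-8 BOM, while a matcher that also tries the position right after the BOM finds it.

```java
import java.nio.charset.StandardCharsets;

public class BomAwareMagic {
    // UTF-8 byte order mark: EF BB BF
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    /** True if the file prefix starts with the magic at offset 0
     *  or immediately after a UTF-8 BOM (i.e. within offsets 0..3). */
    public static boolean matchesMagic(byte[] prefix, String magic) {
        byte[] m = magic.getBytes(StandardCharsets.US_ASCII);
        if (startsWith(prefix, m, 0)) {
            return true;
        }
        return startsWith(prefix, UTF8_BOM, 0) && startsWith(prefix, m, UTF8_BOM.length);
    }

    // Checks that data contains pattern starting at the given offset
    private static boolean startsWith(byte[] data, byte[] pattern, int offset) {
        if (data.length < offset + pattern.length) {
            return false;
        }
        for (int i = 0; i < pattern.length; i++) {
            if (data[offset + i] != pattern[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] plain = "Received: from mail.example.org".getBytes(StandardCharsets.US_ASCII);
        byte[] withBom = new byte[plain.length + UTF8_BOM.length];
        System.arraycopy(UTF8_BOM, 0, withBom, 0, UTF8_BOM.length);
        System.arraycopy(plain, 0, withBom, UTF8_BOM.length, plain.length);

        System.out.println(matchesMagic(plain, "Received:"));   // true
        System.out.println(matchesMagic(withBom, "Received:")); // true only with the BOM-aware check
    }
}
```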

> EML file is treating it as text file in 3.9.2 version
> -
>
> Key: TIKA-4249
> URL: https://issues.apache.org/jira/browse/TIKA-4249
> Project: Tika
>  Issue Type: Bug
>Reporter: Tika User
>Priority: Blocker
> Attachments: Email_Received.txt
>
>
> We recently upgrade from 3.9.0 to 3.9.2. In that we found that the attached 
> file is treating it as text file instead of email file. please look into this 
> issue.





[jira] [Created] (TIKA-4248) Improve PST handling of attachments

2024-04-29 Thread Tim Allison (Jira)
Tim Allison created TIKA-4248:
-

 Summary: Improve PST handling of attachments
 Key: TIKA-4248
 URL: https://issues.apache.org/jira/browse/TIKA-4248
 Project: Tika
  Issue Type: Task
Reporter: Tim Allison


The PST parser doesn't handle attachments in quite the same way as other 
parsers, which hinders analysis of attachments.

The problem is that the PST parser handles both the text content of an email 
and its embedded attachments, and it processes the attachments before the main 
body. These two behaviors break the normal patterns for embedded attachments 
in the RecursiveParserWrapper. For example, while the attachments are being 
processed, the RecursiveParserWrapper can't figure out what the embedded path 
through the "body" will be, because the body hasn't been parsed yet.

We should probably create a PSTMailItemParser that handles the content and the 
attachments like other parsers so that embedded paths can be maintained.

This will be a breaking change, and I'm targeting it only to the 3.x branch.





[jira] [Commented] (TIKA-4243) tika configuration overhaul

2024-04-26 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841252#comment-17841252
 ] 

Tim Allison commented on TIKA-4243:
---

https://json-schema.org/learn/getting-started-step-by-step

Yes, please. This. Without breaking changes and without adding dependencies to 
tika-core. :D

> tika configuration overhaul
> ---
>
> Key: TIKA-4243
> URL: https://issues.apache.org/jira/browse/TIKA-4243
> Project: Tika
>  Issue Type: New Feature
>  Components: config
>Affects Versions: 3.0.0
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> In 3.0.0 when dealing with Tika, it would greatly help to have a Typed 
> Configuration schema. 
> In 3.x can we remove the old way of doing configs and replace with Json 
> Schema?
> Json Schema can be converted to Pojos using a maven plugin 
> [https://github.com/joelittlejohn/jsonschema2pojo]
> This automatically creates a Java Pojo model we can use for the configs. 
> This can allow for the legacy tika-config XML to be read and converted to the 
> new pojos easily using an XML mapper so that users don't have to use JSON 
> configurations yet if they do not want.
> When complete, configurations can be set as XML, JSON or YAML
> tika-config.xml
> tika-config.json
> tika-config.yaml
> Replace all instances of tika config annotations that used the old syntax, 
> and replace with the Pojo model serialized from the xml/json/yaml.
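[Editor's note] As a purely illustrative sketch of the idea (no such schema exists yet; every key below is hypothetical, loosely echoing existing Tika parser parameter names), a typed tika-config.json backed by a JSON Schema / POJO model might look like:

```json
{
  "parsers": {
    "pdf": {
      "extractInlineImages": true,
      "ocrStrategy": "no_ocr"
    }
  },
  "metadataFilters": [
    {"class": "org.apache.tika.metadata.filter.FieldNameMappingFilter"}
  ]
}
```

Jackson databind could map that straight onto generated (or hand-written) POJOs, and an XML mapper could read the legacy tika-config.xml into the same objects, which is what keeps backwards compatibility on the table.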





[jira] [Comment Edited] (TIKA-4243) tika configuration overhaul

2024-04-26 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841242#comment-17841242
 ] 

Tim Allison edited comment on TIKA-4243 at 4/26/24 1:32 PM:


I really, really want to clean up our configuration, and moving to JSON makes 
sense. 

I agree we need to support the legacy config of 2.x in 3.x.

Is there a reason not to use plain old Jackson databind? What does 
jsonschema2pojo buy us?

Will this new capability live in tika-serialization?

It will be great to convert these config objects to Records in Java 17, er Tika 
4.x?

Would this allow us to get rid of our, ahem, baroque config processing code and 
still read 2.x configs?  I admit responsibility for the baroque config stuff, 
and I would really appreciate the opportunity to get rid of it asap... as long 
as we have backwards compatibility.

Thank you [~ndipiazza]!


was (Author: talli...@mitre.org):
I really, really want to clean up our configuration, and moving to JSON makes 
sense. 

I agree we need to support the legacy config of 2.x in 3.x.

Is there a reason not to use plain old Jackson databind? What does 
jsonschema2pojo buy us?

Will this new capability live in tika-serialization?

It will be great to convert these config objects to Records in Java 17, er Tika 
4.x?

Thank you [~ndipiazza]!

> tika configuration overhaul
> ---
>
> Key: TIKA-4243
> URL: https://issues.apache.org/jira/browse/TIKA-4243
> Project: Tika
>  Issue Type: New Feature
>  Components: config
>Affects Versions: 3.0.0
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> In 3.0.0 when dealing with Tika, it would greatly help to have a Typed 
> Configuration schema. 
> In 3.x can we remove the old way of doing configs and replace with Json 
> Schema?
> Json Schema can be converted to Pojos using a maven plugin 
> [https://github.com/joelittlejohn/jsonschema2pojo]
> This automatically creates a Java Pojo model we can use for the configs. 
> This can allow for the legacy tika-config XML to be read and converted to the 
> new pojos easily using an XML mapper so that users don't have to use JSON 
> configurations yet if they do not want.
> When complete, configurations can be set as XML, JSON or YAML
> tika-config.xml
> tika-config.json
> tika-config.yaml
> Replace all instances of tika config annotations that used the old syntax, 
> and replace with the Pojo model serialized from the xml/json/yaml.





[jira] [Commented] (TIKA-4243) tika configuration overhaul

2024-04-26 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841243#comment-17841243
 ] 

Tim Allison commented on TIKA-4243:
---

Oh, sorry. Does this break anything? Can we add this as a new capability to 
Tika 3.x after 3.x is released, or do we need to break APIs in a 3.0.0-BETA2 
release first?

> tika configuration overhaul
> ---
>
> Key: TIKA-4243
> URL: https://issues.apache.org/jira/browse/TIKA-4243
> Project: Tika
>  Issue Type: New Feature
>  Components: config
>Affects Versions: 3.0.0
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> In 3.0.0 when dealing with Tika, it would greatly help to have a Typed 
> Configuration schema. 
> In 3.x can we remove the old way of doing configs and replace with Json 
> Schema?
> Json Schema can be converted to Pojos using a maven plugin 
> [https://github.com/joelittlejohn/jsonschema2pojo]
> This automatically creates a Java Pojo model we can use for the configs. 
> This can allow for the legacy tika-config XML to be read and converted to the 
> new pojos easily using an XML mapper so that users don't have to use JSON 
> configurations yet if they do not want to.
> When complete, configurations can be set as XML, JSON or YAML
> tika-config.xml
> tika-config.json
> tika-config.yaml
> Replace all instances of tika config annotations that used the old syntax, 
> and replace with the Pojo model serialized from the xml/json/yaml.





[jira] [Commented] (TIKA-4243) tika configuration overhaul

2024-04-26 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841242#comment-17841242
 ] 

Tim Allison commented on TIKA-4243:
---

I really, really want to clean up our configuration, and moving to JSON makes 
sense. 

I agree we need to support the legacy config of 2.x in 3.x.

Is there a reason not to use plain old Jackson databind? What does 
jsonschema2pojo buy us?

Will this new capability live in tika-serialization?

It would be great to convert these config objects to Records in Java 17, er, Tika 
4.x?

Thank you [~ndipiazza]!
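Whichever tool generates the POJOs, the typed-config idea under discussion can be sketched with the standard library alone. The class and field names below are hypothetical; a real implementation would bind XML/JSON/YAML with Jackson databind or jsonschema2pojo-generated classes rather than a hand-written mapper.

```java
import java.util.Map;

// Hypothetical typed config POJO; in the real proposal this would be
// generated from a JSON Schema or bound via Jackson databind.
public class TypedConfigSketch {
    static final class ParserConfig {
        final String name;
        final int timeoutMs;
        ParserConfig(String name, int timeoutMs) {
            this.name = name;
            this.timeoutMs = timeoutMs;
        }
    }

    // Stand-in for an XML/JSON/YAML mapper: bind untyped keys to a typed object.
    static ParserConfig bind(Map<String, String> raw) {
        return new ParserConfig(
                raw.getOrDefault("name", "default"),
                Integer.parseInt(raw.getOrDefault("timeoutMs", "30000")));
    }

    public static void main(String[] args) {
        ParserConfig cfg = bind(Map.of("name", "pdf", "timeoutMs", "60000"));
        System.out.println(cfg.name + " " + cfg.timeoutMs); // prints: pdf 60000
    }
}
```

The point of the typed model is that consumers work with `cfg.timeoutMs` as an `int` instead of re-parsing strings scattered through annotation-driven config.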

> tika configuration overhaul
> ---
>
> Key: TIKA-4243
> URL: https://issues.apache.org/jira/browse/TIKA-4243
> Project: Tika
>  Issue Type: New Feature
>  Components: config
>Affects Versions: 3.0.0
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> In 3.0.0 when dealing with Tika, it would greatly help to have a Typed 
> Configuration schema. 
> In 3.x can we remove the old way of doing configs and replace with Json 
> Schema?
> Json Schema can be converted to Pojos using a maven plugin 
> [https://github.com/joelittlejohn/jsonschema2pojo]
> This automatically creates a Java Pojo model we can use for the configs. 
> This can allow for the legacy tika-config XML to be read and converted to the 
> new pojos easily using an XML mapper so that users don't have to use JSON 
> configurations yet if they do not want.
> When complete, configurations can be set as XML, JSON or YAML
> tika-config.xml
> tika-config.json
> tika-config.yaml
> Replace all instances of tika config annotations that used the old syntax, 
> and replace with the Pojo model serialized from the xml/json/yaml.





[jira] [Comment Edited] (TIKA-4245) Tika does not get html content properly

2024-04-26 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841221#comment-17841221
 ] 

Tim Allison edited comment on TIKA-4245 at 4/26/24 1:23 PM:


Oops, sorry. I didn't realize you sent your tika-config.xml. Y, one option is 
to turn off the HtmlEncodingDetector.

I confirmed that works on _this_ file.

Separately, see slide 19 of this presentation for some examples of when the 
HTMLEncodingDetector is a bad idea: 
https://www.slideshare.net/slideshow/evaluating-text-extraction-at-scale-a-case-study-from-apache-tika/238979661


was (Author: talli...@mitre.org):
Oops, sorry. I didn't realize you sent your tika-config.xml. Y, one option is 
to turn off the HtmlEncodingDetector.

> Tika does not get html content properly 
> 
>
> Key: TIKA-4245
> URL: https://issues.apache.org/jira/browse/TIKA-4245
> Project: Tika
>  Issue Type: Bug
>Reporter: Xiaohong Yang
>Priority: Major
> Attachments: Sample html file and tika config xml.zip
>
>
> We use org.apache.tika.parser.AutoDetectParser to get the content of html 
> files.  And we found out that it does not get the content of the sample file 
> properly.
> Following is the sample code and attached is the tika-config.xml and the 
> sample html file.  The content extracted with Tika reads 
> "㱨瑭氠硭汮猺景㴢桴瑰㨯⽷睷⹷㌮潲术ㄹ㤹⽘卌⽆潲浡琢㸍ਉ़桥慤㸼䵅呁⁨瑴瀭敱畩瘽≃潮瑥湴ⵔ祰攢⁣潮瑥湴㴢瑥硴…". That is different 
> from the native file.
>  
>  
> The operating system is Ubuntu 20.04. Java version is 21.  Tika version is 
> 2.9.2.   
>  {code:java}
> import org.apache.commons.io.FileUtils;
> import org.apache.tika.config.TikaConfig;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.apache.tika.sax.BodyContentHandler;
>  
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.PrintWriter;
> import java.nio.file.Files;
> import java.nio.file.Path;
> import java.nio.file.Paths;
>  
> public class ExtractTxtFromHtml {
> private static final Path inputFile = new 
> File("/home/ubuntu/testdirs/testdir_html/451434.html").toPath();
>  
> public static void main(String[] args) {
> extractText(false);
> extractText(true);
> }
>  
> static void extractText(boolean largeFile) {
> PrintWriter outputFileWriter = null;
> try {
> BodyContentHandler handler;
> Path outputFilePath = null;
>  
> if (largeFile) {
> // write tika output to disk
> outputFilePath = 
> Paths.get("/home/ubuntu/testdirs/testdir_html/tika_parse_output.txt");
> outputFileWriter = new 
> PrintWriter(Files.newOutputStream(outputFilePath));
> handler = new BodyContentHandler(outputFileWriter);
> } else {
> // stream it in memory
> handler = new BodyContentHandler(-1);
> }
>  
> Metadata metadata = new Metadata();
> FileInputStream inputData = new 
> FileInputStream(inputFile.toFile());
> TikaConfig config = new 
> TikaConfig("/home/ubuntu/testdirs/testdir_html/tika-config.xml");
> Parser autoDetectParser = new AutoDetectParser(config);
> ParseContext context = new ParseContext();
> context.set(TikaConfig.class, config);
> autoDetectParser.parse(inputData, handler, metadata, context);
>  
> String content;
> if (largeFile) {
> content = FileUtils.readFileToString(outputFilePath.toFile());
> }
> else {
> content = handler.toString();
> }
> System.out.println("content = " + content);
> }
> catch(Exception ex) {
> ex.printStackTrace();
> } finally {
> if (outputFileWriter != null) {
> outputFileWriter.close();
> }
> }
> }
> }
> {code}





[jira] [Commented] (TIKA-4245) Tika does not get html content properly

2024-04-26 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841221#comment-17841221
 ] 

Tim Allison commented on TIKA-4245:
---

Oops, sorry. I didn't realize you sent your tika-config.xml. Y, one option is 
to turn off the HtmlEncodingDetector.

> Tika does not get html content properly 
> 
>
> Key: TIKA-4245
> URL: https://issues.apache.org/jira/browse/TIKA-4245
> Project: Tika
>  Issue Type: Bug
>Reporter: Xiaohong Yang
>Priority: Major
> Attachments: Sample html file and tika config xml.zip
>
>
> We use org.apache.tika.parser.AutoDetectParser to get the content of html 
> files.  And we found out that it does not get the content of the sample file 
> properly.
> Following is the sample code and attached is the tika-config.xml and the 
> sample html file.  The content extracted with Tika reads 
> "㱨瑭氠硭汮猺景㴢桴瑰㨯⽷睷⹷㌮潲术ㄹ㤹⽘卌⽆潲浡琢㸍ਉ़桥慤㸼䵅呁⁨瑴瀭敱畩瘽≃潮瑥湴ⵔ祰攢⁣潮瑥湴㴢瑥硴…". That is different 
> from the native file.
>  
>  
> The operating system is Ubuntu 20.04. Java version is 21.  Tika version is 
> 2.9.2.   
>  {code:java}
> import org.apache.commons.io.FileUtils;
> import org.apache.tika.config.TikaConfig;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.apache.tika.sax.BodyContentHandler;
>  
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.PrintWriter;
> import java.nio.file.Files;
> import java.nio.file.Path;
> import java.nio.file.Paths;
>  
> public class ExtractTxtFromHtml {
> private static final Path inputFile = new 
> File("/home/ubuntu/testdirs/testdir_html/451434.html").toPath();
>  
> public static void main(String[] args) {
> extractText(false);
> extractText(true);
> }
>  
> static void extractText(boolean largeFile) {
> PrintWriter outputFileWriter = null;
> try {
> BodyContentHandler handler;
> Path outputFilePath = null;
>  
> if (largeFile) {
> // write tika output to disk
> outputFilePath = 
> Paths.get("/home/ubuntu/testdirs/testdir_html/tika_parse_output.txt");
> outputFileWriter = new 
> PrintWriter(Files.newOutputStream(outputFilePath));
> handler = new BodyContentHandler(outputFileWriter);
> } else {
> // stream it in memory
> handler = new BodyContentHandler(-1);
> }
>  
> Metadata metadata = new Metadata();
> FileInputStream inputData = new 
> FileInputStream(inputFile.toFile());
> TikaConfig config = new 
> TikaConfig("/home/ubuntu/testdirs/testdir_html/tika-config.xml");
> Parser autoDetectParser = new AutoDetectParser(config);
> ParseContext context = new ParseContext();
> context.set(TikaConfig.class, config);
> autoDetectParser.parse(inputData, handler, metadata, context);
>  
> String content;
> if (largeFile) {
> content = FileUtils.readFileToString(outputFilePath.toFile());
> }
> else {
> content = handler.toString();
> }
> System.out.println("content = " + content);
> }
> catch(Exception ex) {
> ex.printStackTrace();
> } finally {
> if (outputFileWriter != null) {
> outputFileWriter.close();
> }
> }
> }
> }
> {code}





[jira] [Commented] (TIKA-4245) Tika does not get html content properly

2024-04-26 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841220#comment-17841220
 ] 

Tim Allison commented on TIKA-4245:
---

This is an ongoing area for improvement in Tika.

The algorithm is to pick the first non-null charset. The default charset 
detectors are: html meta tags, Mozilla's UniversalCharDet, and ICU4J (if memory 
serves). So Tika is configured to trust the charset declared in the html if it 
exists. If you want to turn off this behavior and go with a purely statistical 
detector, you can configure UniversalCharDet and then ICU4J.

The solution that has been in the back of my mind for a long time now is a 
charset detector that runs the three detectors and then extracts text from an 
initial chunk of the document. It then picks the charset with the lowest 
out-of-vocabulary statistic. This is not yet implemented.

If you want to turn off the html tag detector, I can send a link.
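The "first non-null wins" composite described above can be sketched as follows. The detector lambdas are toy stand-ins, not Tika's actual HtmlEncodingDetector, UniversalCharDet, or ICU4J implementations, and the fallback charset is an assumption for the sketch.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.function.Function;

// Sketch of a composite detector: ask each detector in order and
// return the first non-null answer.
public class FirstNonNullCharset {
    static Charset detect(byte[] input, List<Function<byte[], Charset>> detectors) {
        for (Function<byte[], Charset> d : detectors) {
            Charset c = d.apply(input);   // each detector may return null
            if (c != null) {
                return c;                 // first confident answer wins
            }
        }
        return StandardCharsets.UTF_8;    // arbitrary fallback for the sketch
    }

    public static void main(String[] args) {
        // Toy "html meta tag" detector: trusts a declared charset if present.
        Function<byte[], Charset> htmlTag = b ->
                new String(b, StandardCharsets.ISO_8859_1).contains("charset=utf-16")
                        ? StandardCharsets.UTF_16 : null;
        // Toy statistical detector: always guesses ISO-8859-1.
        Function<byte[], Charset> statistical = b -> StandardCharsets.ISO_8859_1;

        byte[] doc = "<meta charset=utf-16>".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(detect(doc, List.of(htmlTag, statistical))); // prints: UTF-16
    }
}
```

Reordering the list so the statistical detector comes first is exactly the configuration change suggested above: the declared-charset detector then never gets a chance to win.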

> Tika does not get html content properly 
> 
>
> Key: TIKA-4245
> URL: https://issues.apache.org/jira/browse/TIKA-4245
> Project: Tika
>  Issue Type: Bug
>Reporter: Xiaohong Yang
>Priority: Major
> Attachments: Sample html file and tika config xml.zip
>
>
> We use org.apache.tika.parser.AutoDetectParser to get the content of html 
> files.  And we found out that it does not get the content of the sample file 
> properly.
> Following is the sample code and attached is the tika-config.xml and the 
> sample html file.  The content extracted with Tika reads 
> "㱨瑭氠硭汮猺景㴢桴瑰㨯⽷睷⹷㌮潲术ㄹ㤹⽘卌⽆潲浡琢㸍ਉ़桥慤㸼䵅呁⁨瑴瀭敱畩瘽≃潮瑥湴ⵔ祰攢⁣潮瑥湴㴢瑥硴…". That is different 
> from the native file.
>  
>  
> The operating system is Ubuntu 20.04. Java version is 21.  Tika version is 
> 2.9.2.   
>  {code:java}
> import org.apache.commons.io.FileUtils;
> import org.apache.tika.config.TikaConfig;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.apache.tika.sax.BodyContentHandler;
>  
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.PrintWriter;
> import java.nio.file.Files;
> import java.nio.file.Path;
> import java.nio.file.Paths;
>  
> public class ExtractTxtFromHtml {
> private static final Path inputFile = new 
> File("/home/ubuntu/testdirs/testdir_html/451434.html").toPath();
>  
> public static void main(String[] args) {
> extractText(false);
> extractText(true);
> }
>  
> static void extractText(boolean largeFile) {
> PrintWriter outputFileWriter = null;
> try {
> BodyContentHandler handler;
> Path outputFilePath = null;
>  
> if (largeFile) {
> // write tika output to disk
> outputFilePath = 
> Paths.get("/home/ubuntu/testdirs/testdir_html/tika_parse_output.txt");
> outputFileWriter = new 
> PrintWriter(Files.newOutputStream(outputFilePath));
> handler = new BodyContentHandler(outputFileWriter);
> } else {
> // stream it in memory
> handler = new BodyContentHandler(-1);
> }
>  
> Metadata metadata = new Metadata();
> FileInputStream inputData = new 
> FileInputStream(inputFile.toFile());
> TikaConfig config = new 
> TikaConfig("/home/ubuntu/testdirs/testdir_html/tika-config.xml");
> Parser autoDetectParser = new AutoDetectParser(config);
> ParseContext context = new ParseContext();
> context.set(TikaConfig.class, config);
> autoDetectParser.parse(inputData, handler, metadata, context);
>  
> String content;
> if (largeFile) {
> content = FileUtils.readFileToString(outputFilePath.toFile());
> }
> else {
> content = handler.toString();
> }
> System.out.println("content = " + content);
> }
> catch(Exception ex) {
> ex.printStackTrace();
> } finally {
> if (outputFileWriter != null) {
> outputFileWriter.close();
> }
> }
> }
> }
> {code}





[jira] [Resolved] (TIKA-4244) Tika identifies MIME type of ics files with html content as text/html

2024-04-25 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-4244.
---
Fix Version/s: 3.0.0
   2.9.3
   Resolution: Fixed

Thank you [~boomxlucifer]!

> Tika identifies MIME type of ics files with html content as text/html
> 
>
> Key: TIKA-4244
> URL: https://issues.apache.org/jira/browse/TIKA-4244
> Project: Tika
>  Issue Type: Bug
>Reporter: Kartik Jain
>Priority: Major
> Fix For: 3.0.0, 2.9.3
>
> Attachments: Sample.ics
>
>
> When the tika-core detect(InputStream input, Metadata metadata) API is used to 
> determine the MIME type of an ics file, it returns media type `text/html` 
> rather than `text/calendar`.
> For .ics files that have HTML content in them (the additional attribute 
> X-ALT-DESC;FMTTYPE=text/html), *tika-core* returns the MIME type of such 
> files as text/html. Ideally, it should come up as text/calendar, but 
> according to tika-core, text/html is not in the base types of text/calendar, so 
> it doesn't consider the text/calendar type; however, for all ics files the MIME 
> type should be text/calendar.





[jira] [Commented] (TIKA-4244) Tika identifies MIME type of ics files with html content as text/html

2024-04-25 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840852#comment-17840852
 ] 

Tim Allison commented on TIKA-4244:
---

Thank you [~boomxlucifer] for finding this and reporting it. The problem is 
that we were too strict in how close the "VERSION:2.0" had to be to the top of 
the file. I've fixed that in the above PR.
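The fix described above, loosening how close "VERSION:2.0" must be to the top of the file, can be illustrated with a simple prefix-window scan. The window size and method names here are illustrative assumptions, not the actual magic rules in tika-mimetypes.xml.

```java
import java.nio.charset.StandardCharsets;

// Sketch: treat a file as text/calendar if it starts with BEGIN:VCALENDAR
// and a VERSION:2.0 line appears anywhere in an initial window, rather than
// requiring it within the first few bytes.
public class IcsVersionWindow {
    static boolean looksLikeCalendar(byte[] prefix, int windowBytes) {
        int len = Math.min(prefix.length, windowBytes);
        String head = new String(prefix, 0, len, StandardCharsets.ISO_8859_1);
        return head.startsWith("BEGIN:VCALENDAR") && head.contains("VERSION:2.0");
    }

    public static void main(String[] args) {
        // Properties such as X-ALT-DESC can push VERSION:2.0 far from the top.
        String ics = "BEGIN:VCALENDAR\r\nPRODID:-//Example//EN\r\n"
                + "X-ALT-DESC;FMTTYPE=text/html:<html><body>hi</body></html>\r\n"
                + "VERSION:2.0\r\n";
        System.out.println(looksLikeCalendar(
                ics.getBytes(StandardCharsets.ISO_8859_1), 1024)); // prints: true
    }
}
```

With a too-small window, the embedded text/html attribute wins detection instead, which is the symptom reported in this issue.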

> Tika identifies MIME type of ics files with html content as text/html
> 
>
> Key: TIKA-4244
> URL: https://issues.apache.org/jira/browse/TIKA-4244
> Project: Tika
>  Issue Type: Bug
>Reporter: Kartik Jain
>Priority: Major
> Attachments: Sample.ics
>
>
> When the tika-core detect(InputStream input, Metadata metadata) API is used to 
> determine the MIME type of an ics file, it returns media type `text/html` 
> rather than `text/calendar`.
> For .ics files that have HTML content in them (the additional attribute 
> X-ALT-DESC;FMTTYPE=text/html), *tika-core* returns the MIME type of such 
> files as text/html. Ideally, it should come up as text/calendar, but 
> according to tika-core, text/html is not in the base types of text/calendar, so 
> it doesn't consider the text/calendar type; however, for all ics files the MIME 
> type should be text/calendar.





[jira] [Commented] (TIKA-4166) dependency updates for Tika 3.0

2024-04-22 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839780#comment-17839780
 ] 

Tim Allison commented on TIKA-4166:
---

 Thank you!

> dependency updates for Tika 3.0
> ---
>
> Key: TIKA-4166
> URL: https://issues.apache.org/jira/browse/TIKA-4166
> Project: Tika
>  Issue Type: Task
>  Components: build
>Reporter: Tilman Hausherr
>Priority: Minor
> Fix For: 3.0.0-BETA
>
>
> Separate ticket for updates for 3.0, especially those not found by dependabot.





[jira] [Resolved] (TIKA-4242) Tika depends on non-existing plexus-utils version

2024-04-17 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-4242.
---
Resolution: Fixed

> Tika depends on non-existing plexus-utils version
> -
>
> Key: TIKA-4242
> URL: https://issues.apache.org/jira/browse/TIKA-4242
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 3.0.0-BETA
>Reporter: Björn Kautler
>Priority: Major
>
> In [https://github.com/apache/tika/pull/1461] [~tallison] moved the versions 
> to Maven properties, but unfortunately he thereby upgraded {{plexus-utils}} 
> to {{5.0.0}} which does not exist.





[jira] [Commented] (TIKA-4242) Tika depends on non-existing plexus-utils version

2024-04-17 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838260#comment-17838260
 ] 

Tim Allison commented on TIKA-4242:
---

Looks like the reason we haven't found this problem is that we don't use it.

I _think_ we used to specify a version for that because a long time ago junrar 
used to bring it in. I don't see it anymore in {{dependency:tree}}. Can we just 
get rid of it entirely?

Thank you for opening the PR!

> Tika depends on non-existing plexus-utils version
> -
>
> Key: TIKA-4242
> URL: https://issues.apache.org/jira/browse/TIKA-4242
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 3.0.0-BETA
>Reporter: Björn Kautler
>Priority: Major
>
> In [https://github.com/apache/tika/pull/1461] [~tallison] moved the versions 
> to Maven properties, but unfortunately he thereby upgraded {{plexus-utils}} 
> to {{5.0.0}} which does not exist.





[jira] [Commented] (TIKA-4241) Consider handling LibreOffice's /AdditionalStreams "hybrid PDF" attachment embedding in PDFs

2024-04-16 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837806#comment-17837806
 ] 

Tim Allison commented on TIKA-4241:
---

They add a custom key in the trailer {{/AdditionalStreams}} whose value is a 
COSArray of [ mime, ref to bytes ].

> Consider handling LibreOffice's /AdditionalStreams "hybrid PDF" attachment 
> embedding in PDFs
> 
>
> Key: TIKA-4241
> URL: https://issues.apache.org/jira/browse/TIKA-4241
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
> Attachments: testPDF_additionalStreams.pdf
>
>
> Some info here: 
> https://stackoverflow.com/questions/67358370/what-the-standard-used-by-a-hybrid-pdf-file
> This looks like a one-off kind of LibreOffice thing that is probably not in 
> the spec... I haven't actually checked, but I trust mkl.
> Is it worth it?





[jira] [Updated] (TIKA-4241) Consider handling LibreOffice's /AdditionalStreams "hybrid PDF" attachment embedding in PDFs

2024-04-16 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-4241:
--
Attachment: testPDF_additionalStreams.pdf

> Consider handling LibreOffice's /AdditionalStreams "hybrid PDF" attachment 
> embedding in PDFs
> 
>
> Key: TIKA-4241
> URL: https://issues.apache.org/jira/browse/TIKA-4241
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
> Attachments: testPDF_additionalStreams.pdf
>
>
> Some info here: 
> https://stackoverflow.com/questions/67358370/what-the-standard-used-by-a-hybrid-pdf-file
> This looks like a one-off kind of LibreOffice thing that is probably not in 
> the spec... I haven't actually checked, but I trust mkl.
> Is it worth it?





[jira] [Created] (TIKA-4241) Consider handling LibreOffice's /AdditionalStreams "hybrid PDF" attachment embedding in PDFs

2024-04-16 Thread Tim Allison (Jira)
Tim Allison created TIKA-4241:
-

 Summary: Consider handling LibreOffice's /AdditionalStreams 
"hybrid PDF" attachment embedding in PDFs
 Key: TIKA-4241
 URL: https://issues.apache.org/jira/browse/TIKA-4241
 Project: Tika
  Issue Type: Task
Reporter: Tim Allison


Some info here: 
https://stackoverflow.com/questions/67358370/what-the-standard-used-by-a-hybrid-pdf-file

This looks like a one-off kind of thing that is probably not in the spec... I 
haven't actually checked, but I trust mkl.

Is it worth it?





[jira] [Updated] (TIKA-4241) Consider handling LibreOffice's /AdditionalStreams "hybrid PDF" attachment embedding in PDFs

2024-04-16 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-4241:
--
Description: 
Some info here: 
https://stackoverflow.com/questions/67358370/what-the-standard-used-by-a-hybrid-pdf-file

This looks like a one-off kind of LibreOffice thing that is probably not in the 
spec... I haven't actually checked, but I trust mkl.

Is it worth it?

  was:
Some info here: 
https://stackoverflow.com/questions/67358370/what-the-standard-used-by-a-hybrid-pdf-file

This looks like a one-off kind of thing that is probably not in the spec... I 
haven't actually checked, but I trust mkl.

Is it worth it?


> Consider handling LibreOffice's /AdditionalStreams "hybrid PDF" attachment 
> embedding in PDFs
> 
>
> Key: TIKA-4241
> URL: https://issues.apache.org/jira/browse/TIKA-4241
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> Some info here: 
> https://stackoverflow.com/questions/67358370/what-the-standard-used-by-a-hybrid-pdf-file
> This looks like a one-off kind of LibreOffice thing that is probably not in 
> the spec... I haven't actually checked, but I trust mkl.
> Is it worth it?





[jira] [Commented] (TIKA-4240) Change dependabot to weekly

2024-04-11 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17836228#comment-17836228
 ] 

Tim Allison commented on TIKA-4240:
---

Thank you, [~tilman]! Should I revert to daily?

> Change dependabot to weekly
> ---
>
> Key: TIKA-4240
> URL: https://issues.apache.org/jira/browse/TIKA-4240
> Project: Tika
>  Issue Type: Task
>  Components: build
>Reporter: Tim Allison
>Priority: Trivial
>
> On the list, I proposed this change. Some were in favor of dropping it back 
> to monthly. [~tilman] made the argument for the benefit of seeing problems 
> quickly and also acknowledged that it is a burden to merge the daily PRs.
> I propose bumping dependabot back to weekly for a bit, and we'll see how it 
> works as a middle ground.
> If anyone feels strongly about moving back to daily, we can do that.





[jira] [Resolved] (TIKA-4240) Change dependabot to weekly

2024-04-11 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-4240.
---
Resolution: Fixed

Let's see how this goes. Thank you!

> Change dependabot to weekly
> ---
>
> Key: TIKA-4240
> URL: https://issues.apache.org/jira/browse/TIKA-4240
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Trivial
>
> On the list, I proposed this change. Some were in favor of dropping it back 
> to monthly. [~tilman] made the argument for the benefit of seeing problems 
> quickly and also acknowledged that it is a burden to merge the daily PRs.
> I propose bumping dependabot back to weekly for a bit, and we'll see how it 
> works as a middle ground.
> If anyone feels strongly about moving back to daily, we can do that.





[jira] [Created] (TIKA-4240) Change dependabot to weekly

2024-04-11 Thread Tim Allison (Jira)
Tim Allison created TIKA-4240:
-

 Summary: Change dependabot to weekly
 Key: TIKA-4240
 URL: https://issues.apache.org/jira/browse/TIKA-4240
 Project: Tika
  Issue Type: Task
Reporter: Tim Allison


On the list, I proposed this change. Some were in favor of dropping it back to 
monthly. [~tilman] made the argument for the benefit of seeing problems quickly 
and also acknowledged that it is a burden to merge the daily PRs.

I propose bumping dependabot back to weekly for a bit, and we'll see how it 
works as a middle ground.

If anyone feels strongly about moving back to daily, we can do that.





[jira] [Updated] (TIKA-4233) Check tika-helm for deprecated k8s APIs

2024-04-06 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-4233:
--
Fix Version/s: (was: 3.0.0)

> Check tika-helm for deprecated k8s APIs
> ---
>
> Key: TIKA-4233
> URL: https://issues.apache.org/jira/browse/TIKA-4233
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-helm
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
>
> It is useful to know when a Helm Chart uses deprecated k8s APIs. A check for 
> this would be ideal. The “Check deprecated k8s APIs” GitHub action 
> accomplishes this.
> [https://github.com/marketplace/actions/check-deprecated-k8s-apis]





[jira] [Updated] (TIKA-4232) Create and execute unit tests for tika-helm

2024-04-06 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-4232:
--
Fix Version/s: (was: 3.0.0)

> Create and execute unit tests for tika-helm
> ---
>
> Key: TIKA-4232
> URL: https://issues.apache.org/jira/browse/TIKA-4232
> Project: Tika
>  Issue Type: Improvement
>  Components: tika-helm
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
>
> The goal is to execute chart unit tests against each tika-helm pull request.
> I found the [Helm Unit 
> Tests|https://github.com/marketplace/actions/helm-unit-tests] GitHub Action, 
> which uses [https://github.com/helm-unittest/helm-unittest] as a Helm plugin.
> The PR will consist of one or more unit tests automated via the GitHub action.





[jira] [Resolved] (TIKA-4219) Figure out what to do with epubs with encrypted non-core content

2024-04-06 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-4219.
---
Fix Version/s: 2.9.2
   Resolution: Fixed

> Figure out what to do with epubs with encrypted non-core content
> 
>
> Key: TIKA-4219
> URL: https://issues.apache.org/jira/browse/TIKA-4219
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
> Fix For: 2.9.2
>
>
> On TIKA-4218, we noticed several epubs that were now being identified as 
> encrypted, which is good. We did this work on TIKA-4176.
> On the other hand, we found several epubs that were now identified as 
> encrypted but which had content before we were doing the encryption detection.
> The issue in at least one file that I reviewed is that non-core content is 
> encrypted -- the fonts. So, from a text+metadata extraction, we could still 
> get all the content and then throw an Encrypted Exception or maybe flag 
> something as encrypted.
> I'm not sure what the best thing to do is in this case.
> An example file is here: 
> http://corpora.tika.apache.org/base/docs/commoncrawl3/47/47WOSBEUHE6CRMVDFBOOHUD36FEQAZ6T





[jira] [Updated] (TIKA-4233) Check tika-helm for deprecated k8s APIs

2024-04-06 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-4233:
--
Fix Version/s: 3.0.0
   (was: 2.9.2)

> Check tika-helm for deprecated k8s APIs
> ---
>
> Key: TIKA-4233
> URL: https://issues.apache.org/jira/browse/TIKA-4233
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-helm
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 3.0.0
>
>
> It is useful to know when a Helm Chart uses deprecated k8s APIs. A check for 
> this would be ideal. The “Check deprecated k8s APIs” GitHub action 
> accomplishes this.
> [https://github.com/marketplace/actions/check-deprecated-k8s-apis]





[jira] [Updated] (TIKA-4232) Create and execute unit tests for tika-helm

2024-04-06 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-4232:
--
Fix Version/s: 3.0.0
   (was: 2.9.2)

> Create and execute unit tests for tika-helm
> ---
>
> Key: TIKA-4232
> URL: https://issues.apache.org/jira/browse/TIKA-4232
> Project: Tika
>  Issue Type: Improvement
>  Components: tika-helm
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 3.0.0
>
>
> The goal is to execute chart unit tests against each tika-helm pull request.
> I found the [Helm Unit 
> Tests|https://github.com/marketplace/actions/helm-unit-tests] GitHub Action, 
> which uses [https://github.com/helm-unittest/helm-unittest] as a Helm plugin.
> The PR will consist of one or more unit tests automated via the GitHub action.





[jira] [Created] (TIKA-4235) Add pipeline parameter to OpenSearch emitter

2024-04-04 Thread Tim Allison (Jira)
Tim Allison created TIKA-4235:
-

 Summary: Add pipeline parameter to OpenSearch emitter
 Key: TIKA-4235
 URL: https://issues.apache.org/jira/browse/TIKA-4235
 Project: Tika
  Issue Type: New Feature
Reporter: Tim Allison








[jira] [Resolved] (TIKA-4234) Further improvements to jdbc pipes reporter

2024-04-04 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-4234.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

> Further improvements to jdbc pipes reporter
> ---
>
> Key: TIKA-4234
> URL: https://issues.apache.org/jira/browse/TIKA-4234
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Trivial
> Fix For: 3.0.0
>
>
> Allow users to set the table name.
> Allow users to choose whether or not to drop+create the table via the 
> reporter or whether they're responsible for creating the table.
> Allow users to configure insert/upsert/update. The default is "insert id, 
> status, timestamp".
> This and the earlier jdbc reporter introduce breaking changes and will only 
> be applied to 3.x.





[jira] [Created] (TIKA-4234) Further improvements to jdbc pipes reporter

2024-04-04 Thread Tim Allison (Jira)
Tim Allison created TIKA-4234:
-

 Summary: Further improvements to jdbc pipes reporter
 Key: TIKA-4234
 URL: https://issues.apache.org/jira/browse/TIKA-4234
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison


Allow users to set the table name.

Allow users to choose whether or not to drop+create the table via the reporter 
or whether they're responsible for creating the table.

Allow users to configure insert/upsert/update. The default is "insert id, 
status, timestamp".

This and the earlier jdbc reporter introduce breaking changes and will only be 
applied to 3.x.
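For illustration, configuration along the lines of the sketch below is what the ticket implies; the param names (tableName, createTable) and the exact reporter class name are assumptions for illustration, not the committed 3.x API:

```xml
<!-- Hypothetical sketch of a jdbc pipes reporter config; param names are
     illustrative only and should be checked against the 3.x documentation. -->
<pipesReporter class="org.apache.tika.pipes.reporters.jdbc.JDBCPipesReporter">
  <params>
    <param name="connection" type="string">jdbc:postgresql://localhost/tika</param>
    <param name="tableName" type="string">tika_status</param>   <!-- assumed name -->
    <param name="createTable" type="bool">false</param>         <!-- assumed name -->
  </params>
</pipesReporter>
```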





[jira] [Commented] (TIKA-4231) Parsing Arabic PDF is returning bad data

2024-04-03 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833745#comment-17833745
 ] 

Tim Allison commented on TIKA-4231:
---

On some PDFs, there can be problems with Unicode mappings and other glyph 
issues. For some of these files, they render well but the underlying electronic 
text is junk. In those cases, OCR is the best option.

I haven’t looked at this pdf and don’t know if the above is the case for this 
one.
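When OCR is the answer, one way to force it in Tika 2.x is via tika-config.xml; this is a sketch that assumes Tesseract is installed and on the path, and the strategy values should be checked against the PDFParser documentation:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <!-- strategies include: no_ocr | ocr_only | ocr_and_text | auto -->
        <param name="ocrStrategy" type="string">ocr_only</param>
      </params>
    </parser>
  </parsers>
</properties>
```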

> Parsing Arabic PDF is returning bad data
> 
>
> Key: TIKA-4231
> URL: https://issues.apache.org/jira/browse/TIKA-4231
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 2.6.0, 2.9.1
> Environment: Java 18, using the Maven dependency 
> tika-parsers-standard-package 
> (https://mvnrepository.com/artifact/org.apache.tika/tika-parsers/2.6.0)
>  
>Reporter: Aamir
>Priority: Major
> Attachments: arabic-pdfbox.txt, arabic.pdf, arabic.txt
>
>
> Attached is a PDF with Arabic text in it. 
> When parsed using Tika version 2.6.0 or 2.9.1, it produces gibberish 
> characters. 
> The generated text doc, which contains the parsed text, is also attached. 
> Most of the other Arabic PDFs parse fine, but this one gives this output. 





[jira] [Comment Edited] (TIKA-4231) Parsing Arabic PDF is returning bad data

2024-04-03 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833745#comment-17833745
 ] 

Tim Allison edited comment on TIKA-4231 at 4/3/24 9:18 PM:
---

On some PDFs, there can be problems with Unicode mappings and other glyph/font 
issues. For some of these files, they render well but the underlying electronic 
text is junk. In those cases, OCR is the best option.

I haven’t looked at this pdf and don’t know if the above is the case for this 
one.


was (Author: talli...@mitre.org):
On some PDFs, there can be problems with Unicode mappings and other glyph 
issues. For some of these files, they render well but the underlying electronic 
text is junk. In those cases, OCR is the best option.

I haven’t looked at this pdf and don’t know if the above is the case for this 
one.

> Parsing Arabic PDF is returning bad data
> 
>
> Key: TIKA-4231
> URL: https://issues.apache.org/jira/browse/TIKA-4231
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 2.6.0, 2.9.1
> Environment: Java 18, using the Maven dependency 
> tika-parsers-standard-package 
> (https://mvnrepository.com/artifact/org.apache.tika/tika-parsers/2.6.0)
>  
>Reporter: Aamir
>Priority: Major
> Attachments: arabic-pdfbox.txt, arabic.pdf, arabic.txt
>
>
> Attached is a PDF with Arabic text in it. 
> When parsed using Tika version 2.6.0 or 2.9.1, it produces gibberish 
> characters. 
> The generated text doc, which contains the parsed text, is also attached. 
> Most of the other Arabic PDFs parse fine, but this one gives this output. 





[jira] [Commented] (TIKA-4231) Parsing Arabic PDF is returning bad data

2024-04-02 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833344#comment-17833344
 ] 

Tim Allison commented on TIKA-4231:
---

If you run Poppler's pdftotext against the file or copy and paste out of Adobe 
Reader into a text file, do you get higher quality text?

> Parsing Arabic PDF is returning bad data
> 
>
> Key: TIKA-4231
> URL: https://issues.apache.org/jira/browse/TIKA-4231
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 2.6.0, 2.9.1
> Environment: Java 18, using the Maven dependency 
> tika-parsers-standard-package 
> (https://mvnrepository.com/artifact/org.apache.tika/tika-parsers/2.6.0)
>  
>Reporter: Aamir
>Priority: Major
> Attachments: arabic-pdfbox.txt, arabic.pdf, arabic.txt
>
>
> Attached is a PDF with Arabic text in it. 
> When parsed using Tika version 2.6.0 or 2.9.1, it produces gibberish 
> characters. 
> The generated text doc, which contains the parsed text, is also attached. 
> Most of the other Arabic PDFs parse fine, but this one gives this output. 





[jira] [Resolved] (TIKA-4207) PipesParser should have option to extract raw bytes of embedded files

2024-03-28 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-4207.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

> PipesParser should have option to extract raw bytes of embedded files
> -
>
> Key: TIKA-4207
> URL: https://issues.apache.org/jira/browse/TIKA-4207
> Project: Tika
>  Issue Type: New Feature
>Reporter: Tim Allison
>Priority: Major
> Fix For: 3.0.0
>
>
> There are many use cases where text+metadata are important, but users also 
> need the raw bytes from embedded files.
> Let's make it possible to extract the usual rmeta content _and_ the raw 
> bytes. This is a preliminary step that will offer more customization options 
> than the proposal in TIKA-3703.
> This is targeted to 3.x.





[jira] [Commented] (TIKA-4207) PipesParser should have option to extract raw bytes of embedded files

2024-03-28 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831815#comment-17831815
 ] 

Tim Allison commented on TIKA-4207:
---

There are some areas for simplification, but I think this is good enough to go 
for now.

> PipesParser should have option to extract raw bytes of embedded files
> -
>
> Key: TIKA-4207
> URL: https://issues.apache.org/jira/browse/TIKA-4207
> Project: Tika
>  Issue Type: New Feature
>Reporter: Tim Allison
>Priority: Major
>
> There are many use cases where text+metadata are important, but users also 
> need the raw bytes from embedded files.
> Let's make it possible to extract the usual rmeta content _and_ the raw 
> bytes. This is a preliminary step that will offer more customization options 
> than the proposal in TIKA-3703.
> This is targeted to 3.x.





[jira] [Commented] (TIKA-4228) Tika parser crashes JVM when it gets metadata and embedded objects from pdf

2024-03-27 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831549#comment-17831549
 ] 

Tim Allison commented on TIKA-4228:
---

Sometimes the operating system kills a process. When that happens, the user 
experiences a crash with no logs from the java application, but the operating 
system logs that it killed a process. See the link above.

Are you running multithreaded? Can you tell what the exit value is from the 
process?
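To make the exit-value question concrete, here is a small stdlib-only sketch (illustrative, not Tika code): on Linux, an exit value of 128 + N conventionally means the child was terminated by signal N, so 137 (= 128 + 9, SIGKILL) is the usual signature of the OOM killer.

```java
public class ExitValueCheck {

    // Linux convention: exit value 128 + N means the child was terminated
    // by signal N (e.g. 137 = 128 + 9 = SIGKILL, typical of the OOM killer).
    static boolean killedBySignal(int exitValue) {
        return exitValue > 128 && exitValue < 128 + 65;
    }

    static String describe(int exitValue) {
        return killedBySignal(exitValue)
                ? "terminated by signal " + (exitValue - 128)
                : "exited normally with status " + exitValue;
    }

    public static void main(String[] args) throws Exception {
        // Spawn a child and inspect its exit value (requires a POSIX shell).
        Process p = new ProcessBuilder("sh", "-c", "exit 0").start();
        int exit = p.waitFor();
        System.out.println(exit + ": " + describe(exit));
        System.out.println(137 + ": " + describe(137));
    }
}
```

If the parent JVM logs nothing and the child's exit value is in the 129+ range, check the kernel logs (`dmesg`, `journalctl`) for an OOM-killer entry.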

> Tika parser crashes JVM when it gets metadata and embedded objects from pdf
> ---
>
> Key: TIKA-4228
> URL: https://issues.apache.org/jira/browse/TIKA-4228
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 2.9.0
>Reporter: Xiaohong Yang
>Priority: Major
> Attachments: tika-config-and-sample-file.zip
>
>
> [^tika-config-and-sample-file.zip]
>  
> We use org.apache.tika.parser.AutoDetectParser to get metadata and embedded 
> objects from PDF documents, and we found that it crashes the program (or 
> the JVM) when it gets metadata and embedded files from the sample PDF file.
>  
> Below is the sample code; attached are the tika-config.xml and the 
> sample PDF file. Note that the sample file crashes the JVM in 1 out of 10 
> runs in our production environment. Sometimes it happens when it gets 
> metadata and sometimes when it extracts embedded files (the chances are 
> about 50/50).
>  
> The operating system is Ubuntu 20.04. Java version is 21. Tika version is 
> 2.9.0 and POI version is 5.2.3.
>  
>  
> import org.apache.pdfbox.io.IOUtils;
> import org.apache.poi.poifs.filesystem.DirectoryEntry;
> import org.apache.poi.poifs.filesystem.DocumentEntry;
> import org.apache.poi.poifs.filesystem.DocumentInputStream;
> import org.apache.poi.poifs.filesystem.POIFSFileSystem;
> import org.apache.tika.config.TikaConfig;
> import org.apache.tika.detect.Detector;
> import org.apache.tika.extractor.EmbeddedDocumentExtractor;
> import org.apache.tika.io.FilenameUtils;
> import org.apache.tika.io.TikaInputStream;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.metadata.TikaCoreProperties;
> import org.apache.tika.mime.MediaType;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.apache.tika.sax.BodyContentHandler;
> import org.xml.sax.ContentHandler;
> import org.xml.sax.SAXException;
> import org.xml.sax.helpers.DefaultHandler;
>  
> import java.io.*;
> import java.net.URL;
> import java.nio.file.Files;
> import java.nio.file.Path;
> import java.nio.file.Paths;
>  
> public class ProcessPdf {
>     private final Path inputFile =
>             new File("/home/ubuntu/testdirs/testdir_pdf/sample.pdf").toPath();
>     private final Path outputDir =
>             new File("/home/ubuntu/testdirs/testdir_pdf/tika_output/").toPath();
>
>     private Parser parser;
>     private ParseContext context;
>
>     public static void main(String[] args) {
>         try {
>             System.out.println("Start");
>             ProcessPdf processPdf = new ProcessPdf();
>             System.out.println("Get metadata");
>             processPdf.getMataData();
>             System.out.println("Extract embedded files");
>             processPdf.extract();
>             System.out.println("End");
>         } catch (Exception ex) {
>             ex.printStackTrace();
>         }
>     }
>
>     public ProcessPdf() {
>     }
>
>     public void getMataData() throws Exception {
>         BodyContentHandler handler = new BodyContentHandler(-1);
>         Metadata metadata = new Metadata();
>         try (FileInputStream inputData = new FileInputStream(inputFile.toString())) {
>             TikaConfig config =
>                     new TikaConfig("/home/ubuntu/testdirs/testdir_pdf/tika-config.xml");
>             Parser autoDetectParser = new AutoDetectParser(config);
>             ParseContext context = new ParseContext();
>             context.set(TikaConfig.class, config);
>             autoDetectParser.parse(inputData, handler, metadata, context);
>         }
>         String content = handler.toString();
>     }
>
>     public void extract() throws Exception {
>         TikaConfig config =
>                 new TikaConfig("/home/ubuntu/testdirs/testdir_pdf/tika-config.xml");
>         ProcessPdf.FileEmbeddedDocumentExtractor fileEmbeddedDocumentExtractor =
>                 new ProcessPdf.FileEmbeddedDocumentExtractor();
>
>         parser = new AutoDetectParser(config);
>         context = new ParseContext();
>         context.set(Parser.class, parser);
>         context.set(TikaConfig.class, config);
>         context.set(EmbeddedDocumentExtractor.class, fileEmbeddedDocumentExtractor);
>
>         URL url = inputFile.toUri().toURL();
>    

[jira] [Comment Edited] (TIKA-4228) Tika parser crashes JVM when it gets metadata and embedded objects from pdf

2024-03-27 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831524#comment-17831524
 ] 

Tim Allison edited comment on TIKA-4228 at 3/27/24 8:29 PM:


What's the exit code? -Are you on a system with an oom killer or other process 
killer-, and if so, do the logs suggest that the OS killed the process?

Sorry, Ubuntu, right. Anything in the logs? 
https://www.baeldung.com/linux/what-killed-a-process


was (Author: talli...@mitre.org):
What's the exit code? Are you on a system with an oom killer or other process 
killer, and if so, what do its logs say?

> Tika parser crashes JVM when it gets metadata and embedded objects from pdf
> ---
>
> Key: TIKA-4228
> URL: https://issues.apache.org/jira/browse/TIKA-4228
> Project: Tika
>  Issue Type: Bug
>Reporter: Xiaohong Yang
>Priority: Major
> Attachments: tika-config-and-sample-file.zip
>
>

[jira] [Comment Edited] (TIKA-4228) Tika parser crashes JVM when it gets metadata and embedded objects from pdf

2024-03-27 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831518#comment-17831518
 ] 

Tim Allison edited comment on TIKA-4228 at 3/27/24 8:25 PM:


As I think about it, this code wouldn't extract all of the embedded images in 
the PDF...so that's not a concern...you'd have to turn on extractInlineImages.

I can run getMetadata() with -Xmx256m with no problems with the current 
branch_2x.

If I roll back to PDFBox 2.0.29, which we used in Tika 2.9.0 and run Java 
corretto 21, I'm still not able to repro any crashes with metadata or file 
extract even if I multithread it and run continuous loops.


was (Author: talli...@mitre.org):
As I think about it, this code wouldn't extract all of the embedded images in 
the PDF...so that's not a concern...you'd have to turn on extractInlineImages.

I can run getMetadata() with -Xmx256m with no problems with the current 
branch_2x.

> Tika parser crashes JVM when it gets metadata and embedded objects from pdf
> ---
>
> Key: TIKA-4228
> URL: https://issues.apache.org/jira/browse/TIKA-4228
> Project: Tika
>  Issue Type: Bug
>Reporter: Xiaohong Yang
>Priority: Major
> Attachments: tika-config-and-sample-file.zip
>
>

[jira] [Commented] (TIKA-4228) Tika parser crashes JVM when it gets metadata and embedded objects from pdf

2024-03-27 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831524#comment-17831524
 ] 

Tim Allison commented on TIKA-4228:
---

What's the exit code? Are you on a system with an oom killer or other process 
killer, and if so, what do its logs say?

> Tika parser crashes JVM when it gets metadata and embedded objects from pdf
> ---
>
> Key: TIKA-4228
> URL: https://issues.apache.org/jira/browse/TIKA-4228
> Project: Tika
>  Issue Type: Bug
>Reporter: Xiaohong Yang
>Priority: Major
> Attachments: tika-config-and-sample-file.zip
>
>

[jira] [Comment Edited] (TIKA-4228) Tika parser crashes JVM when it gets metadata and embedded objects from pdf

2024-03-27 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831520#comment-17831520
 ] 

Tim Allison edited comment on TIKA-4228 at 3/27/24 7:59 PM:


Unsolicited advice: 
* I'd encourage you to open your InputStream with TikaInputStream.get(file). 
* I would not recommend creating a new TikaConfig for each embedded file. That 
will be horribly inefficient on files where you do run into embedded files.
* I'd encourage you to use some of the safer methods so that a parser+jvm crash 
isn't a problem for you: 
https://cwiki.apache.org/confluence/display/TIKA/The+Robustness+of+Apache+Tika

Finally, see TIKA-4207 for a new capability that will extract both embedded 
bytes and content+metadata.
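To make the first two bullets concrete, here is a sketch of reusing one TikaConfig and opening input via TikaInputStream (written from memory against the Tika 2.x API; verify against the javadocs before copying, and the file paths are placeholders):

```java
// Sketch only: build TikaConfig/AutoDetectParser ONCE and reuse them for
// every file, and open input with TikaInputStream so Tika can spool and
// re-read the stream as needed.
TikaConfig config = new TikaConfig("/path/to/tika-config.xml"); // build once
Parser parser = new AutoDetectParser(config);

try (TikaInputStream tis = TikaInputStream.get(Paths.get("sample.pdf"))) {
    Metadata metadata = new Metadata();
    BodyContentHandler handler = new BodyContentHandler(-1);
    parser.parse(tis, handler, metadata, new ParseContext());
}
```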


was (Author: talli...@mitre.org):
As a side note, I'd encourage you to open your InputStream with 
TikaInputStream.get(file). Further, I would not recommend creating a new 
TikaConfig for each embedded file. That will be horribly inefficient on files 
where you do run into embedded files.

> Tika parser crashes JVM when it gets metadata and embedded objects from pdf
> ---
>
> Key: TIKA-4228
> URL: https://issues.apache.org/jira/browse/TIKA-4228
> Project: Tika
>  Issue Type: Bug
>Reporter: Xiaohong Yang
>Priority: Major
> Attachments: tika-config-and-sample-file.zip
>
>

[jira] [Commented] (TIKA-4228) Tika parser crashes JVM when it gets metadata and embedded objects from pdf

2024-03-27 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831520#comment-17831520
 ] 

Tim Allison commented on TIKA-4228:
---

As a side note, I'd encourage you to open your InputStream with 
TikaInputStream.get(file). Further, I would not recommend creating a new 
TikaConfig for each file. That will be horribly inefficient on runs where you 
do encounter embedded files.
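A minimal sketch of that advice, assuming Tika 2.x on the classpath (the class name ReusableTikaExample and the paths are hypothetical, not from the original report): build the TikaConfig and Parser once, reuse them for every file, and open the input with TikaInputStream.get(Path) so parsers that need file access get it.

```java
import java.nio.file.Path;

import org.apache.tika.config.TikaConfig;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

public class ReusableTikaExample {
    // Built once and reused across files -- not recreated per parse.
    private final Parser parser;

    public ReusableTikaExample(Path configXml) throws Exception {
        TikaConfig config = new TikaConfig(configXml);
        this.parser = new AutoDetectParser(config);
    }

    public String parse(Path file) throws Exception {
        BodyContentHandler handler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();
        ParseContext context = new ParseContext();
        context.set(Parser.class, parser);
        // TikaInputStream exposes the underlying file to parsers that
        // can use it, instead of a plain FileInputStream.
        try (TikaInputStream stream = TikaInputStream.get(file)) {
            parser.parse(stream, handler, metadata, context);
        }
        return handler.toString();
    }
}
```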

> Tika parser crashes JVM when it gets metadata and embedded objects from pdf
> ---
>
> Key: TIKA-4228
> URL: https://issues.apache.org/jira/browse/TIKA-4228
> Project: Tika
>  Issue Type: Bug
>Reporter: Xiaohong Yang
>Priority: Major
> Attachments: tika-config-and-sample-file.zip
>
>
> [^tika-config-and-sample-file.zip]
>  
> We use org.apache.tika.parser.AutoDetectParser to get metadata and embedded 
> objects from pdf documents.  And we found out that it crashes the program (or 
> the JVM) when it gets metadata and embedded files from the sample pdf file.
>  
> Following is the sample code and attached is the tika-config.xml and the 
> sample pdf file. Note that the sample file crashes the JVM in 1 out of 10 
> runs in our production environment.  Sometimes it happens when it gets 
> metadata and sometimes it happens when it extracts embedded files (the 
> chances are about 50/50).
>  
> The operating system is Ubuntu 20.04. Java version is 21.  Tika version is 
> 2.9.0 and POI version is 5.2.3.   
>  
>  
> import org.apache.pdfbox.io.IOUtils;
> import org.apache.poi.poifs.filesystem.DirectoryEntry;
> import org.apache.poi.poifs.filesystem.DocumentEntry;
> import org.apache.poi.poifs.filesystem.DocumentInputStream;
> import org.apache.poi.poifs.filesystem.POIFSFileSystem;
> import org.apache.tika.config.TikaConfig;
> import org.apache.tika.detect.Detector;
> import org.apache.tika.extractor.EmbeddedDocumentExtractor;
> import org.apache.tika.io.FilenameUtils;
> import org.apache.tika.io.TikaInputStream;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.metadata.TikaCoreProperties;
> import org.apache.tika.mime.MediaType;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.apache.tika.sax.BodyContentHandler;
> import org.xml.sax.ContentHandler;
> import org.xml.sax.SAXException;
> import org.xml.sax.helpers.DefaultHandler;
>  
> import java.io.*;
> import java.net.URL;
> import java.nio.file.Files;
> import java.nio.file.Path;
> import java.nio.file.Paths;
>  
> public class ProcessPdf {
>     private final Path inputFile = new File("/home/ubuntu/testdirs/testdir_pdf/sample.pdf").toPath();
>     private final Path outputDir = new File("/home/ubuntu/testdirs/testdir_pdf/tika_output/").toPath();
>
>     private Parser parser;
>     private ParseContext context;
>
>     public static void main(String args[]) {
>         try {
>             System.out.println("Start");
>             ProcessPdf processPdf = new ProcessPdf();
>             System.out.println("Get metadata");
>             processPdf.getMataData();
>             System.out.println("Extract embedded files");
>             processPdf.extract();
>             System.out.println("End");
>         } catch (Exception ex) {
>             ex.printStackTrace();
>         }
>     }
>
>     public ProcessPdf() {
>     }
>
>     public void getMataData() throws Exception {
>         BodyContentHandler handler = new BodyContentHandler(-1);
>
>         Metadata metadata = new Metadata();
>         try (FileInputStream inputData = new FileInputStream(inputFile.toString())) {
>             TikaConfig config = new TikaConfig("/home/ubuntu/testdirs/testdir_pdf/tika-config.xml");
>             Parser autoDetectParser = new AutoDetectParser(config);
>             ParseContext context = new ParseContext();
>             context.set(TikaConfig.class, config);
>             autoDetectParser.parse(inputData, handler, metadata, context);
>         }
>
>         String content = handler.toString();
>     }
>
>     public void extract() throws Exception {
>         TikaConfig config = new TikaConfig("/home/ubuntu/testdirs/testdir_pdf/tika-config.xml");
>         ProcessPdf.FileEmbeddedDocumentExtractor fileEmbeddedDocumentExtractor = new ProcessPdf.FileEmbeddedDocumentExtractor();
>
>         parser = new AutoDetectParser(config);
>         context = new ParseContext();
>         context.set(Parser.class, parser);
>         context.set(TikaConfig.class, config);
>         context.set(EmbeddedDocumentExtractor.class, fileEmbeddedDocumentExtractor);
>
>         URL url = inputFile.toUri().toURL();
>         Metadata metadata = new Metadata();
>         try (InputStream 

[jira] [Comment Edited] (TIKA-4228) Tika parser crashes JVM when it gets metadata and embedded objects from pdf

2024-03-27 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831518#comment-17831518
 ] 

Tim Allison edited comment on TIKA-4228 at 3/27/24 7:57 PM:


As I think about it, this code wouldn't extract all of the embedded images in 
the PDF...so that's not a concern...you'd have to turn on extractInlineImages.

I can run getMetadata() with -Xmx256m with no problems with the current 
branch_2x.


was (Author: talli...@mitre.org):
As I think about it, this code wouldn't extract all of the embedded images in 
the PDF...so that's not a concern...you'd have to turn on extractInlineImages.

I can run getMetadata() with -Xmx256m with no problems.
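For reference, inline-image extraction is switched on in tika-config.xml by configuring the PDF parser; a sketch of that fragment, following the documented Tika 2.x pattern (paths and surrounding config are assumptions, not taken from the attached tika-config.xml):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Keep the default parsers, but hand PDFs to a configured PDFParser -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="extractInlineImages" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
```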


[jira] [Commented] (TIKA-4228) Tika parser crashes JVM when it gets metadata and embedded objects from pdf

2024-03-27 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831518#comment-17831518
 ] 

Tim Allison commented on TIKA-4228:
---

As I think about it, this code wouldn't extract all of the embedded images in 
the PDF...so that's not a concern...you'd have to turn on extractInlineImages.

I can run getMetadata() with -Xmx256m with no problems.


[jira] [Commented] (TIKA-4228) Tika parser crashes JVM when it gets metadata and embedded objects from pdf

2024-03-27 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831513#comment-17831513
 ] 

Tim Allison commented on TIKA-4228:
---

I was able to extract the images with the -z option and the text+metadata with 
-J -t, both with tika-app, and both with -Xmx256m with Tika 2.9.0... if that 
means anything.
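Those tika-app invocations look roughly like this (the jar name, output directory, and input path are placeholders for illustration, not taken from the thread):

```shell
# Extract embedded/attached files (-z), writing them to a directory
java -Xmx256m -jar tika-app-2.9.0.jar -z --extract-dir=/tmp/tika_out sample.pdf

# Emit recursive JSON metadata (-J) with plain-text content (-t)
java -Xmx256m -jar tika-app-2.9.0.jar -J -t sample.pdf
```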


[jira] [Commented] (TIKA-4228) Tika parser crashes JVM when it gets metadata and embedded objects from pdf

2024-03-27 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831512#comment-17831512
 ] 

Tim Allison commented on TIKA-4228:
---

I'm guessing "crashing" means an OOM?

How much memory are you giving your process? How many threads run in that 
amount of memory?

Can you share the stacktraces? Can you repro this with the -z or "-J -t" 
options with tika-app?

There are a lot of images, but they're all small.


[jira] [Commented] (TIKA-4152) Fix tika as a service

2024-03-26 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831017#comment-17831017
 ] 

Tim Allison commented on TIKA-4152:
---

[~epugh] any chance you might have a chance to look into this? Totally 
understand if not. Thank you.

> Fix tika as a service
> -
>
> Key: TIKA-4152
> URL: https://issues.apache.org/jira/browse/TIKA-4152
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Minor
>
> We've gotten two reports on the user list in the last month or so on the tika 
> as a service scripts no longer working.
> We should fix this.
> https://lists.apache.org/thread/mnf3pxlmvdy456v4s2b8r7mv3khl3msk
> https://lists.apache.org/thread/ozkrrvbwc0bvqmqb9zc4xofhnd3djqz1



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

