[GitHub] [tika] dependabot[bot] opened a new pull request #527: Bump dl4j.version from 1.0.0-M1.1 to 1.0.0-M2

2022-03-09 Thread GitBox


dependabot[bot] opened a new pull request #527:
URL: https://github.com/apache/tika/pull/527


   Bumps `dl4j.version` from 1.0.0-M1.1 to 1.0.0-M2.
   Updates `datavec-data-image` from 1.0.0-M1.1 to 1.0.0-M2
   
   Updates `deeplearning4j-zoo` from 1.0.0-M1.1 to 1.0.0-M2
   
   Updates `deeplearning4j-modelimport` from 1.0.0-M1.1 to 1.0.0-M2
   
   Updates `deeplearning4j-nn` from 1.0.0-M1.1 to 1.0.0-M2
   
   Updates `nd4j-api` from 1.0.0-M1.1 to 1.0.0-M2
   
   Updates `nd4j-native-platform` from 1.0.0-M1.1 to 1.0.0-M2
   
   
   Dependabot will resolve any conflicts with this PR as long as you don't
   alter it yourself. You can also trigger a rebase manually by commenting
   `@dependabot rebase`.
   
   
   ---
   
   
   Dependabot commands and options
   
   
   You can trigger Dependabot actions by commenting on this PR:
   - `@dependabot rebase` will rebase this PR
   - `@dependabot recreate` will recreate this PR, overwriting any edits that
     have been made to it
   - `@dependabot merge` will merge this PR after your CI passes on it
   - `@dependabot squash and merge` will squash and merge this PR after your CI
     passes on it
   - `@dependabot cancel merge` will cancel a previously requested merge and
     block automerging
   - `@dependabot reopen` will reopen this PR if it is closed
   - `@dependabot close` will close this PR and stop Dependabot recreating it.
     You can achieve the same result by closing it manually
   - `@dependabot ignore this major version` will close this PR and stop
     Dependabot creating any more for this major version (unless you reopen the
     PR or upgrade to it yourself)
   - `@dependabot ignore this minor version` will close this PR and stop
     Dependabot creating any more for this minor version (unless you reopen the
     PR or upgrade to it yourself)
   - `@dependabot ignore this dependency` will close this PR and stop
     Dependabot creating any more for this dependency (unless you reopen the PR
     or upgrade to it yourself)
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (TIKA-3684) Extract text returns the text multiple times

2022-03-09 Thread Naama Hophstatder (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17504036#comment-17504036
 ] 

Naama Hophstatder commented on TIKA-3684:
-

I don't know how I should configure the service, as I'm running it locally, not 
in a Docker container.

The docs only cover 2.0, so can you help us configure a local 1.24 
tika-server as a Linux service?

> Extract text returns the text multiple times
> --------------------------------------------
>
>                 Key: TIKA-3684
>                 URL: https://issues.apache.org/jira/browse/TIKA-3684
>             Project: Tika
>          Issue Type: Bug
>          Components: docker
>    Affects Versions: 2.1.0
>            Reporter: Naama Hophstatder
>            Priority: Major
>         Attachments: example.docx, example.json, tika-config-no-xmf.xml
>
>
> We are using the Tika Docker container as a Linux service. When I extract 
> text from a Word document, e.g.:
> curl -T example.docx http://localhost:9998/tika --header "Accept: text/plain"
> we get the text 3 times.
> Note: We also have tika server v1.14, and this version returns the text 
> just as expected.
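
For anyone reproducing the report from Java rather than curl, the request above can be sketched with the JDK 11+ HTTP client. This is only an illustration of the same PUT call; the class and method names (`TikaTextRequest`, `build`, `extractText`) are made up for this sketch and are not part of Tika.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

/** Sketch of the curl call from the report; names here are hypothetical. */
final class TikaTextRequest {

    /** Builds the equivalent of: curl -T doc endpoint --header "Accept: text/plain" */
    static HttpRequest build(URI tikaEndpoint, Path document) throws FileNotFoundException {
        return HttpRequest.newBuilder(tikaEndpoint)
                .header("Accept", "text/plain")
                // curl -T uploads the file with an HTTP PUT
                .PUT(HttpRequest.BodyPublishers.ofFile(document))
                .build();
    }

    /** Sends the request and returns the response body; needs a running tika-server. */
    static String extractText(URI tikaEndpoint, Path document)
            throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response =
                client.send(build(tikaEndpoint, document), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

Calling `TikaTextRequest.extractText(URI.create("http://localhost:9998/tika"), Path.of("example.docx"))` against both server versions would make it easy to compare the two outputs side by side.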



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (TIKA-3694) Tika Server endpoint to return more details on a mime type

2022-03-09 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503774#comment-17503774
 ] 

Hudson commented on TIKA-3694:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #483 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/483/])
TIKA-3694 - trivial checkstyle fixes (tallison: 
[https://github.com/apache/tika/commit/43b1284cf74467a94e3064108ab661fdcfa4d8a0])
* (edit) 
tika-server/tika-server-core/src/test/java/org/apache/tika/server/core/TikaMimeTypesTest.java
* (edit) 
tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/resource/TikaMimeTypes.java


> Tika Server endpoint to return more details on a mime type
> ----------------------------------------------------------
>
>                 Key: TIKA-3694
>                 URL: https://issues.apache.org/jira/browse/TIKA-3694
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime, server
>    Affects Versions: 2.3.0
>            Reporter: Nick Burch
>            Assignee: Tim Allison
>            Priority: Major
>             Fix For: 2.3.1
>
>
> As raised on the user list - 
> [https://lists.apache.org/thread/mhtj6cgf323525hs6dow1oz68nkqwfgy] - users 
> calling the Java APIs are able to get additional details on a mime type, such 
> as common extensions and descriptions. Those calling the Tika Server can only 
> get limited information on mime types, such as which are known to Tika.
> In addition to the current {{/mime-types}} endpoint (html/json/text), we 
> should add a more detailed one that takes a specific type.





[jira] [Created] (TIKA-3697) Add parser for warc files

2022-03-09 Thread Tim Allison (Jira)
Tim Allison created TIKA-3697:
-

             Summary: Add parser for warc files
                 Key: TIKA-3697
                 URL: https://issues.apache.org/jira/browse/TIKA-3697
             Project: Tika
          Issue Type: Task
            Reporter: Tim Allison


netpreserve's jwarc is ASL 2.0, fairly small, and has no dependencies.

Should we add this into tika-parsers-standard or create a separate package for 
it in tika-parsers-extended?





[jira] [Created] (TIKA-3696) Add detection for wacz files

2022-03-09 Thread Tim Allison (Jira)
Tim Allison created TIKA-3696:
-

             Summary: Add detection for wacz files
                 Key: TIKA-3696
                 URL: https://issues.apache.org/jira/browse/TIKA-3696
             Project: Tika
          Issue Type: Task
            Reporter: Tim Allison


https://webrecorder.github.io/wacz-spec/1.2.0/

Zip file with standard entries: 'archive', 'datapackage.json', 'indexes' and 
'pages'.
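
As a rough illustration of what such a check might look like, the sketch below scans zip entry names with the JDK's `ZipInputStream` and reports a match once all four top-level names have been seen. It is not Tika's detector API. The class name `WaczSniffer` and the decision to require all four names are assumptions made for this example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.HashSet;
import java.util.Set;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical WACZ heuristic; not part of Tika. */
final class WaczSniffer {

    // Top-level names listed in the issue; treating all four as required is an assumption.
    private static final Set<String> EXPECTED =
            Set.of("archive", "datapackage.json", "indexes", "pages");

    /** Returns true once every expected top-level name has appeared in the zip. */
    static boolean looksLikeWacz(InputStream in) throws IOException {
        Set<String> seen = new HashSet<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                int slash = name.indexOf('/');
                // Keep only the first path segment, e.g. "archive/data.warc.gz" -> "archive"
                String top = slash < 0 ? name : name.substring(0, slash);
                if (EXPECTED.contains(top)) {
                    seen.add(top);
                }
                if (seen.size() == EXPECTED.size()) {
                    return true; // stop early, no need to read the rest of the archive
                }
            }
        }
        return false;
    }
}
```

Streaming the entries keeps the check cheap for large archives, at the cost of depending on entry order: a file whose marker entries come last is still read to the end.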





[jira] [Commented] (TIKA-3695) LimitingMetadataFilter

2022-03-09 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503723#comment-17503723
 ] 

Tim Allison commented on TIKA-3695:
---

Maybe extend Metadata as SecureMetadata and put the logic there with 
configuration via AutoDetectParser?  That won't work easily because parsers 
create new Metadata objects for embedded files. 

Do we need to pass around a MetadataFactory or MetadataWriter?  That'll still 
require a bunch of changes, but would be cleaner?
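
To make the factory idea concrete, here is a minimal sketch in which a parser asks a supplier for its metadata container instead of calling a constructor, so metadata for embedded files gets the same limits. Everything here is hypothetical: `SketchMetadata`, `CappedMetadata`, and `SketchContainerParser` are stand-ins, not Tika classes. The cap on field count is just the simplest limit to show.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Stand-in for a metadata container; not org.apache.tika.metadata.Metadata. */
class SketchMetadata {
    protected final Map<String, String> values = new LinkedHashMap<>();

    void set(String name, String value) {
        values.put(name, value);
    }

    int size() {
        return values.size();
    }
}

/** Hypothetical subclass that ignores new names once maxFields is reached. */
class CappedMetadata extends SketchMetadata {
    private final int maxFields;

    CappedMetadata(int maxFields) {
        this.maxFields = maxFields;
    }

    @Override
    void set(String name, String value) {
        // Existing names may be overwritten; new names are only accepted below the cap.
        if (values.containsKey(name) || values.size() < maxFields) {
            super.set(name, value);
        }
    }
}

/** Hypothetical parser that obtains metadata from a factory rather than `new`. */
final class SketchContainerParser {
    private final Supplier<SketchMetadata> metadataFactory;

    SketchContainerParser(Supplier<SketchMetadata> metadataFactory) {
        this.metadataFactory = metadataFactory;
    }

    /** Simulates handling one embedded file that carries fieldCount metadata fields. */
    SketchMetadata parseEmbedded(int fieldCount) {
        SketchMetadata embedded = metadataFactory.get();
        for (int i = 0; i < fieldCount; i++) {
            embedded.set("field-" + i, "value");
        }
        return embedded;
    }
}
```

With `new SketchContainerParser(() -> new CappedMetadata(3))`, every embedded file is capped without the parser knowing about limits. The cost in a real codebase is that every `new Metadata()` call site would have to change.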

> LimitingMetadataFilter
> ----------------------
>
>                 Key: TIKA-3695
>                 URL: https://issues.apache.org/jira/browse/TIKA-3695
>             Project: Tika
>          Issue Type: New Feature
>          Components: metadata
>    Affects Versions: 1.28.1, 2.3.0
>            Reporter: Julien Massiera
>            Priority: Major
>
> Some files may contain abnormally large metadata (several MB, whether in the 
> metadata values, the metadata names, or the total amount of metadata), which 
> can be a problem for memory consumption.
> It would be great to develop a new LimitingMetadataFilter so that we can 
> filter out metadata according to different byte limits (on metadata 
> names, metadata values, and the total amount of metadata).





[jira] [Resolved] (TIKA-3694) Tika Server endpoint to return more details on a mime type

2022-03-09 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-3694.
---
Resolution: Fixed

trivial fixes made.  Thank you, again, [~nick]!



[jira] [Reopened] (TIKA-3694) Tika Server endpoint to return more details on a mime type

2022-03-09 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison reopened TIKA-3694:
---
  Assignee: Tim Allison

Fixing checkstyle issues



[jira] [Comment Edited] (TIKA-3695) LimitingMetadataFilter

2022-03-09 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503661#comment-17503661
 ] 

Tim Allison edited comment on TIKA-3695 at 3/9/22, 4:01 PM:


On the list, I suggested implementing this as a MetadataFilter.  These are used 
by the RecursiveParserWrapper (e.g. /rmeta in tika-server or '-J' option in 
tika-app), and they are triggered after the parse of the file. 

On further thought, I'm not sure this is the best option for two reasons.  The 
metadata would be extracted and stored in the Metadata object within Tika, but 
then truncated/removed before returning the data to the user...so the data will 
still be in memory and will consume Tika memory resources until the file's 
parsing is finished.  Also, this solution will not play well with the 
traditional xhtml output of /tika where whatever is in the metadata object is 
written when the parser hits the first bit of content text, not after the parse 
has concluded.

I'm wondering if we need to put these protections deeper into the Metadata 
object itself so that it isn't storing this info and then removing it.  

If a parser tries to write too much metadata, do we throw a WriteLimitException 
and stop parsing, or do we keep parsing but add a "metadata truncation" flag to 
the metadata object?  I'd be inclined to the latter.

How do we parameterize the limits? New parameters on AutoDetectParser?  New 
MetadataWriter class (yikes...).

Thoughts?
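
One way to picture the "keep parsing but flag it" option is a container that checks byte limits at write time, drops anything that does not fit, and records that it did so. The sketch below is only that picture. `LimitedMetadataSketch` and its three limits are hypothetical, it stores single values rather than Tika's multi-valued metadata, and it drops oversized entries rather than shortening them.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical write-time limiter; not part of Tika's API. */
final class LimitedMetadataSketch {
    private final Map<String, String> values = new LinkedHashMap<>();
    private final int maxNameBytes;
    private final int maxValueBytes;
    private final long maxTotalBytes;
    private long totalBytes;
    private boolean truncated;

    LimitedMetadataSketch(int maxNameBytes, int maxValueBytes, long maxTotalBytes) {
        this.maxNameBytes = maxNameBytes;
        this.maxValueBytes = maxValueBytes;
        this.maxTotalBytes = maxTotalBytes;
    }

    /** Stores the pair only if every limit holds; otherwise drops it and sets the flag. */
    void add(String name, String value) {
        int nameBytes = utf8Length(name);
        int valueBytes = utf8Length(value);
        String previous = values.get(name);
        // Replacing a value only changes the total by the difference in value size.
        long delta = previous == null
                ? (long) nameBytes + valueBytes
                : (long) valueBytes - utf8Length(previous);
        if (nameBytes > maxNameBytes
                || valueBytes > maxValueBytes
                || totalBytes + delta > maxTotalBytes) {
            truncated = true; // parsing can continue; the caller can inspect the flag later
            return;
        }
        values.put(name, value);
        totalBytes += delta;
    }

    String get(String name) {
        return values.get(name);
    }

    boolean isTruncated() {
        return truncated;
    }

    long totalBytes() {
        return totalBytes;
    }

    private static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }
}
```

Because the check happens before anything is stored, the oversized data is never retained by the container, which is the memory concern raised above. Throwing an exception instead would be a one-line change at the point where the flag is set.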


was (Author: talli...@mitre.org):
On the list, I suggested implementing this as a MetadataFilter.  These are used 
by the RecursiveParserWrapper (e.g. /rmeta in tika-server or '-J' option in 
tika-app).  They are triggered after the parse of the file. 

On further thought, I'm not sure this is the best option for two reasons.  The 
metadata would be extracted and stored in the Metadata object within Tika, but 
then truncated/removed before returning the data to the user...so the data will 
still be in memory and will consume Tika memory resources until the file's 
parsing is finished.  Also, this solution will not play well with the 
traditional xhtml output of /tika where whatever is in the metadata object is 
written when the parser hits the first bit of content text, not after the parse 
has concluded.

I'm wondering if we need to put these protections deeper into the Metadata 
object itself so that it isn't storing this info and then removing it.  

If a parser tries to write too much metadata, do we throw a WriteLimitException 
and stop parsing, or do we keep parsing but add a "metadata truncation" flag to 
the metadata object?  I'd be inclined to the latter.

Thoughts?  How do we configure it...new parameters on AutoDetectParser? 



[jira] [Comment Edited] (TIKA-3695) LimitingMetadataFilter

2022-03-09 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503661#comment-17503661
 ] 

Tim Allison edited comment on TIKA-3695 at 3/9/22, 3:47 PM:


On the list, I suggested implementing this as a MetadataFilter.  These are used 
by the RecursiveParserWrapper (e.g. /rmeta in tika-server or '-J' option in 
tika-app).  They are triggered after the parse of the file. 

On further thought, I'm not sure this is the best option for two reasons.  The 
metadata would be extracted and stored in the Metadata object within Tika, but 
then truncated/removed before returning the data to the user...so the data will 
still be in memory and will consume Tika memory resources until the file's 
parsing is finished.  Also, this solution will not play well with the 
traditional xhtml output of /tika where whatever is in the metadata object is 
written when the parser hits the first bit of content text, not after the parse 
has concluded.

I'm wondering if we need to put these protections deeper into the Metadata 
object itself so that it isn't storing this info and then removing it.  

If a parser tries to write too much metadata, do we throw a WriteLimitException 
and stop parsing, or do we keep parsing but add a "metadata truncation" flag to 
the metadata object?  I'd be inclined to the latter.

Thoughts?  How do we configure it...new parameters on AutoDetectParser? 


was (Author: talli...@mitre.org):
On the list, I suggested implementing this as a MetadataFilter.  These are used 
by the RecursiveParserWrapper (e.g. /rmeta in tika-server or '-J' option in 
tika-app).  They are triggered after the parse of the file. 

On further thought, I'm not sure this is the best option for two reasons.  The 
metadata would be extracted and stored in the Metadata object within Tika, but 
then truncated/removed before returning the data to the user...so the data will 
still be in memory and will consume Tika memory resources until the file's 
parsing is finished.  Also, this solution will not play well with the 
traditional xhtml output of /tika where whatever is in the metadata object is 
written when the parser hits the first bit of content text, not after the parse 
has concluded.

I'm wondering if we need to put these protections deeper into the Metadata 
object itself so that it isn't storing this info and then removing it.

Thoughts?  How do we configure it...new parameters on AutoDetectParser? 



[GitHub] [tika] tballison merged pull request #526: Bump aws.version from 1.12.169 to 1.12.174

2022-03-09 Thread GitBox


tballison merged pull request #526:
URL: https://github.com/apache/tika/pull/526


   






[GitHub] [tika] tballison merged pull request #525: Bump gson from 2.8.9 to 2.9.0

2022-03-09 Thread GitBox


tballison merged pull request #525:
URL: https://github.com/apache/tika/pull/525


   






[GitHub] [tika] tballison commented on pull request #525: Bump gson from 2.8.9 to 2.9.0

2022-03-09 Thread GitBox


tballison commented on pull request #525:
URL: https://github.com/apache/tika/pull/525#issuecomment-1063055561


   Ah, TIKA-3694. https://github.com/apache/tika/actions/runs/1948546338






[jira] [Commented] (TIKA-3694) Tika Server endpoint to return more details on a mime type

2022-03-09 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503672#comment-17503672
 ] 

Tim Allison commented on TIKA-3694:
---

Looks like checkstyle is not happy with these commits.



[jira] [Comment Edited] (TIKA-3694) Tika Server endpoint to return more details on a mime type

2022-03-09 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503672#comment-17503672
 ] 

Tim Allison edited comment on TIKA-3694 at 3/9/22, 3:41 PM:


[~nick]. Thank you for doing this!  Looks like checkstyle is not happy with 
these commits.


was (Author: talli...@mitre.org):
Looks like checkstyle is not happy with these commits.



[GitHub] [tika] tballison commented on pull request #525: Bump gson from 2.8.9 to 2.9.0

2022-03-09 Thread GitBox


tballison commented on pull request #525:
URL: https://github.com/apache/tika/pull/525#issuecomment-1063052166


   I have no idea how checkstyle failed on this one...






[GitHub] [tika] tballison merged pull request #524: Bump jackson.version from 2.13.1 to 2.13.2

2022-03-09 Thread GitBox


tballison merged pull request #524:
URL: https://github.com/apache/tika/pull/524


   






[jira] [Comment Edited] (TIKA-3695) LimitingMetadataFilter

2022-03-09 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503661#comment-17503661
 ] 

Tim Allison edited comment on TIKA-3695 at 3/9/22, 3:36 PM:


On the list, I suggested implementing this as a MetadataFilter.  These are used 
by the RecursiveParserWrapper (e.g. /rmeta in tika-server or '-J' option in 
tika-app).  They are triggered after the parse of the file. 

On further thought, I'm not sure this is the best option for two reasons.  The 
metadata would be extracted and stored in the Metadata object within Tika, but 
then truncated/removed before returning the data to the user...so the data will 
still be in memory and will consume Tika memory resources until the file's 
parsing is finished.  Also, this solution will not play well with the 
traditional xhtml output of /tika where whatever is in the metadata object is 
written when the parser hits the first bit of content text, not after the parse 
has concluded.

I'm wondering if we need to put these protections deeper into the Metadata 
object itself so that it isn't storing this info and then removing it.

Thoughts?  How do we configure it...new parameters on AutoDetectParser? 


was (Author: talli...@mitre.org):
On the list, I suggested implementing this as a MetadataFilter.  These are used 
by the RecursiveParserWrapper (e.g. /rmeta in tika-server or '-J' option in 
tika-app).  They are triggered after the parse of the file. 

In this case, the metadata would be extracted and stored in the Metadata object 
within Tika, but then truncated/removed before returning the data to the user.  
This solution will not play well with the traditional xhtml output of /tika 
where whatever is in the metadata object is written when the parser hits the 
first bit of content text.

I'm wondering if we need to put these protections deeper into the Metadata 
object itself so that it isn't storing this info and then removing it.

Thoughts?  How do we configure it...new parameters on AutoDetectParser?  



[jira] [Commented] (TIKA-3695) LimitingMetadataFilter

2022-03-09 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503661#comment-17503661
 ] 

Tim Allison commented on TIKA-3695:
---

On the list, I suggested implementing this as a MetadataFilter.  These are used 
by the RecursiveParserWrapper (e.g. /rmeta in tika-server or '-J' option in 
tika-app).  They are triggered after the parse of the file. 

In this case, the metadata would be extracted and stored in the Metadata object 
within Tika, but then truncated/removed before returning the data to the user.  
This solution will not play well with the traditional xhtml output of /tika 
where whatever is in the metadata object is written when the parser hits the 
first bit of content text.

I'm wondering if we need to put these protections deeper into the Metadata 
object itself so that it isn't storing this info and then removing it.

Thoughts?  How do we configure it...new parameters on AutoDetectParser?  



[jira] [Commented] (TIKA-3684) Extract text returns the text multiple times

2022-03-09 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503637#comment-17503637
 ] 

Tim Allison commented on TIKA-3684:
---

That configuration should work with 1.24 as well.  Is it not working for you?



[jira] [Commented] (TIKA-3684) Extract text returns the text multiple times

2022-03-09 Thread Naama Hophstatder (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503397#comment-17503397
 ] 

Naama Hophstatder commented on TIKA-3684:
-

Hi [~tallison], could you help us apply the config file you attached to our 
production tika-server?

We use version 1.24 as a Linux service, and I'm not sure of the correct way to 
do it.

Thanks again!
