from:"Nick Burch \(JIRA\)"

[jira] [Commented] (TIKA-4249) EML file is treating it as text file in 2.9.2 version

2024-04-30 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-4249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842403#comment-17842403
 ] 

Nick Burch commented on TIKA-4249:
--

I'd probably say we change the 0="From:" match into 0="From:" or 
0=(UTF-8-BOM)"From:"; that should be a little less likely to have false 
positives.

First time I've come across a Byte Order Mark at the start of an email file 
though!
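
To make the idea concrete, here is a minimal sketch of "match From: at offset 0, or a UTF-8 BOM followed by From:". It is plain Java rather than a Tika mime magic definition, and the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class EmlMagicSketch {
    // UTF-8 byte order mark: EF BB BF
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};
    private static final byte[] FROM = "From:".getBytes(StandardCharsets.US_ASCII);

    /** True if the bytes of pattern appear in data starting exactly at offset. */
    static boolean startsWithAt(byte[] data, int offset, byte[] pattern) {
        if (offset < 0 || data.length - offset < pattern.length) {
            return false;
        }
        for (int i = 0; i < pattern.length; i++) {
            if (data[offset + i] != pattern[i]) {
                return false;
            }
        }
        return true;
    }

    /** Matches "From:" at offset 0, or a UTF-8 BOM at offset 0 followed directly by "From:". */
    static boolean looksLikeEmailStart(byte[] data) {
        if (startsWithAt(data, 0, FROM)) {
            return true;
        }
        return startsWithAt(data, 0, UTF8_BOM) && startsWithAt(data, UTF8_BOM.length, FROM);
    }
}
```

Spelling out the two allowed prefixes keeps the match anchored to the start of the file, instead of accepting "From:" anywhere in the first few bytes.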

> EML file is treating it as text file in 2.9.2 version
> -
>
> Key: TIKA-4249
> URL: https://issues.apache.org/jira/browse/TIKA-4249
> Project: Tika
>  Issue Type: Bug
>Reporter: Tika User
>Priority: Blocker
> Attachments: Email_Received.txt
>
>
> We recently upgraded from 2.9.0 to 2.9.2 and found that the attached 
> file is treated as a text file instead of an email file. Please look into this 
> issue.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (TIKA-4223) STL file exported with OpenSCAD not detected correctly

2024-03-26 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830867#comment-17830867
 ] 

Nick Burch commented on TIKA-4223:
--

A lot of the early file extension allocations were taken from the HTTPD mime 
magics, which for obscure formats are unlikely to be representative of use 
today. So, for something like this, I'm +1 to moving the glob to a more 
common/popular format that also shares the same extension.

> STL file exported with OpenSCAD not detected correctly
> --
>
> Key: TIKA-4223
> URL: https://issues.apache.org/jira/browse/TIKA-4223
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 2.9.1
>Reporter: Robin Schimpf
>Priority: Major
> Attachments: linear_extrude_ascii.stl, linear_extrude_binary.stl
>
>
> STL files can be in ASCII or in binary format. When this file 
> ([https://github.com/openscad/openscad/blob/master/examples/Basics/linear_extrude.scad])
>  is exported to STL with OpenSCAD, the ASCII result file is detected as text/plain.
> Also, the binary STL is detected as application/vnd.ms-pki.stl, which differs 
> from the model/stl MIME type that Wikipedia lists for those files.
>  
> Used commands for attached files
> {code:java}
> openscad.exe --export-format asciistl -o result\linear_extrude_ascii.stl examples\Basics\linear_extrude.scad
> {code}
> {code:java}
> openscad.exe --export-format binstl -o result\linear_extrude_binary.stl examples\Basics\linear_extrude.scad
> {code}
> Refs:
> https://en.wikipedia.org/wiki/STL_(file_format)
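
For readers unfamiliar with the two STL flavours mentioned in the report, the distinction can be sketched roughly as follows. This is an illustration only, not Tika's detection logic. It relies on the commonly documented layout: an ASCII STL begins with the keyword "solid", and a binary STL has an 80-byte header, a little-endian 32-bit triangle count, then 50 bytes per triangle. The class name is hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

class StlSniffSketch {
    enum Kind { ASCII, BINARY, UNKNOWN }

    static Kind sniff(byte[] data) {
        // Binary check first: some binary files also put "solid" in their header,
        // so an exact size match (84 + 50 * triangles) is the stronger hint.
        if (data.length >= 84) {
            long triangles = ByteBuffer.wrap(data, 80, 4)
                    .order(ByteOrder.LITTLE_ENDIAN)
                    .getInt() & 0xFFFFFFFFL;
            if (84L + 50L * triangles == data.length) {
                return Kind.BINARY;
            }
        }
        // ASCII check: the text begins with "solid" (leading whitespace is not handled here).
        String head = new String(data, 0, Math.min(data.length, 5), StandardCharsets.US_ASCII);
        if (head.equals("solid")) {
            return Kind.ASCII;
        }
        return Kind.UNKNOWN;
    }
}
```

The size check needs the whole file, which a magic-based detector that only sees a prefix cannot do; that is one reason the binary form is harder to detect by content.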




[jira] [Commented] (TIKA-4210) Not able to identify tika extension

2024-03-14 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-4210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827017#comment-17827017
 ] 

Nick Burch commented on TIKA-4210:
--

The attached file seems to be an RTF file. I'm not sure what a ".mega 
attachment" is, but this file doesn't seem to be one of them...

tika-app-2.9.1.jar is able to correctly identify this file as RTF

> Not able to identify tika extension
> ---
>
> Key: TIKA-4210
> URL: https://issues.apache.org/jira/browse/TIKA-4210
> Project: Tika
>  Issue Type: Bug
>Reporter: Tika User
>Priority: Major
> Attachments: sample.DOC
>
>
> Hi Team,
> The attached embedded file contains .mega attachments whose extension Tika is not able 
> to identify. Tried in Tika versions 2.9.0 and 2.9.1; it is still 
> showing as empty. Please look into this.




[jira] [Commented] (TIKA-4208) OOM error in SAS7BDATParser

2024-03-09 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824965#comment-17824965
 ] 

Nick Burch commented on TIKA-4208:
--

I would expect that the json output version would need a bit more memory, as 
we'll have to hold all the content in memory before outputting instead of just 
streaming the text/html out as we go along. I wouldn't expect it to be 4gb vs 
32gb though!

Any ideas anyone? Is it possible we've got an extra layer (or 2?) of buffering 
above and beyond what we need for the {{-J}} option?
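
As a rough illustration of the buffering versus streaming difference (a toy sketch, not Tika's handler stack; the class names are invented), compare a sink that keeps every character with one that lets them pass through:

```java
import java.io.IOException;
import java.io.Writer;

class BufferVsStreamSketch {
    /** Stand-in for a streaming sink: counts characters and retains none of them. */
    static final class CountingWriter extends Writer {
        long count;

        @Override
        public void write(char[] buf, int off, int len) {
            count += len;
        }

        @Override
        public void flush() {
        }

        @Override
        public void close() {
        }
    }

    /** Writes the same chunk repeatedly, much as a parser emits repeated characters() events. */
    static void emit(Writer out, int chunks, String chunk) throws IOException {
        for (int i = 0; i < chunks; i++) {
            out.write(chunk);
        }
    }
}
```

Passing a `java.io.StringWriter` to `emit` retains every character until the end, and its backing array has to be reallocated and copied as it grows, which is the `Arrays.copyOf` frame at the top of the reported stack trace. Passing a `CountingWriter` keeps memory flat however much text goes through.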

> OOM error in SAS7BDATParser
> ---
>
> Key: TIKA-4208
> URL: https://issues.apache.org/jira/browse/TIKA-4208
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 3.0.0-BETA
>Reporter: Gregory Lepore
>Priority: Minor
>
> For this ARC file:
> [https://eotarchive.s3.amazonaws.com/crawl-data/EOT-2004/segments/NARA-000/warc/NARA-PEOT-2004-20041019023240-02598-crawling008-c_NARA-PEOT-2004-20041019053819-01693-crawling007.archive.org.arc.gz]
> I'm getting an OOM error:
> Exception in thread "main" java.lang.OutOfMemoryError: Requested array size 
> exceeds VM limit 
>    at java.base/java.util.Arrays.copyOf(Arrays.java:3537) 
>    at 
> java.base/java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:228)
>  
>    at 
> java.base/java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:740)
>  
>    at java.base/java.lang.StringBuffer.append(StringBuffer.java:410) 
>    at java.base/java.io.StringWriter.write(StringWriter.java:99) 
>    at 
> org.apache.tika.sax.ToTextContentHandler.characters(ToTextContentHandler.java:96)
>  
>    at 
> org.apache.tika.sax.ToXMLContentHandler.writeEscaped(ToXMLContentHandler.java:229)
>  
>    at 
> org.apache.tika.sax.ToXMLContentHandler.characters(ToXMLContentHandler.java:154)
>  
>    at 
> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:143)
>  
>    at 
> org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:253)
>  
>    at 
> org.apache.tika.parser.RecursiveParserWrapper$RecursivelySecureContentHandler.characters(RecursiveParserWrapper.java:370)
>  
>    at 
> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:143)
>  
>    at 
> org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:253)
>  
>    at 
> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:143)
>  
>    at 
> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:143)
>  
>    at 
> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:143)
>  
>    at 
> org.apache.tika.sax.SafeContentHandler.access$101(SafeContentHandler.java:47) 
>    at 
> org.apache.tika.sax.SafeContentHandler.lambda$new$0(SafeContentHandler.java:57)
>  
>    at 
> org.apache.tika.sax.SafeContentHandler$$Lambda$327/0x7f94a022d1a8.write(Unknown
>  Source) 
>    at 
> org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:106) 
>    at 
> org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:250)
>  
>    at 
> org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:270)
>  
>    at 
> org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:295)
>  
>    at 
> org.apache.tika.parser.sas.SAS7BDATParser.parse(SAS7BDATParser.java:146) 
>    at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) 
>    at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) 
>    at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203) 
>    at 
> org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:153) 
>    at 
> org.apache.tika.parser.RecursiveParserWrapper$EmbeddedParserDecorator.parse(RecursiveParserWrapper.java:259)
>  
>    at 
> org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:71) 
>    at 
> org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:109)
>  
>    at 
> org.apache.tika.parser.pkg.PackageParser.parseEntry(PackageParser.java:455)
> when extracting JSON with both the app and server version of 3.0.0 BETA.




[jira] [Commented] (TIKA-4208) OOM error in SAS7BDATParser

2024-03-08 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824874#comment-17824874
 ] 

Nick Burch commented on TIKA-4208:
--

How much heap size do you have allocated?

The error suggests that Tika managed to decode the string in the SAS data file, 
but ran out of memory passing the string through the content handler stack to 
plain text. Generally things break at the decode step if they're going to, 
rather than the output!
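
If it helps when answering the heap question, the limit the JVM is actually running with can be printed with a few lines of standard Java (the class name here is just for illustration):

```java
class HeapInfoSketch {
    /** Maximum heap the JVM will attempt to use, in MiB; this reflects -Xmx when it is set. */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
    }
}
```

Running this with the same JVM options as the failing command shows whether the intended limit was actually applied.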

> OOM error in SAS7BDATParser
> ---
>
> Key: TIKA-4208
> URL: https://issues.apache.org/jira/browse/TIKA-4208
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 3.0.0-BETA
>Reporter: Gregory Lepore
>Priority: Minor
>




[jira] [Commented] (TIKA-3784) Detector returns "application/x-x509-key" when scanning a .p12 file

2024-02-12 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816788#comment-17816788
 ] 

Nick Burch commented on TIKA-3784:
--

From [https://datatracker.ietf.org/doc/rfc7292/] it looks like PKCS12 is based 
on PKCS7, so that's expected. There are a few more types defined in 
[https://www.rfc-editor.org/rfc/rfc7292.html#appendix-D] - not sure if we can 
find any of those to match on?

Though [https://www.cs.auckland.ac.nz/~pgut001/pubs/pfx.html] does suggest 
this isn't an ideal format...
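
One possible direction, sketched here as an assumption rather than as anything Tika currently does: per RFC 7292 a PFX is a SEQUENCE holding version INTEGER 3 followed by a PKCS7 ContentInfo, so the start of the file is an outer SEQUENCE, the bytes 02 01 03, another SEQUENCE, and then a PKCS7 content type OID. The sketch checks only the "data" OID (1.2.840.113549.1.7.1). A real matcher would also have to allow signedData, and whether this is enough to separate .p12 from other key formats is untested. The class and method names are hypothetical.

```java
class Pkcs12SniffSketch {
    // Encoding of the PKCS7 "data" OID 1.2.840.113549.1.7.1: tag 06, length 09, then the value.
    private static final int[] PKCS7_DATA_OID =
            {0x06, 0x09, 0x2A, 0x86, 0x48, 0x86, 0xF7, 0x0D, 0x01, 0x07, 0x01};

    /** Skips a BER/DER length field at pos; returns the index after it, or -1 if malformed. */
    static int skipLength(byte[] d, int pos) {
        if (pos >= d.length) {
            return -1;
        }
        int first = d[pos] & 0xFF;
        if (first <= 0x80) {
            return pos + 1; // short form, or 0x80 = indefinite length (contents follow)
        }
        int n = first & 0x7F; // long form: n further length bytes
        if (n > 4 || pos + 1 + n > d.length) {
            return -1;
        }
        return pos + 1 + n;
    }

    static boolean looksLikePfx(byte[] d) {
        if (d.length < 2 || (d[0] & 0xFF) != 0x30) {
            return false; // outer SEQUENCE expected
        }
        int pos = skipLength(d, 1);
        if (pos < 0 || pos + 3 > d.length) {
            return false;
        }
        // version INTEGER 3 is encoded as 02 01 03
        if (d[pos] != 0x02 || d[pos + 1] != 0x01 || d[pos + 2] != 0x03) {
            return false;
        }
        pos += 3;
        if (pos >= d.length || (d[pos] & 0xFF) != 0x30) {
            return false; // authSafe ContentInfo SEQUENCE expected
        }
        pos = skipLength(d, pos + 1);
        if (pos < 0 || pos + PKCS7_DATA_OID.length > d.length) {
            return false;
        }
        for (int i = 0; i < PKCS7_DATA_OID.length; i++) {
            if ((d[pos + i] & 0xFF) != PKCS7_DATA_OID[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Because the length fields vary in size, this kind of check does not map directly onto fixed-offset magic, which is part of why the format is awkward to match.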

> Detector returns "application/x-x509-key" when scanning a .p12 file
> ---
>
> Key: TIKA-3784
> URL: https://issues.apache.org/jira/browse/TIKA-3784
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 1.26
>Reporter: Matthias Hofbauer
>Priority: Critical
> Attachments: dump_p12s.txt
>
>
> We are using tika to check if the MIME type of the file extensions matches 
> with the MIME type of the file content.
> After our upgrade from tika-core 1.22 to 1.26 our logic does not work anymore 
> for certificates of type .p12, .pfx, .cer, .der.
> For the .p12 and .pfx extension the MIME type is "application/x-pkcs12" but 
> the tika detector returns "application/x-x509-key" instead.
> After checking the tika-mimetype.xml and comparing it to my .p12 file I found 
> the following MIME magic which explains why I got these types back.
> {code:xml}
> 
>     
>     
>     
>     
>     
>                      mask="0x00FC" offset="0"/>
>                      mask="0xFC" offset="0"/>
>     
>  {code}




[jira] [Commented] (TIKA-4148) Support Autodesk Inventor files (.ipt) (.iam) (.ipn) (.idw)

2023-11-19 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-4148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17787608#comment-17787608
 ] 

Nick Burch commented on TIKA-4148:
--

For detection of the OLE2 based files, we don't need to find unique byte 
combinations, we only need to find unique OLE2 entry names / sets of names

See 
[https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/detect/microsoft/POIFSContainerDetector.java#L362]
 for an example of "must have this then one of those"

If you can run POIFSLister (and/or POIFSDumper) on a bunch of files, and spot 
the entry names that are common (+ ideally not already in POIFSContainerDetector 
for other ones), that's what we need
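
The "must have this then one of those" rule boils down to a set check over the top-level entry names. A minimal sketch follows; it does not reproduce the POIFSContainerDetector code, and the class and parameter names are placeholders.

```java
import java.util.Set;

class EntryNameRuleSketch {
    /** True if entryNames contains the required name and at least one name from anyOf. */
    static boolean matches(Set<String> entryNames, String required, Set<String> anyOf) {
        if (!entryNames.contains(required)) {
            return false;
        }
        for (String candidate : anyOf) {
            if (entryNames.contains(candidate)) {
                return true;
            }
        }
        return false;
    }
}
```

The entry names to plug in are exactly what the POIFSLister output across a set of sample files would tell us.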

> Support Autodesk Inventor files (.ipt) (.iam) (.ipn) (.idw)
> ---
>
> Key: TIKA-4148
> URL: https://issues.apache.org/jira/browse/TIKA-4148
> Project: Tika
>  Issue Type: Improvement
>Reporter: Alexey Pismenskiy
>Priority: Major
>
> Add support for Autodesk Inventor files in Tika. 
> Examples of the files can be downloaded from 
> [https://www.autodesk.com/support/technical/article/caas/tsarticles/ts/3gnm93P9sPAWE6vndk7fjq.html]
> It would be great to start at least at the metadata level and then add 
> content parsing later. 
> I suspect it would be something similar to 
> [DWGParser|https://tika.apache.org/0.9/api/org/apache/tika/parser/dwg/DWGParser.html];
> any suggestions on where to start looking are appreciated. 




[jira] [Updated] (TIKA-4119) Return media type "text/javascript" instead of "application/javascript to follow RFC-9239

2023-09-05 Thread Nick Burch (Jira)



 [ 
https://issues.apache.org/jira/browse/TIKA-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Burch updated TIKA-4119:
-
Component/s: mime

> Return media type "text/javascript" instead of "application/javascript to 
> follow RFC-9239
> -
>
> Key: TIKA-4119
> URL: https://issues.apache.org/jira/browse/TIKA-4119
> Project: Tika
>  Issue Type: Improvement
>  Components: mime
>Reporter: Matthias Juchmes
>Priority: Major
>  Labels: tika-3x
>
> [RFC-9239|https://www.rfc-editor.org/rfc/rfc9239.html] obsoletes some 
> javascript media types, including "application/javascript", which is 
> currently returned by Tika for javascript files. "text/javascript" is defined 
> as the most widely supported one, so Tika should reflect this.




[jira] [Updated] (TIKA-4119) Return media type "text/javascript" instead of "application/javascript to follow RFC-9239

2023-09-05 Thread Nick Burch (Jira)



 [ 
https://issues.apache.org/jira/browse/TIKA-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Burch updated TIKA-4119:
-
Labels: tika-3x  (was: )

> Return media type "text/javascript" instead of "application/javascript to 
> follow RFC-9239
> -
>
> Key: TIKA-4119
> URL: https://issues.apache.org/jira/browse/TIKA-4119
> Project: Tika
>  Issue Type: Improvement
>Reporter: Matthias Juchmes
>Priority: Major
>  Labels: tika-3x
>
> [RFC-9239|https://www.rfc-editor.org/rfc/rfc9239.html] obsoletes some 
> javascript media types, including "application/javascript", which is 
> currently returned by Tika for javascript files. "text/javascript" is defined 
> as the most widely supported one, so Tika should reflect this.




[jira] [Commented] (TIKA-4119) Return media type "text/javascript" instead of "application/javascript to follow RFC-9239

2023-08-29 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17759921#comment-17759921
 ] 

Nick Burch commented on TIKA-4119:
--

I wonder if this is a big enough change around Detection that we ought to wait 
for 3.x to make it. Thoughts anyone?

(We already define {{text/javascript}} as an alias for the type, so users can 
already define parsers etc for the text variant, but swapping the canonical and 
the alias is going to break a lot of detection uses if people don't update)
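
To illustrate why swapping the canonical type and the alias is a breaking change, here is a toy alias table. It is not Tika's media type registry, and the class is invented for the example.

```java
import java.util.HashMap;
import java.util.Map;

class MediaTypeAliasSketch {
    private final Map<String, String> aliasToCanonical = new HashMap<>();

    void addAlias(String alias, String canonical) {
        aliasToCanonical.put(alias, canonical);
    }

    /** Resolves an alias to its canonical type; unknown names are returned unchanged. */
    String normalize(String type) {
        return aliasToCanonical.getOrDefault(type, type);
    }
}
```

With `text/javascript` registered as an alias of `application/javascript`, lookups by either name still resolve. But any caller that compares the detected type against the literal string `application/javascript` stops matching as soon as the canonical name becomes `text/javascript`, unless that caller is updated too.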

> Return media type "text/javascript" instead of "application/javascript to 
> follow RFC-9239
> -
>
> Key: TIKA-4119
> URL: https://issues.apache.org/jira/browse/TIKA-4119
> Project: Tika
>  Issue Type: Improvement
>Reporter: Matthias Juchmes
>Priority: Major
>
> [RFC-9239|https://www.rfc-editor.org/rfc/rfc9239.html] obsoletes some 
> javascript media types, including "application/javascript", which is 
> currently returned by Tika for javascript files. "text/javascript" is defined 
> as the most widely supported one, so Tika should reflect this.




[jira] [Commented] (TIKA-4062) OfflineContentHandler/ContentHandlerDecorator does not provide option for custom error handling

2023-08-02 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-4062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17750344#comment-17750344
 ] 

Nick Burch commented on TIKA-4062:
--

Between holidays and the length of time needed for regression runs + votes, I 
suspect late August / early September

> OfflineContentHandler/ContentHandlerDecorator does not provide option for 
> custom error handling
> ---
>
> Key: TIKA-4062
> URL: https://issues.apache.org/jira/browse/TIKA-4062
> Project: Tika
>  Issue Type: Bug
>  Components: tika-core
>Affects Versions: 2.3.0, 2.4.0, 2.5.0, 2.6.0, 2.7.0, 2.8.0
>Reporter: Ravi Ranjan Jha
>Priority: Critical
> Fix For: 2.8.1
>
>
> OfflineContentHandler/ContentHandlerDecorator does not provide option for 
> custom error handling
> Prior to the change of passing OfflineContentHandler to SAX Parser in 
> XMLReaderUtils.parseSAX, one could pass a custom ContentHandlerDecorator to 
> handle exception or override error/warning etc methods. The same is not 
> possible now because the default impl for handleException in the 
> OfflineContentHandler's parent ContentHandlerDecorator just throws exception 
> as shown below:
>  
>  protected void handleException(SAXException exception) throws SAXException {
>         throw exception;
>     }
>  
> which could probably be (at minimum)
> public void handleException(SAXException exception) throws SAXException {
>         handler.handleException(exception);
>     }
>  
> This is breaking our app's behavior. Please take it as priority.




[jira] [Commented] (TIKA-4064) Update to 2.8.1

2023-07-28 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17748454#comment-17748454
 ] 

Nick Burch commented on TIKA-4064:
--

Depends if anyone else on the PMC has the time to be release manager for it 
(sadly I don't). If we're relying on Tim once more, I suspect early September, 
as Tim's busy for a few weeks before he could start the release process going

> Update to 2.8.1
> ---
>
> Key: TIKA-4064
> URL: https://issues.apache.org/jira/browse/TIKA-4064
> Project: Tika
>  Issue Type: Task
>  Components: build
>Affects Versions: 2.8.0
>Reporter: Tilman Hausherr
>Priority: Minor
> Fix For: 2.8.1
>
>
> The latest maven versions plugin finds much more outdated stuff than the 
> previous one, so I'll do a few updates.




[jira] [Commented] (TIKA-3948) Require Java 11 in 3.x

2023-07-28 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17748452#comment-17748452
 ] 

Nick Burch commented on TIKA-3948:
--

[~solomax] I think the first task is to identify any other areas of Tika that 
will be affected by the switch. That may be an explicit dependency, but I fear 
it's more likely to be things a long way down the dependency tree in something 
(probably one of the scientific parsers with more sporadic updates). 

Once we know all the places that'll be affected, then we can come up with a 
plan for any changes needed directly in Tika, and a plan for any dependencies 
which need updates but where upstream haven't/won't do the matching ones. And 
then we can think about a preview release :)

> Require Java 11 in 3.x
> --
>
> Key: TIKA-3948
> URL: https://issues.apache.org/jira/browse/TIKA-3948
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>  Labels: tika-3x
>





[jira] [Commented] (TIKA-4098) Detection fails on PDF with garbage before header

2023-07-10 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-4098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17741578#comment-17741578
 ] 

Nick Burch commented on TIKA-4098:
--

The more bytes beyond the start we check for the PDF marker, the more likely we 
are to mis-identify a different file as a PDF. The %PDF- marker is pretty 
unique at the start of a file, but progressively less so as the content 
continues. (Consider a markdown file of a talk on file formats, that could 
easily have the text "Look for %PDF- at the start" on page 10 and we don't want 
to mark the whole thing as a PDF!)

If you know for sure that a file is a PDF, just skip detection and tell Tika 
and we'll hand it off to the PDF parser!

If your use case has very few text-based formats, you can fairly safely bump 
the search window up. Out-of-the-box, I'd be very worried to push it much 
further due to the false positive risk
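
The trade-off can be seen with a small search-window sketch. This is plain Java, not the Tika magic matcher; `maxOffset` is taken here to mean the largest offset at which the marker may start, and the names are made up.

```java
import java.nio.charset.StandardCharsets;

class PdfMarkerWindowSketch {
    private static final byte[] MARKER = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns the first offset no greater than maxOffset where "%PDF-" starts, or -1 if none. */
    static int findMarker(byte[] data, int maxOffset) {
        int last = Math.min(maxOffset, data.length - MARKER.length);
        for (int start = 0; start <= last; start++) {
            boolean hit = true;
            for (int i = 0; i < MARKER.length; i++) {
                if (data[start + i] != MARKER[i]) {
                    hit = false;
                    break;
                }
            }
            if (hit) {
                return start;
            }
        }
        return -1;
    }
}
```

A file with 702 bytes of garbage before the header is missed with a window of 512 and found with 768. A text file that merely mentions the marker a few hundred bytes in is also "found" with the wider window, and that is the false positive risk.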

> Detection fails on PDF with garbage before header
> -
>
> Key: TIKA-4098
> URL: https://issues.apache.org/jira/browse/TIKA-4098
> Project: Tika
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.8.0
>Reporter: Thierry Guérin
>Priority: Minor
> Attachments: garbageBeforeHeader.pdf
>
>
> PDF detection fails on files that contain too much garbage before the header 
> '%PDF-'.
> Those PDFs do not respect the specification, but are nonetheless correctly 
> handled by PDF viewers.
> The attached PDF is an example of the garbage found in a real-life PDF (looks 
> like email headers that 'leaked' onto the PDF file). The PDF itself is one 
> that I generated so that the example is small.
> The current magic for PDFs limits the search for the '%PDF-' header to 512 
> bytes, and in the attached PDF it's located after 702 garbage bytes.
> I looked at the sources of PdfBox and Ghostscript to see how they handle this 
> case and:
>  * Ghostscript searches through the entire file (see 
> [https://github.com/ArtifexSoftware/ghostpdl/blob/master/pdf/ghostpdf.c] 
> lines 1323-1339)
>  * PdfBox reads the file line by line, and stops looking for the header when  
> it encounters a line that starts with a digit (see 
> [https://github.com/apache/pdfbox/blob/trunk/pdfbox/src/main/java/org/apache/pdfbox/pdfparser/COSParser.java]
>  lines 1561-)
> From the doc in tika-mimetypes.xml for the application/pdf MIME type, I 
> understand that increasing the maximum offset can trigger false positives. I 
> increased it to 768, and the unit tests pass, but I didn't find any PDF that 
> tests this particular case, so either it doesn't exist or there are 
> integration tests that aren't part of this repo?
> How can I go about testing for regressions? I can provide a pull request for 
> this change, but where do I put the test PDF and a unit test?




[jira] [Commented] (TIKA-4060) Add magic to audio/aac in tika-mimetypes.xml

2023-06-08 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-4060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17730728#comment-17730728
 ] 

Nick Burch commented on TIKA-4060:
--

I'm a muppet... had forgotten to escape the hex characters in the regexp when 
transposing into a Tika mime magic match!

Now fixed and applied. Thanks for helping us find this magic

> Add magic to audio/aac in tika-mimetypes.xml
> 
>
> Key: TIKA-4060
> URL: https://issues.apache.org/jira/browse/TIKA-4060
> Project: Tika
>  Issue Type: Sub-task
>Reporter: Gregory Lepore
>Priority: Minor
> Attachments: 
> 067aece423d8694a891a61a45ac0e870914bc1314ef510ac40b36ca3397843ef, 
> cb1bec08898db7a733b42ac44bdd76b6177cd3a07a2435a83fd99b7453d564d1
>
>
> Currently tika-mimetypes only recognizes audio/aac files by the file 
> extension. PRONOM recently added support for identifying aac files, but the 
> signature is tricky. There are two signatures, shown below. In PRONOM format, 
> curly braces mean to look ahead between the two values for the subsequent patterns.
>  
> The first pattern is pretty basic; the second pattern is the first pattern 
> after a 2048-byte ID3 header.
>  
> ||Name|Audio Data Transport Stream sig.1|
> ||Description|An FF pattern from BOF with variation of byte stream|
> ||Byte sequences|
> ||Position type|Absolute from BOF|
> ||Offset|0|
> ||Maximum Offset|0|
> ||Byte order| |
> ||Value|FF(F0\|F1\|F8\|F9)(40\|41\|44\|45\|48\|49\|4C\|4D\|50\|51\|54\|55\|58\|59\|5C\|5D\|60\|61\|64\|65\|68\|69\|6C\|6D\|70\|71\|80\|81\|84\|85\|88\|89\|8C\|8D\|90\|91\|94\|95\|98\|99\|9C\|9D\|A0\|A1\|A4\|A5\|A8\|A9\|AC\|AD\|B0\|B1)(00\|01\|20\|40\|41\|60\|80\|81\|60\|A0\|C0\|C1\|E0)|
> |
> ||Name|Audio Data Transport Stream sig.2|
> ||Description|ID3 tag variation with variable byte stream|
> ||Byte sequences|
> ||Position type|Absolute from BOF|
> ||Offset|0|
> ||Maximum Offset|0|
> ||Byte order| |
> ||Value|494433\{0-2045}FF(F0\|F1\|F8\|F9)(40\|41\|44\|45\|48\|49\|4C\|4D\|50\|51\|54\|55\|58\|59\|5C\|5D\|60\|61\|64\|65\|68\|69\|6C\|6D\|70\|71\|80\|81\|84\|85\|88\|89\|8C\|8D\|90\|91\|94\|95\|98\|99\|9C\|9D\|A0\|A1\|A4\|A5\|A8\|A9\|AC\|AD\|B0\|B1)(00\|01\|20\|40\|41\|60\|80\|81\|60\|A0\|C0\|C1\|E0)|
> |




[jira] [Resolved] (TIKA-4060) Add magic to audio/aac in tika-mimetypes.xml

2023-06-08 Thread Nick Burch (Jira)



 [ 
https://issues.apache.org/jira/browse/TIKA-4060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Burch resolved TIKA-4060.
--
Fix Version/s: 2.8.1
   Resolution: Fixed

> Add magic to audio/aac in tika-mimetypes.xml
> 
>
> Key: TIKA-4060
> URL: https://issues.apache.org/jira/browse/TIKA-4060
> Project: Tika
>  Issue Type: Sub-task
>Reporter: Gregory Lepore
>Priority: Minor
> Fix For: 2.8.1
>
> Attachments: 
> 067aece423d8694a891a61a45ac0e870914bc1314ef510ac40b36ca3397843ef, 
> cb1bec08898db7a733b42ac44bdd76b6177cd3a07a2435a83fd99b7453d564d1
>
>




[jira] [Commented] (TIKA-4060) Add magic to audio/aac in tika-mimetypes.xml

2023-06-08 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-4060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17730649#comment-17730649
 ] 

Nick Burch commented on TIKA-4060:
--

0x494433 is the string ID3, which I think ought to be at the start. It is in 
the handful of files I've found. The rest of the magic is pretty vague and a 
little prone to false positives, so I'm reluctant to match on the string "ID3" 
anywhere in the first 2kb and then the vague 3 bytes somewhere else further on.

I've tried to make the matches a little "tighter" to hopefully reduce false 
positives, just seem to have gone too tight - the test file I produced with ID3 
tags does have the ID3 at the start. The hex dump key sections are:

{{0000 49 44 33 03 00 00 00 00 09 6b 54 50 45 31 00 00 |ID3......kTPE1..|}}
{{0010 00 0c 00 00 00 54 65 73 74 20 41 72 74 69 73 74 |.....Test Artist|}}
{{...}}
{{0090 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|}}
{{*}}
{{04f0 00 00 00 00 00 ff f1 50 80 32 5f fc de 02 00 4c |.......P.2_....L|}}
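
For reference, the dump above is consistent with the ID3v2 header layout: bytes 6 to 9 (00 00 09 6b) are the tag size as a syncsafe integer, which gives 1259, so the tag ends at 10 + 1259 = 0x4f5, exactly where the ff f1 begins. A sketch of a check built on that is below. It is not the magic that was committed, it ignores the optional ID3v2.4 footer, it needs the whole tag to be inside the bytes available, and the names are hypothetical.

```java
class AdtsAfterId3Sketch {
    /** True if an ADTS frame header starts at pos: 12 sync bits set and both layer bits zero. */
    static boolean isAdtsSync(byte[] d, int pos) {
        return pos >= 0 && pos + 1 < d.length
                && (d[pos] & 0xFF) == 0xFF
                && (d[pos + 1] & 0xF6) == 0xF0;
    }

    /** Offset just past an ID3v2 tag at the start; 0 if there is no tag; -1 if the header is cut off. */
    static int id3v2End(byte[] d) {
        if (d.length < 3 || d[0] != 'I' || d[1] != 'D' || d[2] != '3') {
            return 0;
        }
        if (d.length < 10) {
            return -1;
        }
        // Bytes 6..9: tag size as a 28-bit syncsafe integer (7 bits per byte),
        // not counting the 10-byte header itself.
        int size = ((d[6] & 0x7F) << 21) | ((d[7] & 0x7F) << 14)
                | ((d[8] & 0x7F) << 7) | (d[9] & 0x7F);
        return 10 + size;
    }

    /** ADTS sync either at offset 0 or directly after an ID3v2 tag that starts at offset 0. */
    static boolean looksLikeAac(byte[] d) {
        return isAdtsSync(d, id3v2End(d));
    }
}
```

The mask 0xF6 on the second byte accepts F0, F1, F8 and F9, the same four values as the second byte in the PRONOM signature.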

> Add magic to audio/aac in tika-mimetypes.xml
> 
>
> Key: TIKA-4060
> URL: https://issues.apache.org/jira/browse/TIKA-4060
> Project: Tika
>  Issue Type: Sub-task
>Reporter: Gregory Lepore
>Priority: Minor
> Attachments: 
> 067aece423d8694a891a61a45ac0e870914bc1314ef510ac40b36ca3397843ef, 
> cb1bec08898db7a733b42ac44bdd76b6177cd3a07a2435a83fd99b7453d564d1
>
>
> Currently tika-mimetypes only recognizes audio/aac files by the file 
> extension. PRONOM recently added support for identifying AAC files, but the 
> signature is tricky. There are two signatures, given below. In PRONOM format, 
> curly braces mean to look ahead between the two values for the subsequent 
> patterns.
>  
> The first pattern is pretty basic; the second pattern is the first pattern 
> after an ID3 header of up to 2048 bytes.
>  
> ||Name|Audio Data Transport Stream sig.1|
> ||Description|An FF pattern from BOF with variation of byte stream|
> ||Byte sequences|
> ||Position type|Absolute from BOF|
> ||Offset|0|
> ||Maximum Offset|0|
> ||Byte order| |
> ||Value|FF(F0\|F1\|F8\|F9)(40\|41\|44\|45\|48\|49\|4C\|4D\|50\|51\|54\|55\|58\|59\|5C\|5D\|60\|61\|64\|65\|68\|69\|6C\|6D\|70\|71\|80\|81\|84\|85\|88\|89\|8C\|8D\|90\|91\|94\|95\|98\|99\|9C\|9D\|A0\|A1\|A4\|A5\|A8\|A9\|AC\|AD\|B0\|B1)(00\|01\|20\|40\|41\|60\|80\|81\|60\|A0\|C0\|C1\|E0)|
> |
> ||Name|Audio Data Transport Stream sig.2|
> ||Description|ID3 tag variation with variable byte stream|
> ||Byte sequences|
> ||Position type|Absolute from BOF|
> ||Offset|0|
> ||Maximum Offset|0|
> ||Byte order| |
> ||Value|494433\{0-2045}FF(F0\|F1\|F8\|F9)(40\|41\|44\|45\|48\|49\|4C\|4D\|50\|51\|54\|55\|58\|59\|5C\|5D\|60\|61\|64\|65\|68\|69\|6C\|6D\|70\|71\|80\|81\|84\|85\|88\|89\|8C\|8D\|90\|91\|94\|95\|98\|99\|9C\|9D\|A0\|A1\|A4\|A5\|A8\|A9\|AC\|AD\|B0\|B1)(00\|01\|20\|40\|41\|60\|80\|81\|60\|A0\|C0\|C1\|E0)|
> |




[jira] [Commented] (TIKA-4060) Add magic to audio/aac in tika-mimetypes.xml

2023-06-07 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-4060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17730304#comment-17730304
 ] 

Nick Burch commented on TIKA-4060:
--

I have created some small test AAC files using ffmpeg, and then had a go at 
adding the mime magic for the two cases. 

However, detection of the ID3 header case isn't working. Can anyone spot what 
I've done wrong? https://github.com/apache/tika/tree/TIKA-4060

> Add magic to audio/aac in tika-mimetypes.xml
> 
>
> Key: TIKA-4060
> URL: https://issues.apache.org/jira/browse/TIKA-4060
> Project: Tika
>  Issue Type: Sub-task
>Reporter: Gregory Lepore
>Priority: Minor
> Attachments: 
> 067aece423d8694a891a61a45ac0e870914bc1314ef510ac40b36ca3397843ef, 
> cb1bec08898db7a733b42ac44bdd76b6177cd3a07a2435a83fd99b7453d564d1
>
>
> Currently tika-mimetypes only recognizes audio/aac files by the file 
> extension. PRONOM recently added support for identifying AAC files, but the 
> signature is tricky. There are two signatures, given below. In PRONOM format, 
> curly braces mean to look ahead between the two values for the subsequent 
> patterns.
>  
> The first pattern is pretty basic; the second pattern is the first pattern 
> after an ID3 header of up to 2048 bytes.
>  
> ||Name|Audio Data Transport Stream sig.1|
> ||Description|An FF pattern from BOF with variation of byte stream|
> ||Byte sequences|
> ||Position type|Absolute from BOF|
> ||Offset|0|
> ||Maximum Offset|0|
> ||Byte order| |
> ||Value|FF(F0\|F1\|F8\|F9)(40\|41\|44\|45\|48\|49\|4C\|4D\|50\|51\|54\|55\|58\|59\|5C\|5D\|60\|61\|64\|65\|68\|69\|6C\|6D\|70\|71\|80\|81\|84\|85\|88\|89\|8C\|8D\|90\|91\|94\|95\|98\|99\|9C\|9D\|A0\|A1\|A4\|A5\|A8\|A9\|AC\|AD\|B0\|B1)(00\|01\|20\|40\|41\|60\|80\|81\|60\|A0\|C0\|C1\|E0)|
> |
> ||Name|Audio Data Transport Stream sig.2|
> ||Description|ID3 tag variation with variable byte stream|
> ||Byte sequences|
> ||Position type|Absolute from BOF|
> ||Offset|0|
> ||Maximum Offset|0|
> ||Byte order| |
> ||Value|494433\{0-2045}FF(F0\|F1\|F8\|F9)(40\|41\|44\|45\|48\|49\|4C\|4D\|50\|51\|54\|55\|58\|59\|5C\|5D\|60\|61\|64\|65\|68\|69\|6C\|6D\|70\|71\|80\|81\|84\|85\|88\|89\|8C\|8D\|90\|91\|94\|95\|98\|99\|9C\|9D\|A0\|A1\|A4\|A5\|A8\|A9\|AC\|AD\|B0\|B1)(00\|01\|20\|40\|41\|60\|80\|81\|60\|A0\|C0\|C1\|E0)|
> |




[jira] [Commented] (TIKA-4051) Explore new parsers

2023-06-03 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-4051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17728992#comment-17728992
 ] 

Nick Burch commented on TIKA-4051:
--

Last time I asked the MPXJ project they weren't interested in switching, but 
it's always worth another try after a few years! Very old plugin is 
https://github.com/Gagravarr/MPXJ-Tika if anyone wants to help bring it a bit 
more up-to-date?

> Explore new parsers
> ---
>
> Key: TIKA-4051
> URL: https://issues.apache.org/jira/browse/TIKA-4051
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Minor
>
> Let's use this ticket as a parking lot for links to parsers that might be 
> interesting to integrate.
> Here's an ASL 2.0 RTFParser: [https://github.com/joniles/rtfparserkit/] 
> single developer, and release was last year.  We'd want to do a bakeoff 
> before making the switch, but it would be nice to offload our custom 
> RTFParser.
>  
> This library parses project plans: [https://github.com/joniles/mpxj] It is 
> LGPL, which is incompatible with ASL 2.0.  So it is a non-starter now, but if 
> there's interest in integrating with Tika, we might ask the mpxj project if 
> they'd have any interest in changing their license.
>  




[jira] [Commented] (TIKA-3999) audio/xm audio/x-mod

2023-05-23 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17725561#comment-17725561
 ] 

Nick Burch commented on TIKA-3999:
--

Oh, this brings back memories... good memories :)

Unless we can enlist the help of some dedicated members of the "demo scene", I 
think a parser is unlikely any time soon.

From the table provided (wow! thanks!), I think we can probably add a whole 
bunch of subtypes of {{audio/x-mod}} which we can then detect. Just need to 
use the regression suite to ensure that some of the shorter magic entries are 
sufficiently unique - the 2-4 byte ones worry me a little bit. May need to add 
some as subtype with file extension but not magic where it isn't unique enough

> audio/xm audio/x-mod
> 
>
> Key: TIKA-3999
> URL: https://issues.apache.org/jira/browse/TIKA-3999
> Project: Tika
>  Issue Type: Sub-task
>Reporter: Tim Allison
>Priority: Major
>





[jira] [Commented] (TIKA-4045) DBF/MDB row count extraction

2023-05-19 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-4045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724302#comment-17724302
 ] 

Nick Burch commented on TIKA-4045:
--

I guess this could also apply for other row-based formats like SQLite or 
Spreadsheets? Though I'm not sure how best to output it on a per-table / 
per-sheet basis.

For the metadata keys, I guess we could re-use the same ones as we added for 
CSV in TIKA-3938 ?

> DBF/MDB row count extraction
> 
>
> Key: TIKA-4045
> URL: https://issues.apache.org/jira/browse/TIKA-4045
> Project: Tika
>  Issue Type: Improvement
>Reporter: Gregory Lepore
>Priority: Minor
>
> It would be quite helpful for my organization to extract the number of 
> records/rows in any given database file format like DBF or MDB. Along with 
> byte count this would give us a good idea of the amount of information stored 
> in the files.




[jira] [Commented] (TIKA-4025) Extract frame count from gifs

2023-05-02 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-4025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17718674#comment-17718674
 ] 

Nick Burch commented on TIKA-4025:
--

Would a video metadata specification's frame count be a better home? 

XMP seems to have a pretty complex FrameCount type. From a quick glance I 
couldn't spot an obvious property using it, but I feel like there ought to be 
one...

> Extract frame count from gifs
> -
>
> Key: TIKA-4025
> URL: https://issues.apache.org/jira/browse/TIKA-4025
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Trivial
>
> Over on TIKA-4019, an animated gif example made me realize that we're not 
> currently extracting the number of frames for gifs into the metadata.  We 
> should do this.
>  
> Any recs for the name of the metadata key?




[jira] [Commented] (TIKA-3981) Tika parser meets window system file

2023-02-24 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17693140#comment-17693140
 ] 

Nick Burch commented on TIKA-3981:
--

Is this happening for all executables on your machine, or just some? And if 
just some, is there any pattern in which executables are showing sensible dates 
and which are showing future ones?

Does Windows Explorer show a more sensible date?

Can anyone reproduce this with a small file from an open source project?

(We have 8 test files in our test suite, all of which are coming back with 
sensible dates, so need some help to track down more details on this bug!)

> Tika parser meets window system file
> 
>
> Key: TIKA-3981
> URL: https://issues.apache.org/jira/browse/TIKA-3981
> Project: Tika
>  Issue Type: Bug
>Reporter: Tika User
>Priority: Major
> Attachments: ASK_Tika_Parser.docx
>
>
> Hi All,
>  
>    I ran the command "java -jar tika-app-2.7.0.jar" and loaded the 
> Windows system executable where.exe. 
>   You can find the file on your own Windows system at 
> c:\Windows\system32\where.exe.
>   Tika reports dcterms:created as "2037-03-05T20:49:08Z", and I am 
> confused by the future time. 
>   Could you help check why Tika gets this unusual created date, please?  
>  
>  Attached is my testing with several Tika versions, for your 
> reference. 
> Thank you.




[jira] [Commented] (TIKA-3973) Content of Ogg file with Opus encoded content not correctly recognized

2023-02-15 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17689199#comment-17689199
 ] 

Nick Burch commented on TIKA-3973:
--

If you only care about container-aware detection for Ogg based formats, you 
should be fine right now with just

{code:java}
implementation 'org.apache.tika:tika-core:2.7.0'
implementation 'org.gagravarr:vorbis-java-tika:0.8'
{code}

The Vorbis Tika module should pull in the other things it needs (such as core)

> Content of Ogg file with Opus encoded content not correctly recognized
> --
>
> Key: TIKA-3973
> URL: https://issues.apache.org/jira/browse/TIKA-3973
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 2.7.0
>Reporter: Adam Bialas
>Priority: Major
> Attachments: speech_output.ogg
>
>
> We are using tika-core:2.7.0 for file content detection. We have an Ogg file 
> which uses Opus audio codec (see attachment). When we try to detect content 
> with metadata:
>  
> {code:java}
> Metadata metadata = new Metadata(); 
> metadata.set(TikaCoreProperties.RESOURCE_NAME_KEY, 
> FilenameUtils.getName(url));{code}
> this file is recognized as audio/vorbis which is not ok. Can you please 
> verify?




[jira] [Commented] (TIKA-3973) Content of Ogg file with Opus encoded content not correctly recognized

2023-02-15 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17689176#comment-17689176
 ] 

Nick Burch commented on TIKA-3973:
--

For all container formats you want {{tika-parsers}} or {{tika-parsers-standard}}

If you only care about the Ogg formats, then {{vorbis-java-tika}} from 
{{org.gagravarr}} is enough

> Content of Ogg file with Opus encoded content not correctly recognized
> --
>
> Key: TIKA-3973
> URL: https://issues.apache.org/jira/browse/TIKA-3973
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 2.7.0
>Reporter: Adam Bialas
>Priority: Major
> Attachments: speech_output.ogg
>
>
> We are using tika-core:2.7.0 for file content detection. We have an Ogg file 
> which uses Opus audio codec (see attachment). When we try to detect content 
> with metadata:
>  
> {code:java}
> Metadata metadata = new Metadata(); 
> metadata.set(TikaCoreProperties.RESOURCE_NAME_KEY, 
> FilenameUtils.getName(url));{code}
> this file is recognized as audio/vorbis which is not ok. Can you please 
> verify?




[jira] [Comment Edited] (TIKA-3973) Content of Ogg file with Opus encoded content not correctly recognized

2023-02-15 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17689161#comment-17689161
 ] 

Nick Burch edited comment on TIKA-3973 at 2/15/23 2:38 PM:
---

For container-based detection (such as the Ogg container format), you really 
need to include the Tika Parsers jars too.

With the Ogg container detector enabled (which comes with the Tika media 
parsers), Tika can correctly detect the type as {{audio/opus}}

We have magic which will detect an opus file with a single stream if you're 
lucky, but with containers it's very hit-and-miss if you can tell with magic 
alone. Enabling the Ogg container detector is the best solution though, that 
should always work no matter what order the streams are in, what streams are 
contained etc


was (Author: gagravarr):
For container-based detection (such as the Ogg container format), you really 
need to include the Tika Parsers jars too.

With the Ogg container detector enabled (which comes with the Tika media 
parsers), Tika can correctly detect the type as {{audio/opus}}

We have magic which will detect an opus file with a single stream if you're 
lucky, but with containers it's very hit-and-miss if you can tell with magic 
alone. Enabling the Ogg container detector is the best solution though, that 
should always work no matter what order the streams are in, what streams are 
contained etc{{{}
{}}}

> Content of Ogg file with Opus encoded content not correctly recognized
> --
>
> Key: TIKA-3973
> URL: https://issues.apache.org/jira/browse/TIKA-3973
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 2.7.0
>Reporter: Adam Bialas
>Priority: Major
> Attachments: speech_output.ogg
>
>
> We are using tika-core:2.7.0 for file content detection. We have an Ogg file 
> which uses Opus audio codec (see attachment). When we try to detect content 
> with metadata:
>  
> {code:java}
> Metadata metadata = new Metadata(); 
> metadata.set(TikaCoreProperties.RESOURCE_NAME_KEY, 
> FilenameUtils.getName(url));{code}
> this file is recognized as audio/vorbis which is not ok. Can you please 
> verify?




[jira] [Commented] (TIKA-3973) Content of Ogg file with Opus encoded content not correctly recognized

2023-02-15 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17689161#comment-17689161
 ] 

Nick Burch commented on TIKA-3973:
--

For container-based detection (such as the Ogg container format), you really 
need to include the Tika Parsers jars too.

With the Ogg container detector enabled (which comes with the Tika media 
parsers), Tika can correctly detect the type as {{audio/opus}}

We have magic which will detect an opus file with a single stream if you're 
lucky, but with containers it's very hit-and-miss if you can tell with magic 
alone. Enabling the Ogg container detector is the best solution though, that 
should always work no matter what order the streams are in, what streams are 
contained etc
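
To illustrate why magic alone is hit-and-miss here: the codec name is not at a
fixed offset in the file, it sits in the first packet of each logical stream,
after an Ogg page header whose length depends on the segment table. The sketch
below is plain Java with invented names, not the Ogg container detector from
the Tika media parsers, and it only looks at the very first page. A file whose
first stream is something else (or which multiplexes several streams) would
need the pages walked properly, which is what the container detector is for.

```java
import java.nio.charset.StandardCharsets;

final class OggFirstPacketSketch {

    /** Guesses the codec of the FIRST logical stream only: "opus", "vorbis" or "unknown". */
    static String firstStreamCodec(byte[] data) {
        // An Ogg page starts with "OggS"; the fixed header is 27 bytes and
        // byte 26 gives the number of entries in the segment table that follows.
        if (data.length < 27 || data[0] != 'O' || data[1] != 'g' || data[2] != 'g' || data[3] != 'S') {
            return "unknown";
        }
        int payload = 27 + (data[26] & 0xFF);
        if (startsWith(data, payload, "OpusHead".getBytes(StandardCharsets.US_ASCII))) {
            return "opus";
        }
        byte[] vorbisId = {0x01, 'v', 'o', 'r', 'b', 'i', 's'}; // Vorbis identification header
        if (startsWith(data, payload, vorbisId)) {
            return "vorbis";
        }
        return "unknown";
    }

    /** True if data contains prefix starting at offset. */
    static boolean startsWith(byte[] data, int offset, byte[] prefix) {
        if (offset < 0 || offset + prefix.length > data.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[offset + i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Hand-built first page: header, a one-entry segment table, then an Opus packet start
        byte[] page = new byte[27 + 1 + 19];
        page[0] = 'O'; page[1] = 'g'; page[2] = 'g'; page[3] = 'S';
        page[26] = 1;
        byte[] head = "OpusHead".getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(head, 0, page, 28, head.length);
        System.out.println(firstStreamCodec(page));
    }
}
```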

> Content of Ogg file with Opus encoded content not correctly recognized
> --
>
> Key: TIKA-3973
> URL: https://issues.apache.org/jira/browse/TIKA-3973
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 2.7.0
>Reporter: Adam Bialas
>Priority: Major
> Attachments: speech_output.ogg
>
>
> We are using tika-core:2.7.0 for file content detection. We have an Ogg file 
> which uses Opus audio codec (see attachment). When we try to detect content 
> with metadata:
>  
> {code:java}
> Metadata metadata = new Metadata(); 
> metadata.set(TikaCoreProperties.RESOURCE_NAME_KEY, 
> FilenameUtils.getName(url));{code}
> this file is recognized as audio/vorbis which is not ok. Can you please 
> verify?




[jira] [Commented] (TIKA-3960) PGP encrypted files get detected as application/octet-stream

2023-01-30 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17682352#comment-17682352
 ] 

Nick Burch commented on TIKA-3960:
--

If possible, please include a small test file and update 
{{tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java}} to test 
the detection

> PGP encrypted files get detected as application/octet-stream
> 
>
> Key: TIKA-3960
> URL: https://issues.apache.org/jira/browse/TIKA-3960
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 2.6.0
>Reporter: Tayseer Sabha
>Priority: Major
>
> We use Tika for detecting and validating uploaded files using their 
> content/magic bytes and not only their names/extension.
> Passing a PGP/GPG encrypted file to Tika.detect(InputStream stream) will 
> always return application/octet-stream instead of application/pgp-encrypted 
> defined in tika-mimetypes.xml
> The issue occurs because the application/pgp-encrypted mime-type defined in 
> tika-mimetypes.xml is lacking a magic match and only has <glob pattern="*.pgp"/>
> I managed to fix the issue for us temporarily by adding 
> application/pgp-encrypted including a magic match in our custom-mimetypes.xml 
> file. I will create a Pull Request on Github with the fix to resolve the 
> issue.




[jira] [Commented] (TIKA-3703) Consider adding a frictionless data package output format

2023-01-16 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17677364#comment-17677364
 ] 

Nick Burch commented on TIKA-3703:
--

I guess we could include a data package metadata file to better describe the 
other files in the zip? 
[https://specs.frictionlessdata.io/data-package/#introduction]

That might make it "more standard" for people to understand what they've got 
and why

> Consider adding a frictionless data package output format
> -
>
> Key: TIKA-3703
> URL: https://issues.apache.org/jira/browse/TIKA-3703
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> For those who want more than just text and metadata, e.g. bytes for 
> thumbnails, or embedded images or embedded files or rendered pages, it would 
> be great to return that data in a standard format. Our current /unpack 
> endpoint uses a zip file but with our own "standard".
> I was thinking about heading down the pure json option by including these 
> byte streams as base64 encoded metadata values in our current metadata 
> object. Not sure which is the better way to go.
> I'm opening this issue to discuss options.
>  
> Reference: [https://frictionlessdata.io/standards/#standards-toolkit]
> We'd want to make this available as an endpoint on tika-server 
> (\{{/v2/unpack}} or something else?) and as a commandline option in tika-app.




[jira] [Commented] (TIKA-3703) Consider adding a frictionless data package output format

2023-01-16 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17677326#comment-17677326
 ] 

Nick Burch commented on TIKA-3703:
--

A zip file gives you compression, and most clients won't accidentally try to 
buffer it in memory. JSON with base64-encoded data is negative compression, 
and carries a high risk of clients OOM-ing by trying to fit all of the raw JSON 
and parsed JSON in memory at once.

(If it was just thumbnails then I could see some advantages of JSON, but it 
also works on container formats with potentially huge contents)

In terms of recursion, I think it should be off on the default endpoint (as 
now), but with another that supports it. Maybe eg {{/unpack}} and 
{{/unpack/recursive}} ?
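
The "negative compression" point can be put in numbers: Base64 emits 4
characters for every 3 input bytes, so any embedded file grows by a third
before JSON quoting is even considered, while a zip entry of compressible data
shrinks. A small sketch using only the JDK (the class name, entry name and
sample text are invented for illustration; this is not tika-server code):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

final class UnpackSizeSketch {

    /** Number of Base64 characters needed to carry raw inside a JSON string. */
    static int base64Length(byte[] raw) {
        return Base64.getEncoder().encode(raw).length;
    }

    /** Total size of a zip archive holding raw as a single deflated entry. */
    static int zippedLength(byte[] raw, String entryName) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.write(raw);
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.size();
    }

    public static void main(String[] args) {
        // 3000 bytes of repetitive text standing in for an embedded document
        StringBuilder text = new StringBuilder();
        while (text.length() < 3000) {
            text.append("Apache Tika extracts text and metadata. ");
        }
        byte[] raw = text.substring(0, 3000).getBytes(StandardCharsets.US_ASCII);
        System.out.println("raw bytes:    " + raw.length);
        System.out.println("base64 chars: " + base64Length(raw)); // 4 chars per 3 bytes
        System.out.println("zip bytes:    " + zippedLength(raw, "embedded.txt"));
    }
}
```

Already compressed content (images, video) won't shrink in the zip, but it
still won't grow by a third the way it does in Base64.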

> Consider adding a frictionless data package output format
> -
>
> Key: TIKA-3703
> URL: https://issues.apache.org/jira/browse/TIKA-3703
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> For those who want more than just text and metadata, e.g. bytes for 
> thumbnails, or embedded images or embedded files or rendered pages, it would 
> be great to return that data in a standard format. Our current /unpack 
> endpoint uses a zip file but with our own "standard".
> I was thinking about heading down the pure json option by including these 
> byte streams as base64 encoded metadata values in our current metadata 
> object. Not sure which is the better way to go.
> I'm opening this issue to discuss options.
>  
> Reference: [https://frictionlessdata.io/standards/#standards-toolkit]
> We'd want to make this available as an endpoint on tika-server 
> (\{{/v2/unpack}} or something else?) and as a commandline option in tika-app.




[jira] [Commented] (TIKA-3955) separate dependencies from tika-app-2.6.0-noasm-nojson

2023-01-12 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17675914#comment-17675914
 ] 

Nick Burch commented on TIKA-3955:
--

The Tika App is intended as a "batteries included" standalone app.

If you are adding Tika to a Java app, you should add the Java library. Include 
`tika-core` and as many of the `tika-parser-*` parsers as your application 
needs. Doing that via Maven or Gradle will allow you to manage any dependency 
clashes

> separate dependencies from tika-app-2.6.0-noasm-nojson 
> ---
>
> Key: TIKA-3955
> URL: https://issues.apache.org/jira/browse/TIKA-3955
> Project: Tika
>  Issue Type: Wish
>Reporter: Dhoka Pramod
>Priority: Major
>
> Hi Team,
> We are using tika-app-2.6.0-noasm-nojson.jar (uber jar) and it is bundled 
> with all the required third-party jars as mentioned below
> activation-1.1.1.jar
> bcmail-jdk18on-1.72.jar
> bcpkix-jdk18on-1.72.jar
> bcprov-jdk18on-1.72.jar
> byte-buddy-1.12.7.jar
> commons-cli-1.4.jar
> commons-codec-1.15.jar
> commons-collections4-4.1.jar
> commons-compress-1.21.jar
> commons-exec-1.0.jar
> commons-io-2.11.0.jar
> commons-lang3-3.8.1.jar
> commons-logging-1.1.1.jar
> gson-2.9.0.jar
> jackson-core-2.14.0.jar
> jackson-databind-2.14.0.jar
> jaxb-impl-2.1.13.jar
> jaxen-1.1.6.jar
> juniversalchardet-1.0.3.jar
> log4j-api-2.19.0.jar
> log4j-core-2.19.0.jar
> slf4j-api-1.7.36.jar
> xercesImpl.jar
> xmlbeans-3.1.0.jar
> Our application also adds the above jars as it requires. This is leading to 
> duplicate classes on the classpath. Could you provide the tika-app jar 
> (skinny jar) and a list of required dependencies so that we will add them to 
> our application classpath to avoid duplicates.
> Thank you.




[jira] [Commented] (TIKA-3952) Content mismatch

2023-01-09 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656060#comment-17656060
 ] 

Nick Burch commented on TIKA-3952:
--

Is the PDF a scan? Are you doing OCR?

> Content mismatch 
> -
>
> Key: TIKA-3952
> URL: https://issues.apache.org/jira/browse/TIKA-3952
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Tika User
>Priority: Major
> Attachments: download.pdf
>
>
> While extracting content of attached file. We are seeing below content 
> mismatch.
> Native file content  : 95 (1972); Erznoznik v. City of Jacksonville
> Content we got from Tika : 95 (1972); Er{*}e{*}noznik v. City of Jacksonville
>  
> Native file content   : 438 U.S.\n726
> Content we got from Tika : 438 {*}U-S{*}.\n726




[jira] [Commented] (TIKA-3952) Content mismatch

2023-01-09 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656049#comment-17656049
 ] 

Nick Burch commented on TIKA-3952:
--

Can you try following the steps in 
[https://cwiki.apache.org/confluence/display/TIKA/Troubleshooting+Tika#TroubleshootingTika-PDFTextProblems]
 ?

> Content mismatch 
> -
>
> Key: TIKA-3952
> URL: https://issues.apache.org/jira/browse/TIKA-3952
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Tika User
>Priority: Major
> Attachments: download.pdf
>
>
> While extracting content of attached file. We are seeing below content 
> mismatch.
> Native file content  : 95 (1972); Erznoznik v. City of Jacksonville
> Content we got from Tika : 95 (1972); Er{*}e{*}noznik v. City of Jacksonville
>  
> Native file content   : 438 U.S.\n726
> Content we got from Tika : 438 {*}U-S{*}.\n726




[jira] [Commented] (TIKA-2536) Move to later edu.ucar version to avoid EOL dependencies

2022-11-02 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-2536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17627638#comment-17627638
 ] 

Nick Burch commented on TIKA-2536:
--

We can only depend on versions in Maven Central; we can't depend on versions 
hosted elsewhere.

If newer versions have been formally released, ideally the project owners would 
upload them to Central. If they can't/won't and we can get that confirmed, we 
may be able to get them uploaded on their behalf, but it's much better and 
easier if the project owners upload themselves! OSSRH is often the best way for 
independent maintainers not part of a bigger foundation to get their releases 
into Central.

If the version currently in Maven Central will play nicely with a new version 
of a dependency, short-term we ought to be able to pull that in and exclude the 
old version. If it doesn't play nicely, our only option is to upgrade the whole 
lot, which needs to be in Central.

> Move to later edu.ucar version to avoid EOL dependencies
> 
>
> Key: TIKA-2536
> URL: https://issues.apache.org/jira/browse/TIKA-2536
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.16, 1.17
> Environment: All
>Reporter: Richard Jones
>Priority: Major
>
> The currently referenced 4.5.5 versions of edu.ucar:grib and edu.ucar:cdm 
> (released in Mar 2015), as well as being branch EOL themselves, depend on 
> many other project/branch/version EOL artifacts for which much later and 
> active versions are often available. The list is as follows:
> - edu.ucar:grib depends on the project EOL bzip2. Much more recent versions 
> of edu.ucar:grib exist that no longer depend on bzip2 (note: Jbzip2 is hosted 
> on the Google Code site, which was shut down for active development in 2015.  
> The project was never migrated to another site, e.g. Github).
> - edu.ucar:grib depends on the 2.0.4 EOL version of org.jdom:jdom2
> - edu.ucar:cdm depends on the 2.6.2 branch EOL version of 
> net.sf.ehcache:ehcache-core
> - edu.ucar:cdm depends on the 2.2.0 EOL version of 
> org.quartz-scheduler:quartz for which active versions are available. In turn 
> org.quartz-scheduler:quartz depends on the 0.9.1.1 branch EOL version of 
> c3p0:c3p0. Later versions of quartz have moved to the active com.mchange:c3p0
> - edu.ucar:grib depends on the 2.5.0 branch EOL version of 
> com.google.protobuf:protobuf-java for which active versions are available.
> Request moving to a much later version of edu.ucar, or alternative artifacts 
> to address all the above EOL issues (lack of active support for 
> vulnerabilities and bugs).




[jira] [Commented] (TIKA-3890) Identifying an efficient approach for getting page count prior to running an extraction

2022-10-19 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17620633#comment-17620633
 ] 

Nick Burch commented on TIKA-3890:
--

DOCX files are compressed XML. Text compresses very well. Already compressed 
images, audio and video don't.

An 8 MB Word document of pure text could fairly easily produce 10x that in 
text. An 8 MB Word document that's mostly images could produce just a few bytes 
of text.

For DOCX specifically, you could open the file in POI (use a File to save 
memory) and check the size of the Word XML stream and the size of any 
attachments; that'd give you a vague idea. However, it won't give you a 
complete answer, as the Word XML could have loads of complex stuff in it that 
doesn't end up with text output...

Easiest way to know the size of the output is just to parse it on a beefy 
machine with suitable restarts / respawning in place, and see what you get!
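
A rough sketch of the "check the sizes" idea, for what it's worth. This swaps
in the JDK's java.util.zip for POI, since a DOCX is a zip package and the zip
directory already records each entry's uncompressed size without inflating
anything. It assumes the main document part lives at the conventional path
word/document.xml; the class and helper names are invented, and the numbers are
only the vague indicator described above, not a prediction of extracted text
length.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

final class DocxSizeSketch {

    /** Returns {uncompressed bytes of word/document.xml, uncompressed bytes of all other entries}. */
    static long[] partSizes(Path docx) {
        long mainPart = 0;
        long others = 0;
        try (ZipFile zip = new ZipFile(docx.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                long size = Math.max(entry.getSize(), 0); // getSize() is -1 when unknown
                if (entry.getName().equals("word/document.xml")) {
                    mainPart += size;
                } else {
                    others += size;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new long[] {mainPart, others};
    }

    /** Writes a tiny stand-in package (not a valid DOCX) so the sketch can be run without a real file. */
    static Path writeSample(String documentXml, byte[] media) {
        try {
            Path tmp = Files.createTempFile("size-sketch", ".docx");
            try (ZipOutputStream zip = new ZipOutputStream(Files.newOutputStream(tmp))) {
                zip.putNextEntry(new ZipEntry("word/document.xml"));
                zip.write(documentXml.getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
                zip.putNextEntry(new ZipEntry("word/media/image1.png"));
                zip.write(media);
                zip.closeEntry();
            }
            return tmp;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path docx = args.length > 0 ? Path.of(args[0]) : writeSample("<w:document>hello</w:document>", new byte[50]);
        long[] sizes = partSizes(docx);
        System.out.println("word/document.xml: " + sizes[0] + " bytes uncompressed");
        System.out.println("everything else:   " + sizes[1] + " bytes uncompressed");
    }
}
```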

> Identifying an efficient approach for getting page count prior to running an 
> extraction
> ---
>
> Key: TIKA-3890
> URL: https://issues.apache.org/jira/browse/TIKA-3890
> Project: Tika
>  Issue Type: Improvement
>  Components: app
>Affects Versions: 2.5.0
> Environment: OS: OSX, Processor: M1 (ARM), RAM: 64GB, CPU: 8 cores
> Docker container with 5.5GB reserved memory, 6GB limit
> Tika config w/ 2GB reserved memory, 5GB limit 
>Reporter: Ethan Wilansky
>Priority: Blocker
>
> Tika is doing a great job with text extraction, until we encounter an Office 
> document with an  unreasonably large number of pages with extractable text. 
> For example a Word document containing thousands of text pages. 
> Unfortunately, we don't have an efficient way to determine page count before 
> calling the /tika or /rmeta endpoints and either getting back an array 
> allocation error or setting  byteArrayMaxOverride to a large number to return 
> the text or metadata containing the page count. Returning a result other than 
> the array allocation error can take significant time.
> For example, this call:
> {{curl -T ./8mb.docx -H "Content-Type: 
> application/vnd.openxmlformats-officedocument.wordprocessingml.document" 
> [http://localhost:9998/rmeta/ignore]}}
> {quote}{{with the configuration:}}
> {{}}
> {{}}
> {{  }}
> {{    }}
> {{       class="org.apache.tika.parser.ocr.TesseractOCRParser"/>}}
> {{       class="org.apache.tika.parser.microsoft.OfficeParser"/>}}
> {{    }}
> {{    }}
> {{      }}
> {{        17500}}
> {{      }}
> {{    }}
> {{  }}
> {{  }}
> {{    }}
> {{      12}}
> {{      }}
> {{        -Xms2000m}}
> {{        -Xmx5000m}}
> {{      }}
> {{    }}
> {{  }}
> {{}}
> {quote}
> returns: {{"xmpTPg:NPages":"14625"}} in ~53 seconds.
> Yes, I know this is a huge docx file and I don't want to process it. If I 
> don't configure {{byteArrayMaxOverride}} I get this exception in just over a 
> second:
> {{Tried to allocate an array of length 172,983,026, but the maximum length 
> for this record type is 100,000,000.}}
> The exception is the preferred result. With that in mind, can you answer 
> these questions?
> 1. Will other extractable file types that don't use the OfficeParser also 
> throw the same array allocation error for very large text extractions? 
> 2. Is there any way to correlate the array length returned to the number of 
> lines or pages in the associated file to parse?
> 3. Is there an efficient way to calculate lines or pages of extractable 
> content in a file before sending it for extraction? It doesn't appear that 
> /rmeta with the /ignore path param significantly improves efficiency over 
> calling the /tika endpoint or /rmeta without /ignore.
> If it's useful, I can share the 8MB docx file containing 14k pages.




[jira] [Commented] (TIKA-3890) Identifying an efficient approach for getting page count prior to running an extraction

2022-10-19 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17620610#comment-17620610
 ] 

Nick Burch commented on TIKA-3890:
--

The only way to be sure of how many pages are in a Word document is to render 
it (to screen / PDF / printer)

Some Word files get lucky and have a sensible number in the metadata set by 
Word from when it last opened the file and felt like populating statistics, but 
that's by no means always the case

If you're fairly sure your documents have sensible metadata, you could always 
pre-process with Apache POI. If you provide a File object and only read the 
metadata streams, it's pretty memory efficient to query
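
A minimal sketch of that pre-processing idea for .docx files, written against 
the JDK only rather than Apache POI (an assumption made to keep the example 
dependency-free). It relies on an OOXML package being a ZIP file in which Word 
keeps its last-computed statistics in docProps/app.xml, with the page count in 
a Pages element. The class and method names are hypothetical, and the value is 
only as trustworthy as whatever Word last wrote there.

```java
import java.io.InputStream;
import java.nio.file.Path;
import java.util.OptionalInt;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.NodeList;

// Hypothetical helper: peeks at the stored page count without parsing the body.
public class DocxPageCountPeek {

    /** Returns the page count stored in docProps/app.xml, or empty if absent. */
    public static OptionalInt storedPageCount(Path docx) throws Exception {
        // ZipFile reads the central directory, so only app.xml gets inflated.
        try (ZipFile zip = new ZipFile(docx.toFile())) {
            ZipEntry entry = zip.getEntry("docProps/app.xml");
            if (entry == null) {
                return OptionalInt.empty();
            }
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            // Refuse DOCTYPEs, since the input may be untrusted.
            dbf.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            try (InputStream in = zip.getInputStream(entry)) {
                NodeList pages = dbf.newDocumentBuilder().parse(in)
                        .getElementsByTagNameNS("*", "Pages");
                if (pages.getLength() == 0) {
                    return OptionalInt.empty();
                }
                String text = pages.item(0).getTextContent().trim();
                return OptionalInt.of(Integer.parseInt(text));
            }
        } catch (NumberFormatException e) {
            // A non-numeric value is treated the same as a missing one.
            return OptionalInt.empty();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(storedPageCount(Path.of(args[0])));
    }
}
```

An empty result means "unknown", not "zero pages", so a caller would still 
need a fallback for files where Word never populated the statistics.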





[jira] [Commented] (TIKA-3850) Spanish text is incorrectly detected as Galician

2022-09-13 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17603483#comment-17603483
 ] 

Nick Burch commented on TIKA-3850:
--

The kind of statistical language model used in Tika struggles with very short 
text. What happens if you feed in a longer block of Spanish-language text?

> Spanish text is incorrectly detected as Galician
> 
>
> Key: TIKA-3850
> URL: https://issues.apache.org/jira/browse/TIKA-3850
> Project: Tika
>  Issue Type: Bug
>  Components: languageidentifier
>Affects Versions: 2.4.1
> Environment: org.apache.tika:tika-langdetect-optimaize:2.4.1
> org.apache.tika:tika-core:2.4.1
>Reporter: Lenne Hendrickx
>Priority: Minor
>
> The following Spanish text is incorrectly detected as Galician.
> {noformat}
> Hola! Donde puedo contactar para una garantía?{noformat}
> The es and gl models are loaded into the language detector.
> Language result:
> {noformat}
> language: gl
> score: 0.95{noformat}




[jira] [Commented] (TIKA-3308) SVG file without xml declaration tag is detected as text/plain

2022-09-12 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17603038#comment-17603038
 ] 

Nick Burch commented on TIKA-3308:
--

Our HTML mime type has both root-XML tags for well-formed documents, and a 
bunch of magic for the rest. So, adding some magic as well for these documents 
is in theory possible

Checking for {{http://www.w3.org/2000/svg"}} with a decent priority 
should be fine, but I'm not sure we'd want to look for just {{

> SVG file without xml declaration tag is detected as text/plain
> --
>
> Key: TIKA-3308
> URL: https://issues.apache.org/jira/browse/TIKA-3308
> Project: Tika
>  Issue Type: Bug
>  Components: mime
>Affects Versions: 1.25
>Reporter: Anas Hammani
>Priority: Minor
> Attachments: logo-luma.svg
>
>
> The SVG file attached to the issue is interpreted as *text/plain* by
> {code:java}
> tika.detect(filePath){code}
>  
> If I add 
> {code:java}
>   {code}
> at the beginning of the file, then tika detects it as  "image/svg+xml"
>  
> When I read the documentation, I see that the XML declaration is not 
> necessary for a file to be well-formed
> [https://www.w3.org/TR/REC-xml/#sec-prolog-dtd]
>  
> It would be great if Tika could detect a file as SVG without the prolog
>  




[jira] [Commented] (TIKA-3832) Required array length is too large (OOM) error when reading a PDF file

2022-08-05 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17575814#comment-17575814
 ] 

Nick Burch commented on TIKA-3832:
--

Any chance you could try with Apache PDFBox directly? They've got a handy 
command line tool you can use:

[https://cwiki.apache.org/confluence/display/TIKA/Troubleshooting+Tika#TroubleshootingTika-PDFTextProblems]

That will help us narrow down if it's a Tika bug, or one in the underlying 
PDFBox library

> Required array length is too large (OOM) error when reading a PDF file
> --
>
> Key: TIKA-3832
> URL: https://issues.apache.org/jira/browse/TIKA-3832
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.4.1
>Reporter: Lakatos Gyula
>Priority: Major
> Attachments: 7581cfbf-8c1e-4154-bfbb-4e633d858d5f.pdf
>
>
> I'm working on a web crawler and it got obliterated with an OutOfMemory error 
> by a random PDF from the internet.
> {code:java}
> Exception in thread "main" java.lang.OutOfMemoryError: Required array length 2147483638 + 14 is too large
>   at java.base/jdk.internal.util.ArraysSupport.hugeLength(ArraysSupport.java:649)
>   at java.base/jdk.internal.util.ArraysSupport.newLength(ArraysSupport.java:642)
>   at java.base/java.lang.AbstractStringBuilder.newCapacity(AbstractStringBuilder.java:257)
>   at java.base/java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:229)
>   at java.base/java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:740)
>   at java.base/java.lang.StringBuffer.append(StringBuffer.java:410)
>   at java.base/java.io.StringWriter.write(StringWriter.java:99)
>   at org.apache.tika.sax.ToTextContentHandler.characters(ToTextContentHandler.java:108)
>   at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:141)
>   at org.apache.tika.sax.WriteOutContentHandler.characters(WriteOutContentHandler.java:160)
>   at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:141)
>   at org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:81)
>   at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:141)
>   at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:141)
>   at org.apache.tika.sax.SafeContentHandler.access$201(SafeContentHandler.java:47)
>   at org.apache.tika.sax.SafeContentHandler.lambda$new$0(SafeContentHandler.java:57)
>   at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:106)
>   at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:250)
>   at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:270)
>   at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:295)
>   at org.apache.tika.parser.pdf.AbstractPDF2XHTML.extractBookmarkText(AbstractPDF2XHTML.java:977)
>   at org.apache.tika.parser.pdf.AbstractPDF2XHTML.extractBookmarkText(AbstractPDF2XHTML.java:981)
>   at org.apache.tika.parser.pdf.AbstractPDF2XHTML.extractBookmarkText(AbstractPDF2XHTML.java:959)
>   at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endDocument(AbstractPDF2XHTML.java:907)
>   at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:239)
>   at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:108)
>   at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:196)
>   at com.example.TikaOOMExample.main(TikaOOMExample.java:31)
> {code}
> I reproduced the error in this repository:
> [https://github.com/laxika/apache-tika-oom-reproduction]
> Uploaded the PDF into the attachments as well. It can be opened and read by 
> the PDF readers I tried (Edge, Adobe, Chrome).




[jira] [Resolved] (TIKA-3830) Kaspersky identified a file as riskware

2022-08-03 Thread Nick Burch (Jira)



 [ 
https://issues.apache.org/jira/browse/TIKA-3830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Burch resolved TIKA-3830.
--
Resolution: Duplicate

> Kaspersky identified a file as riskware
> ---
>
> Key: TIKA-3830
> URL: https://issues.apache.org/jira/browse/TIKA-3830
> Project: Tika
>  Issue Type: Bug
>  Components: tika-app
>Affects Versions: 2.4.1
> Environment: Windows OS
>Reporter: Haralambos Marmanis
>Priority: Major
>
> NOTE: The issue is with component tika-parsers but that doesn't appear in the 
> dropdown list above. 
> Kaspersky +detected and removed+ the following file: quine.gz
> It is worth mentioning that this file (quine.gz) isn’t malware-related but 
> has instead been categorized as +Risk Ware+ (it infinitely decompresses 
> itself).
> File Path: 
> C:\Code\tika-2.4.1\tika-parsers\tika-parsers-standard\tika-parsers-standard-modules\tika-parser-pkg-module\src\test\resources\test-documents\
>  




[jira] [Commented] (TIKA-3829) java.lang.IllegalArgumentException: The document is really a XLS file exception while parsing doc file

2022-08-03 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17574656#comment-17574656
 ] 

Nick Burch commented on TIKA-3829:
--

Can you share a file that triggers this bug?

The method in question should only process the summary stream if it exists, so 
something very odd is going on here

> java.lang.IllegalArgumentException: The document is really a XLS file 
> exception while parsing doc file
> --
>
> Key: TIKA-3829
> URL: https://issues.apache.org/jira/browse/TIKA-3829
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.23
>Reporter: Dhanabal
>Priority: Major
>
> Getting the following exception while parsing a doc file:
> WARN  Ignoring unexpected exception while parsing summary entry DocumentSummaryInformation
> java.lang.IllegalArgumentException: The document is really a XLS file
>     at org.apache.poi.poifs.filesystem.DirectoryNode.getEntry(DirectoryNode.java:322)
>     at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaryEntryIfExists(SummaryExtractor.java:82)
>     at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaries(SummaryExtractor.java:74)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:155)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:131)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
>  
> What is the meaning of this exception? When will it be thrown?




[jira] [Commented] (TIKA-3814) Extracted text from HTML file does not exclude newline chars from body

2022-07-14 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566991#comment-17566991
 ] 

Nick Burch commented on TIKA-3814:
--

I have a feeling that the Text content handler might rely on these coming 
through in the character stream to nicely-ish format the text output?

I do agree that a custom content handler that tracks whether it's inside one 
of the "no breaks wanted" tags, and skips newlines in the character stream if 
so, is likely to be the best solution here
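
A rough sketch of such a handler, using only the SAX classes that ship with 
the JDK rather than Tika's own decorator classes (an assumption made to keep 
it dependency-free). The class name and the set of "no breaks wanted" tags are 
hypothetical and are supplied by the caller. It replaces line breaks with a 
space rather than dropping them outright, so that words on either side of a 
break do not run together.

```java
import java.util.Set;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical decorator: forwards every SAX event to the wrapped handler,
// rewriting line breaks only while inside one of the configured elements.
public class NewlineSkippingHandler extends XMLFilterImpl {
    private final Set<String> noBreakTags;
    private int depth = 0; // number of open "no breaks wanted" elements

    public NewlineSkippingHandler(ContentHandler delegate, Set<String> noBreakTags) {
        setContentHandler(delegate);
        this.noBreakTags = noBreakTags;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if (noBreakTags.contains(localName)) {
            depth++;
        }
        super.startElement(uri, localName, qName, atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        super.endElement(uri, localName, qName);
        if (depth > 0 && noBreakTags.contains(localName)) {
            depth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (depth == 0) {
            super.characters(ch, start, length); // outside: pass through untouched
            return;
        }
        char[] out = new char[length];
        for (int i = 0; i < length; i++) {
            char c = ch[start + i];
            out[i] = (c == '\n' || c == '\r') ? ' ' : c;
        }
        super.characters(out, 0, length);
    }
}
```

In use, the handler would wrap whatever text handler is passed to the parser, 
for example a BodyContentHandler, with the tag set chosen to match the inline 
elements that matter for the documents in question.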

> Extracted text from HTML file does not exclude newline chars from body
> --
>
> Key: TIKA-3814
> URL: https://issues.apache.org/jira/browse/TIKA-3814
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.3.0
>Reporter: Sai Konuri
>Priority: Minor
> Attachments: bug.html, image-2022-07-06-19-08-30-437.png, 
> image-2022-07-06-19-09-54-534.png
>
>
> When there is a newline character ('\n') within the text of a 
> ,,, etc, the text that is extracted is not excluding those 
> newlines. 
> A sample html file is attached.
>  
> {*}Expected{*}:
> !image-2022-07-06-19-08-30-437.png!
>  
> {*}Actual{*}: 
> !image-2022-07-06-19-09-54-534.png!
>  
>  
> This is the code I am using to extract the text of the HTML file: 
> {code:java}
> AutoDetectParser parser = new AutoDetectParser();
> BodyContentHandler handler = new BodyContentHandler();
> Metadata metadata = new Metadata();
> try (InputStream stream = 
> this.getClass().getClassLoader().getResourceAsStream("bug.html")) {
> parser.parse(stream, handler, metadata);
> System.out.println(handler);
> } {code}




[jira] [Updated] (TIKA-3814) Extracted text from HTML file does not exclude newline chars from body

2022-07-11 Thread Nick Burch (Jira)



 [ 
https://issues.apache.org/jira/browse/TIKA-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Burch updated TIKA-3814:
-
Priority: Trivial  (was: Blocker)

> Extracted text from HTML file does not exclude newline chars from body
> --
>
> Key: TIKA-3814
> URL: https://issues.apache.org/jira/browse/TIKA-3814
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.3.0
>Reporter: Sai Konuri
>Priority: Trivial
> Attachments: bug.html, image-2022-07-06-19-08-30-437.png, 
> image-2022-07-06-19-09-54-534.png
>
>
> When there is a newline character ('\n') within the text of a 
> ,,, etc, the text that is extracted is not excluding those 
> newlines. 
> A sample html file is attached.
>  
> {*}Expected{*}:
> !image-2022-07-06-19-08-30-437.png!
>  
> {*}Actual{*}: 
> !image-2022-07-06-19-09-54-534.png!
>  
>  
> This is the code I am using to extract the text of the HTML file: 
> {code:java}
> AutoDetectParser parser = new AutoDetectParser();
> BodyContentHandler handler = new BodyContentHandler();
> Metadata metadata = new Metadata();
> try (InputStream stream = 
> this.getClass().getClassLoader().getResourceAsStream("bug.html")) {
> parser.parse(stream, handler, metadata);
> System.out.println(handler);
> } {code}




[jira] [Commented] (TIKA-3811) Exclude NameDetector not working for Tika.detect(file)

2022-07-05 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562599#comment-17562599
 ] 

Nick Burch commented on TIKA-3811:
--

Maybe [~tallison] has an idea on the config part, he's been working on that 
area lately...

> Exclude NameDetector not working for Tika.detect(file)
> --
>
> Key: TIKA-3811
> URL: https://issues.apache.org/jira/browse/TIKA-3811
> Project: Tika
>  Issue Type: Bug
>  Components: config, core, detector
>Affects Versions: 2.3.0
>Reporter: Giorgiana Ciobanu
>Priority: Major
> Attachments: invalid_format.vtt, tika-config_test.xml
>
>
> I need to detect the MIME type of a file but, for security reasons, I want to 
> exclude detection by file name extension. 
> I added a tika-config_test.xml (see attached) to my unit test but it still 
> detects the file type by file name extension.
> I attached a test file that is wrongly detected as text/vtt because of the 
> file extension; it should be text/plain in this case.
>  
> The code of my unit test:
> {code:java}
> File file = new 
> File(getClass().getClassLoader().getResource("invalid_format.vtt").getFile());
> TikaConfig tikaConfig = new TikaConfig(this.getClass()
> .getClassLoader()
> .getResourceAsStream("tika-config_test.xml"));
>  
> // returns text/vtt but should be text/plain
> String mimeType = new Tika(tikaConfig).detect(file); 
> {code}
>  




[jira] [Commented] (TIKA-3811) Exclude NameDetector not working for Tika.detect(file)

2022-07-05 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562537#comment-17562537
 ] 

Nick Burch commented on TIKA-3811:
--

You should not be using Apache Tika's detection for anything security related. 
We do not protect against people maliciously adding mime magic near the start 
of the file which still allows the underlying file to be processed by the 
correct application. We err on the side of giving a best-guess answer.

For the "what is this probably" case, Tika is great. For the "what parser is 
most likely to manage to get text out" case, Tika is great. For "what is this 
for certain even if it is malicious" you need a different tool for your 
detection.

See also 
[https://cwiki.apache.org/confluence/display/TIKA/The+Robustness+of+Apache+Tika]
 for advice on running Tika with untrusted input





[jira] [Resolved] (TIKA-3810) Vtt file (encoding UTF-8 with BOM) seen as text/plain

2022-07-05 Thread Nick Burch (Jira)



 [ 
https://issues.apache.org/jira/browse/TIKA-3810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Burch resolved TIKA-3810.
--
Fix Version/s: 2.4.2
   Resolution: Fixed

> Vtt file (encoding UTF-8 with BOM) seen as text/plain
> -
>
> Key: TIKA-3810
> URL: https://issues.apache.org/jira/browse/TIKA-3810
> Project: Tika
>  Issue Type: Bug
>  Components: core, detector, mime
>Affects Versions: 2.3.0
>Reporter: Giorgiana Ciobanu
>Priority: Major
> Fix For: 2.4.2
>
> Attachments: s5_windowEncoding_validFormat.vtt
>
>
> A VTT file created on Windows (UTF-8 {+}with BOM{+}) is incorrectly detected 
> as _text/plain_ when it should be _text/vtt_.
> The application that uses Tika, and to which the file is uploaded for MIME 
> type detection, runs on a Unix machine. 
> The VTT file is passed as an InputStream to Tika's default detector (we 
> don't want to detect the MIME type by file extension).
> Please find attached the VTT file that Tika detects as text/plain.




[jira] [Commented] (TIKA-3810) Vtt file (encoding UTF-8 with BOM) seen as text/plain

2022-07-05 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562532#comment-17562532
 ] 

Nick Burch commented on TIKA-3810:
--

Looks like we had detection magic for the UTF-16 BOM variants but not the UTF-8 
one. Fixed in 9d928bbf9
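
To illustrate what BOM-aware magic means here, below is a small sketch that 
accepts the WEBVTT signature either at offset 0 or directly after a UTF-8 byte 
order mark (bytes EF BB BF). This is not how Tika implements it (Tika's magic 
is declared in its mime types XML); the class and method names are 
hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of "magic, optionally preceded by a UTF-8 BOM".
public class VttMagicSketch {
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};
    private static final byte[] WEBVTT = "WEBVTT".getBytes(StandardCharsets.US_ASCII);

    /** True if the prefix starts with WEBVTT, with or without a leading UTF-8 BOM. */
    public static boolean looksLikeVtt(byte[] prefix) {
        int offset = startsWith(prefix, 0, UTF8_BOM) ? UTF8_BOM.length : 0;
        return startsWith(prefix, offset, WEBVTT);
    }

    // Compares magic against data starting at offset, with bounds checking.
    private static boolean startsWith(byte[] data, int offset, byte[] magic) {
        if (data.length - offset < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[offset + i] != magic[i]) {
                return false;
            }
        }
        return true;
    }
}
```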

> Vtt file (encoding UTF-8 with BOM) seen as text/plain
> -
>
> Key: TIKA-3810
> URL: https://issues.apache.org/jira/browse/TIKA-3810
> Project: Tika
>  Issue Type: Bug
>  Components: core, detector, mime
>Affects Versions: 2.3.0
>Reporter: Giorgiana Ciobanu
>Priority: Major
> Attachments: s5_windowEncoding_validFormat.vtt
>
>
> A VTT file created on Windows (UTF-8 {+}with BOM{+}) is incorrectly detected 
> as _text/plain_ when it should be _text/vtt_.
> The application that uses Tika, and to which the file is uploaded for MIME 
> type detection, runs on a Unix machine. 
> The VTT file is passed as an InputStream to Tika's default detector (we 
> don't want to detect the MIME type by file extension).
> Please find attached the VTT file that Tika detects as text/plain.




[jira] [Commented] (TIKA-3809) OutOfMemoryError occurs while reading doc file

2022-07-05 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562484#comment-17562484
 ] 

Nick Burch commented on TIKA-3809:
--

If the uncompressed XML is 250 MB, then you're going to need a heap a lot 
bigger than 750 MB (3x the uncompressed size) if you want to use the DOM-based 
parsers. I'd try with about 3 GB (so a bit over 10x) and be prepared to go up 
to about 20x the uncompressed size for your heap
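
As a worked version of that arithmetic, the sketch below turns an uncompressed 
XML size into candidate heap settings at the 3x, 10x and 20x multipliers. The 
multipliers are the rule-of-thumb figures from this comment, not measured 
values, and the class name is hypothetical.

```java
// Hypothetical helper for the rule-of-thumb heap sizing described above.
public class DomHeapEstimate {

    /** Heap size in MB for a given uncompressed XML size (MB) and multiplier. */
    public static long heapMb(long uncompressedMb, int multiplier) {
        return uncompressedMb * multiplier;
    }

    public static void main(String[] args) {
        long xmlMb = 250; // uncompressed size reported in the issue
        for (int multiplier : new int[] {3, 10, 20}) {
            System.out.printf("%dx -> -Xmx%dm%n", multiplier, heapMb(xmlMb, multiplier));
        }
    }
}
```

For the 250 MB file in this issue, that gives 750 MB, 2500 MB and 5000 MB 
respectively, which is why a 750 MB heap sits at the very bottom of the range.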

> OutOfMemoryError occurs while reading doc file
> --
>
> Key: TIKA-3809
> URL: https://issues.apache.org/jira/browse/TIKA-3809
> Project: Tika
>  Issue Type: Bug
>  Components: app
>Affects Versions: 1.23
>Reporter: earl
>Priority: Blocker
>
> OutOfMemoryError occurs while parsing a docx file of size 8 MB (uncompressed 
> size 250 MB). While analyzing the heap dump (.hprof), the thread that parses 
> the file consumes about 750 MB of heap. While looking into the 
> dominator tree, 
> {code:java}
> org.apache.xmlbeans.impl.store.Xobj$ElementXobj
> {code}
>  This object has been created many times!
> I've also attached the stacktrace,
> {code:java}
> at 
> org.apache.xmlbeans.impl.store.Cur.createElementXobj(Lorg/apache/xmlbeans/impl/store/Locale;Ljavax/xml/namespace/QName;Ljavax/xml/namespace/QName;)Lorg/apache/xmlbeans/impl/store/Xobj;
>  (Cur.java:260)
>   at 
> org.apache.xmlbeans.impl.store.Cur$CurLoadContext.startElement(Ljavax/xml/namespace/QName;)V
>  (Cur.java:2997)
>   at 
> org.apache.xmlbeans.impl.store.Locale$SaxHandler.startElement(Ljava/lang/String;Ljava/lang/String;Ljava/lang/String;Lorg/xml/sax/Attributes;)V
>  (Locale.java:3164)
>   at 
> org.apache.xerces.parsers.AbstractSAXParser.startElement(Lorg/apache/xerces/xni/QName;Lorg/apache/xerces/xni/XMLAttributes;Lorg/apache/xerces/xni/Augmentations;)V
>  (Unknown Source)
>   at 
> org.apache.xerces.parsers.AbstractXMLDocumentParser.emptyElement(Lorg/apache/xerces/xni/QName;Lorg/apache/xerces/xni/XMLAttributes;Lorg/apache/xerces/xni/Augmentations;)V
>  (Unknown Source)
>   at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanStartElement()Z 
> (Unknown Source)
>   at 
> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Z)Z
>  (Unknown Source)
>   at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Z)Z 
> (Unknown Source)
>   at org.apache.xerces.parsers.XML11Configuration.parse(Z)Z (Unknown Source)
>   at 
> org.apache.xerces.parsers.XML11Configuration.parse(Lorg/apache/xerces/xni/parser/XMLInputSource;)V
>  (Unknown Source)
>   at 
> org.apache.xerces.parsers.XMLParser.parse(Lorg/apache/xerces/xni/parser/XMLInputSource;)V
>  (Unknown Source)
>   at 
> org.apache.xerces.parsers.AbstractSAXParser.parse(Lorg/xml/sax/InputSource;)V 
> (Unknown Source)
>   at 
> org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Lorg/xml/sax/InputSource;)V
>  (Unknown Source)
>   at 
> org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Lorg/apache/xmlbeans/impl/store/Locale;Lorg/xml/sax/InputSource;Lorg/apache/xmlbeans/XmlOptions;)Lorg/apache/xmlbeans/impl/store/Cur;
>  (Locale.java:3422)
>   at 
> org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Ljava/io/InputStream;Lorg/apache/xmlbeans/SchemaType;Lorg/apache/xmlbeans/XmlOptions;)Lorg/apache/xmlbeans/XmlObject;
>  (Locale.java:1272)
>   at 
> org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Lorg/apache/xmlbeans/SchemaTypeLoader;Ljava/io/InputStream;Lorg/apache/xmlbeans/SchemaType;Lorg/apache/xmlbeans/XmlOptions;)Lorg/apache/xmlbeans/XmlObject;
>  (Locale.java:1259)
>   at 
> org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(Ljava/io/InputStream;Lorg/apache/xmlbeans/SchemaType;Lorg/apache/xmlbeans/XmlOptions;)Lorg/apache/xmlbeans/XmlObject;
>  (SchemaTypeLoaderBase.java:345)
>   at 
> org.openxmlformats.schemas.wordprocessingml.x2006.main.DocumentDocument$Factory.parse(Ljava/io/InputStream;Lorg/apache/xmlbeans/XmlOptions;)Lorg/openxmlformats/schemas/wordprocessingml/x2006/main/DocumentDocument;
>  (Unknown Source)
>   at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead()V 
> (XWPFDocument.java:178)
>   at 
> org.apache.poi.ooxml.POIXMLDocument.load(Lorg/apache/poi/ooxml/POIXMLFactory;)V
>  (POIXMLDocument.java:184)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.(Lorg/apache/poi/openxml4j/opc/OPCPackage;)V
>  (XWPFDocument.java:138)
>   at 
> org.apache.poi.xwpf.extractor.XWPFWordExtractor.(Lorg/apache/poi/openxml4j/opc/OPCPackage;)V
>  (XWPFWordExtractor.java:60)
>   at 
> org.apache.poi.ooxml.extractor.ExtractorFactory.createExtractor(Lorg/apache/poi/openxml4j/opc/OPCPackage;)Lorg/apache/poi/extractor/POITextExtractor;
>  (ExtractorFactory.java:224)
>   at 
>

[jira] [Commented] (TIKA-3798) Tika hangs up with some RAR archives

2022-06-22 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17557343#comment-17557343
 ] 

Nick Burch commented on TIKA-3798:
--

With no file, no thread dump and no stack trace, it won't be easy to find the 
relevant code in Tika that isn't behaving properly. As everyone working on Tika 
is a volunteer, you're probably going to have to help us a bit more...

Can you talk your client through taking a Java thread dump and get them to 
share it? Can you get the file, run it yourself through Tika to trigger the 
issue and take a thread dump? Can you share the file privately with one of us?

> Tika hangs up with some RAR archives
> 
>
> Key: TIKA-3798
> URL: https://issues.apache.org/jira/browse/TIKA-3798
> Project: Tika
>  Issue Type: Bug
> Environment: Windows, Tika 2.4.0
>Reporter: Mikhail Gushinets
>Priority: Major
> Attachments: MicrosoftTeams-image.png
>
>
> Passing a RAR archive to Tika might cause it to hang.
> When trying to unrar this file manually I get this message: "Checksum is not 
> calculated right of file as there might be a change of the metadata"
> I understand that the probable reason is some kind of file corruption here, 
> but it would be nice if Tika would just throw an exception in such a case 
> rather than hanging forever.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

[jira] [Commented] (TIKA-3798) Tika hangs up with some RAR archives

2022-06-22 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17557319#comment-17557319
 ] 

Nick Burch commented on TIKA-3798:
--

Do you have a sample file that shows the problem? A thread dump showing the 
place that Tika gets stuck? Suggestions on how we can reproduce your issue?

> Tika hangs up with some RAR archives
> 
>
> Key: TIKA-3798
> URL: https://issues.apache.org/jira/browse/TIKA-3798
> Project: Tika
>  Issue Type: Bug
> Environment: Windows, Tika 2.4.0
>Reporter: Mikhail Gushinets
>Priority: Major
>
> Passing a RAR archive to Tika might cause it to hang.
> When trying to unrar this file manually I get this message: "Checksum is not 
> calculated right of file as there might be a change of the metadata"
> I understand that the probable reason is some kind of file corruption here, 
> but it would be nice if Tika would just throw an exception in such a case 
> rather than hanging forever.




[jira] [Commented] (TIKA-3768) message/rfc822 does not include Headers in extracted text

2022-06-09 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552078#comment-17552078
 ] 

Nick Burch commented on TIKA-3768:
--

If we can put something into a properly typed + structured metadata field, we 
will!

The full list of metadata property definitions is spread across the interfaces 
in 
[https://tika.apache.org/2.4.0/api/org/apache/tika/metadata/package-summary.html]
 grouped by type. Wherever possible we re-use existing well known definitions

While we always store the metadata values as strings, the definition properties 
will help you turn it back into the underlying java types, eg get the date back 
as a java Date
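
To illustrate the idea of "stored as a string, read back through a typed definition", here is a minimal plain-Java sketch. It is a stand-in only: {{TypedKey}}, the key names and the {{get}} helper are made up for this example and are not Tika classes. In Tika itself the equivalent would be along the lines of {{metadata.getDate(TikaCoreProperties.CREATED)}}.

```java
import java.time.Instant;
import java.util.Date;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

class TypedMetadataSketch {
    /** Hypothetical stand-in for a typed property definition: a key name plus a converter. */
    static final class TypedKey<T> {
        final String name;
        final Function<String, T> fromString;

        TypedKey(String name, Function<String, T> fromString) {
            this.name = name;
            this.fromString = fromString;
        }
    }

    // Illustrative keys only; the real definitions live in the metadata package.
    static final TypedKey<Date> CREATED =
            new TypedKey<>("dcterms:created", s -> Date.from(Instant.parse(s)));
    static final TypedKey<Integer> PAGE_COUNT =
            new TypedKey<>("page-count", Integer::valueOf);

    /** Values are held as strings; the key's converter turns them back into Java types. */
    static <T> T get(Map<String, String> metadata, TypedKey<T> key) {
        String raw = metadata.get(key.name);
        return raw == null ? null : key.fromString.apply(raw);
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        metadata.put("dcterms:created", "2022-06-09T10:15:30Z");
        metadata.put("page-count", "12");
        Date created = get(metadata, CREATED);
        Integer pages = get(metadata, PAGE_COUNT);
        System.out.println(created.getTime() + " " + pages);
    }
}
```

The point of the design is that callers never parse the raw string themselves; the definition carries the type.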

> message/rfc822 does not include Headers in extracted text
> -
>
> Key: TIKA-3768
> URL: https://issues.apache.org/jira/browse/TIKA-3768
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.4.0
>Reporter: Sam Stephens
>Priority: Major
> Attachments: email.txt
>
>
> When running AutoDetectParser on message/rfc822 structured text documents, 
> such as the attached [^email.txt], the extracted text does not include any of 
> the headers, such as the Subject and From and To lines.
> However these lines contain useful text I'd like to be able to extract. I'm 
> surprised it's not there based on the include everything bias I saw on 
> https://issues.apache.org/jira/browse/TIKA-3710.
> Interestingly, if I exclude org.apache.tika.parser.mail.RFC822Parser as a 
> parser, my debugging appears to show 
> org.apache.tika.parser.csv.TextAndCSVParser being used for parsing, and we 
> get the full text, but the returned content type is 'message/rfc822; 
> charset=windows-1252'.




[jira] [Commented] (TIKA-3784) Detector returns "application/x-x509-key" when scanning a .p12 file

2022-06-05 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17550223#comment-17550223
 ] 

Nick Burch commented on TIKA-3784:
--

We don't currently have any Mime Magic for PKCS12 files

Based on 
[https://stackoverflow.com/questions/33239875/jks-bks-and-pkcs12-file-formats] 
it won't be an easy one to cope with, since we don't currently have an ASN.1 
container detector

I think we can potentially get away with a slightly hacky approach similar to 
the PKCS7 signature, where we look for a few variants and hope the right entry 
comes first... "openssl asn1parse" should help with working out what to look for

(Assuming no-one has a bit of time to knock up an ASN1 container detector based 
on the BouncyCastle ASN.1 using an approach similar to 
[https://stackoverflow.com/questions/10190795/parsing-asn-1-binary-data-with-java]
 )
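
For what it's worth, a sketch of the sort of thing the hacky approach might look for. It assumes the usual PKCS12 layout, an outer ASN.1 SEQUENCE whose first element is INTEGER 3 (the PFX version), and just skips over the length bytes. It is not a general ASN.1 detector, it is not what tika-mimetypes.xml does today, and the class and method names are made up for the example.

```java
class Pkcs12SniffSketch {
    /**
     * Returns true if the bytes start with SEQUENCE, a BER/DER length,
     * and then INTEGER 3 (02 01 03). Other ASN.1 structures that happen
     * to start with a version of 3 would also match, so treat it as a hint.
     */
    static boolean looksLikePkcs12(byte[] b) {
        if (b.length < 5 || (b[0] & 0xFF) != 0x30) {
            return false; // not a SEQUENCE tag
        }
        int lenByte = b[1] & 0xFF;
        int pos;
        if (lenByte < 0x80) {
            pos = 2; // short-form length
        } else {
            int lengthBytes = lenByte & 0x7F; // 0 means indefinite length (BER)
            if (lengthBytes > 4) {
                return false;
            }
            pos = 2 + lengthBytes;
        }
        if (b.length < pos + 3) {
            return false;
        }
        return b[pos] == 0x02 && b[pos + 1] == 0x01 && b[pos + 2] == 0x03;
    }

    public static void main(String[] args) {
        byte[] p12Start = {0x30, (byte) 0x82, 0x0A, 0x2C, 0x02, 0x01, 0x03, 0x30};
        byte[] certStart = {0x30, (byte) 0x82, 0x03, 0x55, 0x30, (byte) 0x82, 0x02, 0x3D};
        System.out.println(looksLikePkcs12(p12Start));
        System.out.println(looksLikePkcs12(certStart));
    }
}
```

Expressing this as mime magic would need one entry per length encoding, which is where the "few variants" come from.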

> Detector returns "application/x-x509-key" when scanning a .p12 file
> ---
>
> Key: TIKA-3784
> URL: https://issues.apache.org/jira/browse/TIKA-3784
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 1.26
>Reporter: Matthias Hofbauer
>Priority: Critical
>
> We are using tika to check if the MIME type of the file extensions matches 
> with the MIME type of the file content.
> After our upgrade from tika-core 1.22 to 1.26 our logic does not work anymore 
> for certificates of type .p12, .pfx, .cer, .der.
> For the .p12 and .pfx extension the MIME type is "application/x-pkcs12" but 
> the tika detector returns "application/x-x509-key" instead.
> After checking the tika-mimetype.xml and comparing it to my .p12 file I found 
> the following MIME magic which explains why I got these types back.
> {code:xml}
> 
>     
>     
>     
>     
>     
>                      mask="0x00FC" offset="0"/>
>                      mask="0xFC" offset="0"/>
>     
>  {code}




[jira] [Commented] (TIKA-3768) message/rfc822 does not include Headers in extracted text

2022-06-05 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17550216#comment-17550216
 ] 

Nick Burch commented on TIKA-3768:
--

I wouldn't expect to find those in the textual content after parsing, those 
fields should be ending up in the Metadata object instead

We have a bunch of unit tests for mail parsing which show, for our test 
files at least, subject + from + to all coming through, see 
[https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/parser/mail/RFC822ParserTest.java]

Are you able to compare your code with that in the unit test, and see any 
differences between the working test and yours? Bonus marks if you can write a 
small failing junit unit test that shows the issue with your file


[jira] [Commented] (TIKA-3771) Regression from TIKA-3687: Files wrongly detected as EML

2022-05-20 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539993#comment-17539993
 ] 

Nick Burch commented on TIKA-3771:
--

The PNG magic is priority 50, which is also what our EML min-match 2 is at. 
That's probably fine for most of them, but \nX- is seemingly too general

I think we probably need to lower the priority on the 0:1024 cases, though I'm 
not sure if we can do that without moving that whole block down?

FWIW your PNG matches because it has a URL followed by a bunch of HTTP response 
headers at the end of it!
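
To make the failure mode concrete, a rough sketch of a min-match rule over the first 1024 bytes. This is a simplification for illustration only, not Tika's magic matcher: the class name, the pattern list and the helper methods are all made up, and the real EML clause has its own set of patterns.

```java
import java.nio.charset.StandardCharsets;

class MinMatchSketch {
    // Illustrative header patterns only.
    static final String[] PATTERNS = {"\nDate:", "\nSubject:", "\nFrom:", "\nX-"};

    /** Counts how many distinct patterns occur within the first window bytes. */
    static int countMatches(byte[] data, int window) {
        int len = Math.min(data.length, window);
        String head = new String(data, 0, len, StandardCharsets.ISO_8859_1);
        int hits = 0;
        for (String pattern : PATTERNS) {
            if (head.contains(pattern)) {
                hits++;
            }
        }
        return hits;
    }

    /** Mimics a "min-match 2" clause: any two patterns are enough. */
    static boolean looksLikeEml(byte[] data) {
        return countMatches(data, 1024) >= 2;
    }

    public static void main(String[] args) {
        // Binary-looking prefix followed by HTTP-style response headers.
        String fake = "\u0089PNG\r\n...\nDate: Fri, 20 May 2022\nX-Cache: HIT\n";
        System.out.println(looksLikeEml(fake.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

With a pattern as short as \nX- in the list, any file with a couple of header-like lines near the start clears the bar, which is why that one looks too general.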

> Regression from TIKA-3687: Files wrongly detected as EML 
> -
>
> Key: TIKA-3771
> URL: https://issues.apache.org/jira/browse/TIKA-3771
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 2.4.0
>Reporter: Luís Filipe Nassif
>Priority: Major
> Attachments: BEA498353ECFA1C440365BB434BBC228269917D7.png
>
>
> Running regression tests in the process of upgrading to Tika-2.4.0 from 1.x, 
> I detected some hundreds of samples from 1M of different file types now are 
> being detected as EML. This is caused by the <match value="\nX-" 
> type="string" offset="0:1024"/> rule added in TIKA-3687 in the 
> minShouldMatch="2" clause. Attached is a sample PNG file that triggers this 
> (it also has another \nDate: value in the first 1024 bytes).
> Another not related thing, I tried to override the message/rfc822 mime 
> definition with a custom-tika-mimetypes.xml in classpath, but it had no 
> effect, it used to work in Tika-1.x. Was that change intentional? I think 
> user definitions should take precedence over Tika definitions, since they can 
> change depending on domain or context (e.g. the same extension may be used by 
> different applications). If it wasn't intentional, I'll open other issue.




[jira] [Commented] (TIKA-3710) HTML document detected incorrect as message/rfc822

2022-05-19 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539594#comment-17539594
 ] 

Nick Burch commented on TIKA-3710:
--

As a "normal" html file wouldn't start with these snippets, and they're already 
at a pretty high priority, I think just leave them in the 60 block along with 
the more typical starting tags we have there now

> HTML document detected incorrect as message/rfc822
> --
>
> Key: TIKA-3710
> URL: https://issues.apache.org/jira/browse/TIKA-3710
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 2.3.0
>Reporter: Sam Stephens
>Priority: Major
> Attachments: html-that-looks-like-rfc822.html
>
>
> I'm detecting content types and extracting text from documents using the 
> AutoDetectParser.
> I've received some documents that are HTML fragments generated from emails. 
> The documents are clearly HTML, not emails, but the AutoDetectParser gives me 
> the MIME type message/rfc822 and no text. I've attached an example.
> It looks like the presence of From:, Sent:, and Subject: at the beginning of 
> lines is why the documents are matching RFC822. However, I believe the 
> presence of HTML before these headers means the document is not valid RFC822.




[jira] [Commented] (TIKA-3710) HTML document detected incorrect as message/rfc822

2022-05-19 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539582#comment-17539582
 ] 

Nick Burch commented on TIKA-3710:
--

I was thinking we'd do (open)h1(close) or (open)h1(space) to cover both HTML 
cases but reduce the chances of a false positive match (+h2/h3)
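
As a sketch of what that narrower check amounts to (plain Java for illustration, with made-up names; the real change would be match entries in tika-mimetypes.xml rather than code):

```java
import java.util.Locale;

class HeadingPrefixSketch {
    /**
     * True when the content starts with an h1/h2/h3 tag that is either
     * closed straight away or followed by a space (ie attributes),
     * but not for longer tag names that merely share the prefix.
     */
    static boolean startsWithHeadingTag(String content) {
        if (content.length() < 4) {
            return false;
        }
        String head = content.substring(0, 4).toLowerCase(Locale.ROOT);
        for (char level = '1'; level <= '3'; level++) {
            if (head.equals("<h" + level + ">") || head.equals("<h" + level + " ")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(startsWithHeadingTag("<h1>From: someone</h1>"));
        System.out.println(startsWithHeadingTag("<h1 class=\"x\">Title</h1>"));
        System.out.println(startsWithHeadingTag("<h1x>not a heading"));
    }
}
```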


[jira] [Commented] (TIKA-3710) HTML document detected incorrect as message/rfc822

2022-05-18 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538896#comment-17538896
 ] 

Nick Burch commented on TIKA-3710:
--

The h1 isn't quite as unique as we might like, and maybe not as good as some of 
the other ones

How about changing that to (open)h1(close) or (open)h1(space)?

[jira] [Commented] (TIKA-3571) Add an interface for rendering engines

2022-04-29 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529977#comment-17529977
 ] 

Nick Burch commented on TIKA-3571:
--

Some formats support the concept of pages and we can pass that along (eg pdf, 
ppt), some don't store page related info in the file format so we can't no 
matter how much people might like us to (eg doc, rtf), and some don't have any 
real concept of a page / are only ever single page (eg jpg, mp3). Potentially 
also the category of ones which don't normally have a concept of a page until 
you try to print (eg xls, ods, CAD formats)

Paged formats are a bit of a special case, but in some systems also a common 
one!

> Add an interface for rendering engines
> --
>
> Key: TIKA-3571
> URL: https://issues.apache.org/jira/browse/TIKA-3571
> Project: Tika
>  Issue Type: Wish
>Reporter: Tim Allison
>Priority: Major
>
> We've now seen a few requests for extracting text _and_ rendering PDFs, and 
> certainly it might be useful to have alternatives for rendering files (e.g. 
> this [Alfresco 
> study|https://hub.alfresco.com/t5/alfresco-content-services-blog/pdf-rendering-engine-performance-and-fidelity-comparison/ba-p/287618]),
>  including MSOffice or at least PPTx...
> And there are cases where users don't want the rendered images, but they do 
> want OCR to be run against the rendered images.
> I doubt I'll have a chance to work on this for a while, but I wanted to open 
> an issue for discussion.




[jira] [Commented] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-04-29 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529918#comment-17529918
 ] 

Nick Burch commented on TIKA-3742:
--

Sure! Potentially easiest is if you create your own fork of Tika on Github, 
create a branch, and work on that. You can then share that branch with us to 
review, feedback on etc. When it's all working, you can then create a pull 
request for us to merge straight into Tika!

> Advice around DGN7 parser and whether to add to TIKA
> 
>
> Key: TIKA-3742
> URL: https://issues.apache.org/jira/browse/TIKA-3742
> Project: Tika
>  Issue Type: Task
>  Components: parser
>Reporter: Dan Coldrick
>Priority: Minor
> Attachments: DGN.zip, ExampleOutput.txt
>
>
> Hi [~tallison] & Whoever else. 
> I managed to compile the C/C++ library [http://dgnlib.maptools.org/]  for 
> DGN7 which produces a dgndump.exe which will dump all the data from the DGN. 
> From my initial testing it looks pretty good. 
> Would you guys think it was worth adding this or just keep it as a custom 
> parser rather than in the main source code? It's under MIT license. I've 
> attached the exe (zipped), a copy of the output from the dump and my very 
> dirty testing calling the exe (my code I was only interested in the Strings 
> so am only pulling those into a string array at the moment to check it's 
> pulling out the correct data).




[jira] [Commented] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-04-28 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529417#comment-17529417
 ] 

Nick Burch commented on TIKA-3742:
--

I believe {{readNBytes}} only came in with Java 9, and the particular 
{{readNBytes(int)}} in Java 11, so you'll need to use a newer JVM. Should be 
able to replace it with Commons IO calls once we're happy with the general 
logic + approach
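
For anyone stuck on an older JVM in the meantime, a minimal sketch of a stand-in for {{readNBytes(int)}} that only uses {{InputStream.read}}, which has been there since the beginning. The class and method names are made up for illustration, and this is not the code from the prototype being discussed.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

class ReadNBytesSketch {
    /**
     * Reads up to len bytes, looping because a single read() call may
     * return fewer bytes than requested. Returns a shorter array if the
     * stream ends early.
     */
    static byte[] readUpTo(InputStream in, int len) throws IOException {
        if (len < 0) {
            throw new IllegalArgumentException("len < 0");
        }
        byte[] buf = new byte[len];
        int total = 0;
        while (total < len) {
            int n = in.read(buf, total, len - total);
            if (n < 0) {
                break; // end of stream
            }
            total += n;
        }
        return total == len ? buf : Arrays.copyOf(buf, total);
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(new byte[] {1, 2, 3, 4, 5});
        System.out.println(readUpTo(in, 3).length);
        System.out.println(readUpTo(in, 10).length);
    }
}
```

Commons IO has helpers along the same lines, which is what a final version would more likely use.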


[jira] [Commented] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-04-27 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529101#comment-17529101
 ] 

Nick Burch commented on TIKA-3742:
--

Assuming we just want type=17 text elements of a DGNv7 file (as per 
[http://dgnlib.maptools.org/dgn.html#type17] ) then a quick'n'dirty parser 
wouldn't be too bad. 
[https://gist.github.com/Gagravarr/90d390fec7c5f2c5cf966c0eedccac5c] is a basic 
reader that finds these text elements and prints them

Couldn't immediately spot any useful metadata elements to pull out, so I think 
a basic parser would just be the text for DGN7

Anyone fancy finishing this off into a "proper" Tika parser? :)


[jira] [Commented] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-04-27 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529038#comment-17529038
 ] 

Nick Burch commented on TIKA-3742:
--

In theory you shouldn't need any java code at all if you don't want, just an 
xml file with a magic well-known name

We've a couple already in Tika, mostly focused on metadata:

[https://github.com/apache/tika/blob/main/tika-core/src/main/resources/org/apache/tika/parser/external/tika-external-parsers.xml]

Pop your own one on the classpath and it should be picked up dynamically at 
runtime


[jira] [Commented] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-04-27 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529029#comment-17529029
 ] 

Nick Burch commented on TIKA-3742:
--

If it can just be run standalone, then {{ExternalParser}} + 
{{tika-external-parsers.xml}} is probably the way to go - that already handles 
testing if the program is installed, spawning it, cleaning up, grabbing text etc
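
Roughly what that saves you from writing by hand, as a plain-Java sketch: check the program can be started, spawn it, grab its output. This only illustrates the idea and is not how {{ExternalParser}} is implemented; the class and method names are invented, and the dgndump command line in {{main}} is an assumption.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

class ExternalToolSketch {
    /** Returns false if the program cannot be started at all (eg not installed). */
    static boolean canStart(List<String> command) {
        try {
            run(command);
            return true;
        } catch (IOException e) {
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Spawns the command, merges stderr into stdout, and returns the output as text. */
    static String run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).redirectErrorStream(true).start();
        String output = drain(process.getInputStream());
        process.waitFor();
        return output;
    }

    private static String drain(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical invocation; the real arguments depend on the tool.
        List<String> command = Arrays.asList("dgndump", args.length > 0 ? args[0] : "sample.dgn");
        if (canStart(command)) {
            System.out.println(run(command));
        } else {
            System.out.println("dgndump not available");
        }
    }
}
```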


[jira] [Commented] (TIKA-3731) Tika CAD DWG reader not pulling meta data from new cad files

2022-04-26 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17528157#comment-17528157
 ] 

Nick Burch commented on TIKA-3731:
--

We already do a prefix for several other formats for custom metadata keys, so 
makes sense to me

> Tika CAD DWG reader not pulling meta data from new cad files
> 
>
> Key: TIKA-3731
> URL: https://issues.apache.org/jira/browse/TIKA-3731
> Project: Tika
>  Issue Type: Improvement
>  Components: metadata
>Affects Versions: 2.3.0
>Reporter: Dan Coldrick
>Priority: Minor
> Fix For: 2.4.0
>
> Attachments: AutoCAD 2018 format (1).dwg, testDWG-AC1027.dwg
>
>
>  
> The tika DWG reader is only pulling meta data from up to drawing format 
> AC1024  (see code snippet) where it looks to be AC1027 & AC1032 can also be 
> read from the same get2007and2010Props meta data extractor.
> {code:java}
>  switch (version) {
>             case "AC1015":
>                 metadata.set(Metadata.CONTENT_TYPE, TYPE.toString());
>                 if (skipTo2000PropertyInfoSection(stream, header)) {
>                     get2000Props(stream, metadata, xhtml);
>                 }
>                 break;
>             case "AC1018":
>                 metadata.set(Metadata.CONTENT_TYPE, TYPE.toString());
>                 if (skipToPropertyInfoSection(stream, header)) {
>                     get2004Props(stream, metadata, xhtml);
>                 }
>                 break;
>             case "AC1021":
>             case "AC1024":
>                 metadata.set(Metadata.CONTENT_TYPE, TYPE.toString());
>                 if (skipToPropertyInfoSection(stream, header)) {
>                     get2007and2010Props(stream, metadata, xhtml);
>                 }
>                 break;
>             default:
>                 throw new TikaException("Unsupported AutoCAD drawing version: 
> " + version);
>         } {code}
> Looks like the case statement just needs extending and for examples files to 
> be created for AC1027/AC1032. 
> Current versions of auto cad can be found here:
> https://knowledge.autodesk.com/support/autocad/learn-explore/caas/sfdcarticles/sfdcarticles/drawing-version-codes-for-autocad.html
>  




[jira] [Commented] (TIKA-3719) Tika Server Ability to Run HTTPs

2022-04-24 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17527158#comment-17527158
 ] 

Nick Burch commented on TIKA-3719:
--

Linux and Mac will need quotes around arguments containing spaces. As would 
Windows in the WSL subsystem

> Tika Server Ability to Run HTTPs
> 
>
> Key: TIKA-3719
> URL: https://issues.apache.org/jira/browse/TIKA-3719
> Project: Tika
>  Issue Type: Wish
>  Components: tika-server
>Affects Versions: 2.3.0
>Reporter: Dan Coldrick
>Assignee: Tim Allison
>Priority: Minor
> Fix For: 2.4.0
>
> Attachments: image-2022-04-21-18-52-50-706.png, localhost.jks
>
>
> We need the ability to run TIKA server as a https end point, I can't see 
> anything in the config that allows for this. 
> Looks like I'm not the only one:
> [https://stackoverflow.com/questions/7031/apache-tika-convert-apache-tika-server-rest-endpointsjax-rs-http-to-https]
>  
> If anyone can point to some documentation on how it might be possible it 
> would be really appreciated.
>  
> Thanks




[jira] [Commented] (TIKA-3721) DGN parser

2022-04-23 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17526776#comment-17526776
 ] 

Nick Burch commented on TIKA-3721:
--

We already have a few file types which we send to {{OfficeParser}} only for 
common metadata, no content. Project is one such format. As it's better than 
nothing, could always do that for DGN v8 files?

{{SummaryExtractor}} already supports custom properties with the 
{{Office.USER_DEFINED_METADATA_NAME_PREFIX}} prefix so I'd expect those to come 
through if you called OfficeParser (assuming they didn't do something odd and 
put their custom properties in one of the standard streams rather than the 
custom properties stream)

> DGN parser
> --
>
> Key: TIKA-3721
> URL: https://issues.apache.org/jira/browse/TIKA-3721
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 2.3.0
>Reporter: Dan Coldrick
>Priority: Minor
> Attachments: Screenshot from 2022-04-22 16-03-44.png, 
> dgn8s-dumped.txt, image-2022-04-22-20-00-45-704.png, 
> image-2022-04-22-20-01-09-564.png, image-2022-04-22-20-02-24-180.png
>
>
> Does anyone have any experience with the DGN file format by MicroStation? I 
> see TIKA doesn't have a parser so would it be possible to create one? 
> https://docs.fileformat.com/cad/dgn/




[jira] [Commented] (TIKA-3721) DGN parser

2022-04-22 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17526352#comment-17526352
 ] 

Nick Burch commented on TIKA-3721:
--

The mime types mentioned at 
[https://communities.bentley.com/products/projectwise/w/wiki/5617/5617] don't 
match our normal convention nor the conventions from other formats, so I'd 
propose

Common base with the globs - {{image/vnd.dgn}}

version 7 - {{image/vnd.dgn;version=7}}

version 8 - {{image/vnd.dgn;version=8}} with an alias of {{image/vnd.dgn;ver=8}}

> DGN parser
> --
>
> Key: TIKA-3721
> URL: https://issues.apache.org/jira/browse/TIKA-3721
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 2.3.0
>Reporter: Dan Coldrick
>Priority: Minor
>
> Does anyone have any experience with the DGN file format by MicroStation? I 
> see TIKA doesn't have a parser so would it be possible to create one? 
> https://docs.fileformat.com/cad/dgn/




[jira] [Commented] (TIKA-3721) DGN parser

2022-04-22 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17526336#comment-17526336
 ] 

Nick Burch commented on TIKA-3721:
--

We've had the OK from the author of the tika-dgn-detector

I'd propose to create an image/vnd.dgn type which gets the globs, then v7 with 
the magic as a subtype and the v8 with no magic which the detector would 
return. That's slightly different to what tika-dgn-detector has though, but 
more in keeping with our other "versions are actually very different kinds of 
files" formats.


[jira] [Commented] (TIKA-3721) DGN parser

2022-04-22 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17526324#comment-17526324
 ] 

Nick Burch commented on TIKA-3721:
--

That detector is written in Kotlin, but should be pretty easy to re-implement 
in Java (including it in the existing POIFS container detector). I've dropped 
an email to the author of that project to check they're happy with us doing that


[jira] [Commented] (TIKA-3719) Tika Server Ability to Run HTTPs

2022-04-21 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17525747#comment-17525747
 ] 

Nick Burch commented on TIKA-3719:
--

Those look like the steps needed. I'd suggest we create ours as something like

{{keytool -genkeypair -alias tika-ssl-testing -keyalg RSA -keysize 2048 
-keypass tika-secret -storepass tika-secret -validity  -keystore 
test-ssl.keystore.p12 -storetype PKCS12 -ext SAN=DNS:localhost,IP:127.0.0.1 
-dname "CN=localhost, OU=Tika Testing"}}

That will create a PKCS12 formatted keystore with a self-signed key+cert, 
password of tika-secret, which can then be loaded for a test server
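
On the loading side, a minimal sketch using the standard {{java.security.KeyStore}} API. The file name and password are the ones from the keytool command; the class and method names are just for illustration, and how tika-server would actually wire the keystore in is not settled here.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.KeyStore;
import java.util.Collections;
import java.util.List;

class KeystoreLoadSketch {
    /** Loads a PKCS12 keystore; a null stream gives an empty, initialised keystore. */
    static KeyStore loadPkcs12(InputStream in, char[] password) throws Exception {
        KeyStore keyStore = KeyStore.getInstance("PKCS12");
        keyStore.load(in, password);
        return keyStore;
    }

    static List<String> aliases(KeyStore keyStore) throws Exception {
        return Collections.list(keyStore.aliases());
    }

    public static void main(String[] args) throws Exception {
        String path = args.length > 0 ? args[0] : "test-ssl.keystore.p12";
        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            KeyStore keyStore = loadPkcs12(in, "tika-secret".toCharArray());
            // Should list the alias given to keytool if the file was created as above.
            System.out.println(aliases(keyStore));
        }
    }
}
```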

> Tika Server Ability to Run HTTPs
> 
>
> Key: TIKA-3719
> URL: https://issues.apache.org/jira/browse/TIKA-3719
> Project: Tika
>  Issue Type: Wish
>  Components: tika-server
>Affects Versions: 2.3.0
>Reporter: Dan Coldrick
>Priority: Minor
>
> We need the ability to run TIKA server as a https end point, I can't see 
> anything in the config that allows for this. 
> Looks like I'm not the only one:
> [https://stackoverflow.com/questions/7031/apache-tika-convert-apache-tika-server-rest-endpointsjax-rs-http-to-https]
>  
> If anyone can point to some documentation on how it might be possible it 
> would be really appreciated.
>  
> Thanks




[jira] [Commented] (TIKA-3725) Add Authorization to Tika Server (Suggest Basic to start off with)

2022-04-21 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17525588#comment-17525588
 ] 

Nick Burch commented on TIKA-3725:
--

Something like OAuth would be pretty different to basic auth, due to the need 
to do all the redirects. SSL client auth would be different again.

Maybe just focus on basic auth with username and password to start with? If so, 
I'd lean towards an interface which takes username + password and returns 
true/false. Then have a single implementation which supports a single username 
and password, username defaults to Tika and can be changed with ENV variable or 
config, password always required from ENV variable or config. Supporting a DB 
of user details (even if only .htpasswd style or like tomcat-users.xml) feels 
like overkill for v1

That's assuming we can't just find some CXF plugin to do it all for us
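
To make the shape of that concrete, here is a minimal sketch of the interface plus the single-user implementation. Everything named here ({{CredentialChecker}}, {{SingleUserChecker}}, and the {{TIKA_AUTH_USER}} / {{TIKA_AUTH_PASSWORD}} variable names) is hypothetical and only for illustration; none of it is existing Tika or CXF API.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Map;

// Hypothetical interface: takes username + password, returns true/false.
interface CredentialChecker {
    boolean isAllowed(String username, String password);
}

// Hypothetical single-user implementation of the interface above.
final class SingleUserChecker implements CredentialChecker {
    private final String username;
    private final String password;

    SingleUserChecker(String username, String password) {
        // Username is optional and defaults to "Tika".
        this.username = (username == null || username.isEmpty()) ? "Tika" : username;
        // Password is always required, there is deliberately no default.
        if (password == null || password.isEmpty()) {
            throw new IllegalArgumentException("password is required");
        }
        this.password = password;
    }

    // Builds the checker from an environment-style map, e.g. System.getenv().
    // The variable names are assumptions made up for this sketch.
    static SingleUserChecker fromEnv(Map<String, String> env) {
        return new SingleUserChecker(env.get("TIKA_AUTH_USER"), env.get("TIKA_AUTH_PASSWORD"));
    }

    @Override
    public boolean isAllowed(String candidateUser, String candidatePassword) {
        if (candidateUser == null || candidatePassword == null) {
            return false;
        }
        // Compare both values before combining, so a wrong username
        // does not short-circuit the password comparison.
        boolean userOk = sameBytes(username, candidateUser);
        boolean passwordOk = sameBytes(password, candidatePassword);
        return userOk & passwordOk;
    }

    private static boolean sameBytes(String expected, String actual) {
        return MessageDigest.isEqual(
                expected.getBytes(StandardCharsets.UTF_8),
                actual.getBytes(StandardCharsets.UTF_8));
    }
}
```

A DB-backed or .htpasswd-style implementation could later sit behind the same interface without the server code changing.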

> Add Authorization to Tika Server (Suggest Basic to start off with)
> --
>
> Key: TIKA-3725
> URL: https://issues.apache.org/jira/browse/TIKA-3725
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-server
>Affects Versions: 2.3.0
>Reporter: Dan Coldrick
>Priority: Minor
>
> It would be good to get some Authentication/Authorization added to TIKA server 
> to be able to add another layer of security around the Tika Server Rest 
> service.
> This could become a rabbit hole with the number of options available around 
> Authentication/Authorization (Oauth, OpenId etc) so suggest as a starter 
> basic Auth is added. 
> How to store user(s)/password suggest looking at how other apache products do 
> the same?  




[jira] [Commented] (TIKA-3719) Tika Server Ability to Run HTTPs

2022-04-21 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17525578#comment-17525578
 ] 

Nick Burch commented on TIKA-3719:
--

For testing it, I'd be tempted to create a self-signed certificate for 
localhost valid for eg 30 years, with a well known password, and pop that into 
test/resources. Then have a test that starts the server passing in that, 
verifies it starts and does a call without error with all the ssl validation 
(eg untrusted) turned off. Likely to be simpler than doing it "properly" with a 
test CA issuing a test cert and a test verifying the cert with the CA.

Happy to create such a keystore if it'd help, it'd be pretty similar to what you 
need to do for Alfresco+SOLR so I've got notes somewhere on that!
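
On the client side of such a test, "all the ssl validation turned off" could be sketched with plain JSSE as below. The class name {{TrustAllForTests}} is made up; this accepts any certificate, so it is only suitable for talking to a throwaway test server and never for production code. Hostname verification is a separate setting on whichever HTTP client the test uses and is not covered here.

```java
import java.security.SecureRandom;
import java.security.cert.X509Certificate;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

// Hypothetical test-only helper: builds an SSLContext that trusts every certificate.
final class TrustAllForTests {

    static SSLContext insecureContext() throws Exception {
        TrustManager[] trustAll = { new X509TrustManager() {
            @Override
            public void checkClientTrusted(X509Certificate[] chain, String authType) {
                // Deliberately no validation.
            }

            @Override
            public void checkServerTrusted(X509Certificate[] chain, String authType) {
                // Deliberately no validation, so a self-signed cert is accepted.
            }

            @Override
            public X509Certificate[] getAcceptedIssuers() {
                return new X509Certificate[0];
            }
        } };
        SSLContext context = SSLContext.getInstance("TLS");
        context.init(null, trustAll, new SecureRandom());
        return context;
    }
}
```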

> Tika Server Ability to Run HTTPs
> 
>
> Key: TIKA-3719
> URL: https://issues.apache.org/jira/browse/TIKA-3719
> Project: Tika
>  Issue Type: Wish
>  Components: tika-server
>Affects Versions: 2.3.0
>Reporter: Dan Coldrick
>Priority: Minor
>
> We need the ability to run TIKA server as a https end point, I can't see 
> anything in the config that allows for this. 
> Looks like I'm not the only one:
> [https://stackoverflow.com/questions/7031/apache-tika-convert-apache-tika-server-rest-endpointsjax-rs-http-to-https]
>  
> If anyone can point to some documentation on how it might be possible it 
> would be really appreciated.
>  
> Thanks




[jira] [Commented] (TIKA-3721) DGN parser

2022-04-19 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17524718#comment-17524718
 ] 

Nick Burch commented on TIKA-3721:
--

After a quick look, I can't spot any free tools or libraries for working with 
these files. OpenDGN appears to not use our normal sense of open, and seems to 
want an expensive SDK license

Did find a nice document on the DWG file format on the new OpenDGN site - 
[https://www.opendesign.com/files/guestdownloads/OpenDesign_Specification_for_.dwg_files.pdf]
 - but nothing for the DGN format there that I can find

If you're able to locate a tool or library, we can look at adding support. 
Alternately if your company has licensed the SDK, it's fairly easy for you to 
build your own custom Tika parser to wrap it, see 
https://tika.apache.org/2.3.0/parser_guide.html

> DGN parser
> --
>
> Key: TIKA-3721
> URL: https://issues.apache.org/jira/browse/TIKA-3721
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 2.3.0
>Reporter: Dan Coldrick
>Priority: Minor
>
> Does anyone have any experience with the DGN file format by MicroStation? I 
> see TIKA doesn't have a parser so would it be possible to create one? 
> https://docs.fileformat.com/cad/dgn/




[jira] [Commented] (TIKA-3571) Add an interface for rendering engines

2022-04-05 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17517818#comment-17517818
 ] 

Nick Burch commented on TIKA-3571:
--

It has been quite a while since I last used jodconverter, but the underlying 
OpenOffice would crash or infinite loop rather more often than you'd normally 
like. Docker and a restart watchdog ought to help with that though!

> Add an interface for rendering engines
> --
>
> Key: TIKA-3571
> URL: https://issues.apache.org/jira/browse/TIKA-3571
> Project: Tika
>  Issue Type: Wish
>Reporter: Tim Allison
>Priority: Major
>
> We've now seen a few requests for extracting text _and_ rendering PDFs, and 
> certainly it might be useful to have alternatives for rendering files (e.g. 
> this [Alfresco 
> study|https://hub.alfresco.com/t5/alfresco-content-services-blog/pdf-rendering-engine-performance-and-fidelity-comparison/ba-p/287618]),
>  including MSOffice or at least PPTx...
> And there are cases where users don't want the rendered images, but they do 
> want OCR to be run against the rendered images.
> I doubt I'll have a chance to work on this for a while, but I wanted to open 
> an issue for discussion.




[jira] [Commented] (TIKA-3711) Image file names included in parsed Word Document text

2022-04-03 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17516459#comment-17516459
 ] 

Nick Burch commented on TIKA-3711:
--

I'd lean towards putting the file name as an attribute of the img tag, along 
with the description as the alt text if the format supports it

> Image file names included in parsed Word Document text
> --
>
> Key: TIKA-3711
> URL: https://issues.apache.org/jira/browse/TIKA-3711
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 2.3.0
>Reporter: Sam Stephens
>Priority: Minor
> Attachments: word-doc-with-image.docx
>
>
> The attached Word document includes nothing but a single image. Running it 
> through the Tika 2.2.0 AutoDetectParser correctly returns null. Running it 
> through the Tika 2.3.0 AutoDetectParser returns the text:
> {{image1.png}}
>  




[jira] [Commented] (TIKA-3696) Add detection for wacz files

2022-03-10 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17504378#comment-17504378
 ] 

Nick Burch commented on TIKA-3696:
--

Shouldn't it be more like {{application/x-wacz}}  since it isn't a standard / 
official one?
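
Whatever the type ends up being called, the entry-name check described in the issue could be sketched along these lines with only {{java.util.zip}}. This is not the actual Tika zip detector; the class name {{WaczSniffer}} is hypothetical, and treating {{datapackage.json}} plus something under {{archive/}} as the deciding entries is an assumption made for this sketch.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical sketch: decides whether a zip stream looks like a WACZ file
// by looking at its entry names only.
final class WaczSniffer {

    // Non-standard type, hence the x- prefix.
    static final String WACZ_TYPE = "application/x-wacz";

    static boolean looksLikeWacz(InputStream in) throws IOException {
        boolean hasDatapackage = false;
        boolean hasArchive = false;
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                if (name.equals("datapackage.json")) {
                    hasDatapackage = true;
                } else if (name.startsWith("archive/")) {
                    hasArchive = true;
                }
            }
        }
        // Assumption: these two are enough to tell WACZ from a generic zip.
        return hasDatapackage && hasArchive;
    }
}
```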

> Add detection for wacz files
> 
>
> Key: TIKA-3696
> URL: https://issues.apache.org/jira/browse/TIKA-3696
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Minor
>
> https://webrecorder.github.io/wacz-spec/1.2.0/
> Zip file with standard entries: 'archive', 'datapackage.json', 'indexes' and 
> 'pages'.




[jira] [Commented] (TIKA-3684) Extract text returns the text multiple times

2022-03-10 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17504150#comment-17504150
 ] 

Nick Burch commented on TIKA-3684:
--

Same as Tika 2.x - pass a {{--config}} flag when you start the server

> Extract text returns the text multiple times
> 
>
> Key: TIKA-3684
> URL: https://issues.apache.org/jira/browse/TIKA-3684
> Project: Tika
>  Issue Type: Bug
>  Components: docker
>Affects Versions: 2.1.0
>Reporter: Naama Hophstatder
>Priority: Major
> Attachments: example.docx, example.json, tika-config-no-xmf.xml
>
>
> We are using tika docker container as a linux service, when I want to extract 
> text from a word document, e.g.:
> curl -T example.docx http://localhost:9998/tika --header "Accept: text/plain"
> we get the text 3 times.
> Notice: We also have tika server v1.14, and this version returns the text 
> just as expected.




[jira] [Resolved] (TIKA-3694) Tika Server endpoint to return more details on a mime type

2022-03-07 Thread Nick Burch (Jira)



 [ 
https://issues.apache.org/jira/browse/TIKA-3694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Burch resolved TIKA-3694.
--
Fix Version/s: 2.3.1
   Resolution: Fixed

> Tika Server endpoint to return more details on a mime type
> --
>
> Key: TIKA-3694
> URL: https://issues.apache.org/jira/browse/TIKA-3694
> Project: Tika
>  Issue Type: Improvement
>  Components: mime, server
>Affects Versions: 2.3.0
>Reporter: Nick Burch
>Priority: Major
> Fix For: 2.3.1
>
>
> As raised on the user list - 
> [https://lists.apache.org/thread/mhtj6cgf323525hs6dow1oz68nkqwfgy] - users 
> calling the Java APIs are able to get additional details on a mime type, such 
> as common extensions and descriptions. Those calling the Tika Server can only 
> get limited information on mime types, such as which are known to Tika
> In addition to the current {{/mime-types}} endpoint (html/json/text), we 
> should add a more detailed one that takes a specific type.




[jira] [Commented] (TIKA-3694) Tika Server endpoint to return more details on a mime type

2022-03-07 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17502627#comment-17502627
 ] 

Nick Burch commented on TIKA-3694:
--

I've added new HTML and JSON endpoints {{/mime-types/type/subtype}} which 
return additional details on the specified type (or 404 if unknown), eg
{code:java}
{
  "extensions" : [ ".cbor" ],
  "acronym" : "CBOR",
  "alias" : [ ],
  "description" : "Concise Binary Object Representation container",
  "links" : [ "http://tools.ietf.org/html/rfc7049" ],
  "type" : "application/cbor",
  "defaultExtension" : ".cbor"
}{code}
On the basis that people may have custom parsing around the all-types Text and 
JSON endpoints, no change made there to the output. The all-types HTML endpoint 
now returns a little bit more info, and links to the full details one.
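
A client calling the new endpoint could look something like this sketch, using the JDK 11+ {{java.net.http}} client. The class name {{MimeTypeDetailsClient}} is made up, the base URL is whatever your server runs on, and it assumes JSON is selected through the {{Accept}} header in the same way as the existing endpoints.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Optional;

// Hypothetical client sketch for the /mime-types/type/subtype endpoint.
final class MimeTypeDetailsClient {

    // Builds e.g. http://localhost:9998/mime-types/application/cbor
    static URI detailsUri(String serverBase, String mediaType) {
        String base = serverBase.endsWith("/")
                ? serverBase.substring(0, serverBase.length() - 1)
                : serverBase;
        return URI.create(base + "/mime-types/" + mediaType);
    }

    // Returns the JSON body, or empty if the server answers 404 (unknown type).
    static Optional<String> fetchJson(String serverBase, String mediaType) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(detailsUri(serverBase, mediaType))
                .header("Accept", "application/json")
                .GET()
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() == 404) {
            return Optional.empty();
        }
        return Optional.of(response.body());
    }
}
```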

> Tika Server endpoint to return more details on a mime type
> --
>
> Key: TIKA-3694
> URL: https://issues.apache.org/jira/browse/TIKA-3694
> Project: Tika
>  Issue Type: Improvement
>  Components: mime, server
>Affects Versions: 2.3.0
>Reporter: Nick Burch
>Priority: Major
>
> As raised on the user list - 
> [https://lists.apache.org/thread/mhtj6cgf323525hs6dow1oz68nkqwfgy] - users 
> calling the Java APIs are able to get additional details on a mime type, such 
> as common extensions and descriptions. Those calling the Tika Server can only 
> get limited information on mime types, such as which are known to Tika
> In addition to the current {{/mime-types}} endpoint (html/json/text), we 
> should add a more detailed one that takes a specific type.




[jira] [Created] (TIKA-3694) Tika Server endpoint to return more details on a mime type

2022-03-07 Thread Nick Burch (Jira)

Nick Burch created TIKA-3694:


 Summary: Tika Server endpoint to return more details on a mime type
 Key: TIKA-3694
 URL: https://issues.apache.org/jira/browse/TIKA-3694
 Project: Tika
  Issue Type: Improvement
  Components: mime, server
Affects Versions: 2.3.0
Reporter: Nick Burch


As raised on the user list - 
[https://lists.apache.org/thread/mhtj6cgf323525hs6dow1oz68nkqwfgy] - users 
calling the Java APIs are able to get additional details on a mime type, such 
as common extensions and descriptions. Those calling the Tika Server can only 
get limited information on mime types, such as which are known to Tika

In addition to the current {{/mime-types}} endpoint (html/json/text), we should 
add a more detailed one that takes a specific type.




[jira] [Commented] (TIKA-3686) CSS file detected as JavaScript (application/javascript)

2022-03-03 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17500804#comment-17500804
 ] 

Nick Burch commented on TIKA-3686:
--

Detecting types of text-based files with magic is always going to fail for some 
cases. There are no sure-fire things to match on, only guesses

If you're sure that your files have the right extensions on them, just ask Tika 
to detect by filename only, no contents

> CSS file detected as JavaScript (application/javascript)
> 
>
> Key: TIKA-3686
> URL: https://issues.apache.org/jira/browse/TIKA-3686
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 2.0.0-ALPHA
>Reporter: Marius Dumitru Florea
>Priority: Major
>
> The following CSS file 
> [https://github.com/techlab/jquery-smartwizard/blob/v5.1.1/dist/css/smart_wizard_all.min.css]
>  is detected as {{application/javascript}} using:
> {noformat}
> TikaUtils.detect(InputStream stream, String name)
> {noformat}
> The reason seems to be that the CSS file starts with:
> {noformat}
> /*!
>  * jQuery
> {noformat}
> which matches the "jQuery" entry from 
> [tika-mimetypes.xml|https://github.com/apache/tika/blob/2.3.0/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L348]
>  used by Tika's {{MimeTypes}} detector.
> This is a regression introduced by 
> https://github.com/apache/tika/commit/97699598f000139b1222b785d634b3c8a8e216c7
>  in TIKA-1141 (2.0.0-ALPHA).
> The implications are serious if the mime type returned by Tika is used to set 
> the content type on the HTTP request returning the CSS file to the browser: 
> the browser ignores the CSS.
> FTR, in my case the CSS file is not served directly from the file system but 
> from a WebJar (in this case 
> https://search.maven.org/artifact/org.webjars.npm/smartwizard/5.1.1/jar ) and 
> we're using Tika to determine the type of files requested from the WebJars.




[jira] [Commented] (TIKA-3676) Consider making dl4j dependencies provided

2022-02-09 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489597#comment-17489597
 ] 

Nick Burch commented on TIKA-3676:
--

As long as we provide sensible instructions on what to do, I'm happy to make 
this like our other "large bundle of native code" case for sqlite and require 
users to add the relevant pom entry for their platform / kitchen sink it 
themselves

> Consider making dl4j dependencies provided
> --
>
> Key: TIKA-3676
> URL: https://issues.apache.org/jira/browse/TIKA-3676
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> Dl4j dependencies are rather large.  We can cut ~4-6 minutes off the build 
> time and prevent gigabytes transferring over various networks during the 
> release cycle (at least).  With the recent upgrade to dl4j, the jar is now 
> 1.4GB, up from ~800MB in our 1.x branch.
> We are currently packaging the kitchen-sink, e.g. every platform's native 
> libraries.  For folks using our wrappers/parsers around dl4j, they can a) 
> easily include the dependencies that are "provided" or b) tailor their 
> dependencies for their OS/architecture.
> What do you think? 




[jira] [Commented] (TIKA-3656) Tika returns wrong content type for docx types.

2022-01-24 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17480955#comment-17480955
 ] 

Nick Burch commented on TIKA-3656:
--

That POM is your problem, you aren't including any of the container aware 
dependencies which come with the Parsers

Try adding a dependency such as tika-parsers-standard or 
tika-parser-microsoft-module

> Tika returns wrong content type for docx types.
> ---
>
> Key: TIKA-3656
> URL: https://issues.apache.org/jira/browse/TIKA-3656
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 2.2.0
> Environment: Windows 10, Java 1.8
>Reporter: Ajesh
>Priority: Major
>
> Steps to reproduce
>  # Select a DOCX file say example.docx
>  # Rename the DOCX file to PDF say example.pdf
>  # Use Tika to detect the content type of the example.pdf file
>  # Returns application/zip instead  
> application/vnd.openxmlformats-officedocument.wordprocessingml.document




[jira] [Commented] (TIKA-3656) Tika returns wrong content type for docx types.

2022-01-21 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17479981#comment-17479981
 ] 

Nick Burch commented on TIKA-3656:
--

How are you calling Tika? And do you have the office parsers on your classpath 
along with all their dependencies?

> Tika returns wrong content type for docx types.
> ---
>
> Key: TIKA-3656
> URL: https://issues.apache.org/jira/browse/TIKA-3656
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 2.2.0
> Environment: Windows 10, Java 1.8
>Reporter: Ajesh
>Priority: Major
>
> Steps to reproduce
>  # Select a DOCX file say example.docx
>  # Rename the DOCX file to PDF say example.pdf
>  # Use Tika to detect the content type of the example.pdf file
>  # Returns application/zip instead  
> application/vnd.openxmlformats-officedocument.wordprocessingml.document




[jira] [Commented] (TIKA-3646) MP4 files have their mime type detected as video/quicktime

2022-01-13 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17475269#comment-17475269
 ] 

Nick Burch commented on TIKA-3646:
--

I think this is probably the same issue as TIKA-2935 - the same work described 
there still needs to be done by someone who has the time + energy + interest...

> MP4 files have their mime type detected as video/quicktime
> --
>
> Key: TIKA-3646
> URL: https://issues.apache.org/jira/browse/TIKA-3646
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Reporter: Apachae Tika User
>Priority: Major
> Attachments: Video.mp4
>
>
> I was using ScreenToGif tool which allows recording the screen and creating gifs or 
> MP4 files (with ffmpeg). I've tried to use Tika Detector for such files but 
> the file is being detected as  video/quicktime with .qt extension. How is 
> that?
> Attaching small video for example which was generated with ScreenToGif and 
> saved as mp4.
> I see some other people complaining for same thing here
> [https://stackoverflow.com/questions/48021617/use-apache-tika-get-mp4-file-contenttype-got-video-quicktime]




[jira] [Commented] (TIKA-3590) OSX DMG files wrong MIME type detection (wrong MediaType and Supertype)

2021-11-16 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444644#comment-17444644
 ] 

Nick Burch commented on TIKA-3590:
--

[~salmira] Are you able to create us a few sample dmg files to test with? 
Ideally with our standard set of contents for compressed / package formats, eg 
test-documents.zip or test-documents.tar from 
[https://github.com/apache/tika/tree/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pkg-module/src/test/resources/test-documents]

[~tallison] I don't think our current structure will allow us to have one mime 
type with multiple parents, such as this where there is one "format" that is 
actually many different formats all sharing the same official mime type and 
extension. We could potentially do something nasty, and have subtypes which 
reproduce the zlib and bzip2 magics along with some sort of compressed DMG 
header, but it'd be tricky. We can't just do standard zlib or bzip2 magic, 
otherwise it'll trump the real formats, so would need to be the compressed 
outer layer magic _plus_ some sort of inner magic too. Unless we did something 
horribly evil, and had {{application/x-apple-diskimage; compression=zlib}} 
which defined a parent of zlib and not officially 
{{application/x-apple-diskimage}} - not sure if the parent and parameter 
matching would let us get away with that. Double magic would be 
safer/cleaner, but we do need some test files to check it all with!

> OSX DMG files wrong MIME type detection (wrong MediaType and Supertype)
> ---
>
> Key: TIKA-3590
> URL: https://issues.apache.org/jira/browse/TIKA-3590
> Project: Tika
>  Issue Type: Bug
>  Components: core, detector
>Affects Versions: 1.26, 1.27, 2.0.0-ALPHA, 2.0.0-BETA, 2.1.0
>Reporter: Tetiana Tvardovska
>Priority: Major
>
> Calling {{mimeSupport.detectMimeTypes}} for  OSX DMG files returns a wrong 
> value.
> DMG files are detected as MIME type: {{*"application/zlib"*}} or 
> *{{"application/x-bzip"}}*
> instead of expected: *{{"application/x-apple-diskimage".}}*
>  
> Error is caused by {{getSupertype}} method which returns a wrong type (too 
> "super" {{MediaType.OCTET_STREAM}}) for OSX DMG files instead of 
> {{"application/zlib"}} or {{"application/x-bzip"}}.
>  
> For information, DMG mime type is correctly detected when debugging the  
> method
>  
> {code:java}
> org/apache/tika/mime/MimeTypes.java:484  public MediaType detect(...
> 522:  MimeType hint = getMimeType(name); 
> {code}
>   the {{hint}} value gets a correct *{{"application/x-apple-diskimage"}}* 
> value here.
> But later the {{hint}} value is not taken into consideration for 
> {{possibleTypes}}  as {{applyHint}} results:
>  
> {code:java}
> 529:  possibleTypes = applyHint(possibleTypes, hint);{code}
>  
> This wrong value is returned to : 
>  
> {code:java}
> repository/org/apache/tika/tika-core/1.26/tika-core-1.26-sources.jar!/org/apache/tika/detect/CompositeDetector.java:84
> MediaType detected = detector.detect(input, metadata);
> if (registry.isSpecializationOf(detected, type)) {
> type = detected;
> }
> {code}
>  
>  
> h3. Possible solution - Add a more precise Supertype detection for 
> {{application/x-apple-diskimage}} type
> Just add one more verification into the 
> {{MediaTypeRegistry.getSupertype}} method, for example, in a 
> 'diff'-like format:
> {{org/apache/tika/tika-core/1.26/tika-core-1.26-sources.jar}}
> {{org/apache/tika/mime/MediaTypeRegistry.java:187}}
>  
> {code:java}
> public MediaType getSupertype(MediaType type) {
>  ...
> +} else if (type.getSubtype().endsWith("x-apple-diskimage")) { 
> +    return MediaType.application("x-bzip");
> +}
> ...
> }
> {code}
>  
> or
> {code:java}
> public MediaType getSupertype(MediaType type) {
>  ...
> +} else if (type.getSubtype().endsWith("x-apple-diskimage")) { 
> +        return MediaType.APPLICATION_ZIP;
> +}
> ...
> }
> {code}
>  
>  
> ---
> Tested at project [Sonatype Nexus|https://github.com/sonatype/nexus-public/] 
> {{release-3.36.0-01 }}for RAW repository with a "Strict Content Type 
> Validation" set ON when trying to upload *.dmg files.




[jira] [Commented] (TIKA-3582) Tika does not respect a configuration value passed over a HTTP Header

2021-10-26 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434493#comment-17434493
 ] 

Nick Burch commented on TIKA-3582:
--

Bit fiddly, but how about a config option on the server for the minimum (with a 
sensible default), and the header never lets you go below that minimum?
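
Roughly what I have in mind, as a sketch only: the class and method names are made up, and treating a missing or unparseable header as "use the minimum" is an assumption of this example rather than something decided on the issue.

```java
// Hypothetical sketch: the per-request header may raise the timeout,
// but can never take it below the server's configured minimum.
final class OcrTimeoutPolicy {

    static int effectiveTimeoutSeconds(String headerValue, int serverMinimumSeconds) {
        if (headerValue == null || headerValue.trim().isEmpty()) {
            // No header sent: fall back to the configured minimum.
            return serverMinimumSeconds;
        }
        try {
            int requested = Integer.parseInt(headerValue.trim());
            return Math.max(requested, serverMinimumSeconds);
        } catch (NumberFormatException e) {
            // Junk in the header: ignore it rather than fail the request.
            return serverMinimumSeconds;
        }
    }
}
```

With a minimum of 120, a request sending {{X-Tika-OCRTimeoutSeconds: 600}} would get 600, while one sending 30 would still get 120.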

> Tika does not respect a configuration value passed over a HTTP Header
> -
>
> Key: TIKA-3582
> URL: https://issues.apache.org/jira/browse/TIKA-3582
> Project: Tika
>  Issue Type: Bug
>  Components: server
>Affects Versions: 2.1.0
>Reporter: dataminer.accolade
>Assignee: Tim Allison
>Priority: Major
> Fix For: 2.1.1
>
> Attachments: sampleimage.png
>
>
>  
> I think the value of TikaServerConfig.TaskTimeoutMillis should be overridden 
> for the current request over *X-Tika-OCRTimeoutSeconds* header. The following 
> request takes more than 120 seconds.
> *curl -vvv -X PUT -T sampleimage.png http://localhost:9998/tika --header 
> "X-Tika-OCRTimeoutSeconds: 600"*
>  
> Tesseract is configured with tessdata_best models




[jira] [Commented] (TIKA-3570) LYR file detection

2021-10-13 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17428246#comment-17428246
 ] 

Nick Burch commented on TIKA-3570:
--

[~delmaestro_l] Does that sample file load in the program that generated it?

Apache POI (which Tika uses for OLE2 / Office files) can't read it - 
{{Exception in thread "main" java.lang.IndexOutOfBoundsException: Block 
538976259 not found}} - and the python library OleFileIO_PL can't read it 
either - {{IOError: OLE sector index out of range}}

Those two errors both seem to suggest the file is corrupted or truncated

> LYR file detection
> --
>
> Key: TIKA-3570
> URL: https://issues.apache.org/jira/browse/TIKA-3570
> Project: Tika
>  Issue Type: Improvement
>  Components: detector, mime
>Affects Versions: 1.7
>Reporter: Laura Delmaestro
>Priority: Minor
> Attachments: sample.lyr
>
>
> Tika 1.7 returns for .lyr files: application/x-tika-msoffice.
> Is it possible to have a more specific response about .lyr format?
> More details can be found at 
> [https://www.filetypeadvisor.com/it/extension/lyr]
> We could not find the mime type of this kind of files
> Thanks




[jira] [Commented] (TIKA-3570) LYR file detection

2021-10-12 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17427920#comment-17427920
 ] 

Nick Burch commented on TIKA-3570:
--

Do you have a small sample file that you can share with us, ideally one we can 
use in testing + unit testing?

From what is described it sounds like the LYR format is based on the OLE2 
container format, so we need an example to work out the key identifying 
directory/document nodes for detection

> LYR file detection
> --
>
> Key: TIKA-3570
> URL: https://issues.apache.org/jira/browse/TIKA-3570
> Project: Tika
>  Issue Type: Improvement
>  Components: detector, mime
>Affects Versions: 1.7
>Reporter: Laura Delmaestro
>Priority: Minor
>
> Tika 1.7 returns for .lyr files: application/x-tika-msoffice.
> Is it possibile to have a more specific response about .lyr format?
> More details can be found at 
> [https://www.filetypeadvisor.com/it/extension/lyr]
> We could not find the mime type of this kind of files
> Thanks




[jira] [Commented] (TIKA-3559) Add MIME type for .webmanifest files

2021-09-22 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17418531#comment-17418531
 ] 

Nick Burch commented on TIKA-3559:
--

I'm not sure if the example in the spec is under a suitable license. I think 
code snippets in MDN are - based on 
[https://developer.mozilla.org/en-US/docs/MDN/About] - so potentially we could 
use that one for testing. Ideally need a second opinion though!

> Add MIME type for .webmanifest files
> 
>
> Key: TIKA-3559
> URL: https://issues.apache.org/jira/browse/TIKA-3559
> Project: Tika
>  Issue Type: New Feature
>Reporter: Olle Jonsson
>Priority: Minor
>
> Hello!
> This issue is a proposal to add the MIME type for [Web 
> Manifest|https://www.w3.org/TR/appmanifest/].
> * [Description of the new MIME type in the W3 "Web Application Manifest" 
> editor's draft.|https://www.w3.org/TR/appmanifest/#media-type-registration]
> * [Example of usage in the MDN article on Web 
> Manifest|https://developer.mozilla.org/en-US/docs/Web/Manifest#deploying_a_manifest_with_the_link_tag]
> * [JSON Schema for it|https://w3c.github.io/manifest/#json-schema]
> *Example implementation*: What it could look like the XML in the Tika 
> project: 
> {code:xml}
> 
>   <_comment>Web Application Manifest file
>   
>   
> 
>  
> {code}
> I was looking through 
> https://github.com/apache/tika/blob/main/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
>  and could not find any such mention.
>  
> *Example Web Manifest* file from the MDN page:
> It contains related_applications, marked at-risk - 
> https://w3c.github.io/manifest/#related_applications-member
> {code:json}
> {
>   "name": "HackerWeb",
>   "short_name": "HackerWeb",
>   "start_url": ".",
>   "display": "standalone",
>   "background_color": "#fff",
>   "description": "A readable Hacker News app.",
>   "icons": [{
> "src": "images/touch/homescreen48.png",
> "sizes": "48x48",
> "type": "image/png"
>   }, {
> "src": "images/touch/homescreen72.png",
> "sizes": "72x72",
> "type": "image/png"
>   }, {
> "src": "images/touch/homescreen96.png",
> "sizes": "96x96",
> "type": "image/png"
>   }, {
> "src": "images/touch/homescreen144.png",
> "sizes": "144x144",
> "type": "image/png"
>   }, {
> "src": "images/touch/homescreen168.png",
> "sizes": "168x168",
> "type": "image/png"
>   }, {
> "src": "images/touch/homescreen192.png",
> "sizes": "192x192",
> "type": "image/png"
>   }],
>   "related_applications": [{
> "platform": "play",
> "url": "https://play.google.com/store/apps/details?id=cheeaun.hackerweb"
>   }]
> }
> {code}
> *Example manifest* from the spec:
> {code:json}
> {
>   "lang": "en",
>   "dir": "ltr",
>   "name": "Super Racer 3000",
>   "short_name": "Racer3K",
>   "icons": [{
> "src": "icon/lowres.webp",
> "sizes": "64x64",
> "type": "image/webp"
>   }, {
> "src": "icon/lowres.png",
> "sizes": "64x64"
>   }, {
> "src": "icon/hd_hi",
> "sizes": "128x128"
>   }],
>   "scope": "/",
>   "start_url": "/start.html",
>   "display": "fullscreen",
>   "orientation": "landscape",
>   "theme_color": "aliceblue",
>   "background_color": "red"
> }
> {code}




[jira] [Commented] (TIKA-3559) Add MIME type for .webmanifest files

2021-09-22 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17418505#comment-17418505
 ] 

Nick Burch commented on TIKA-3559:
--

As we get more JSON-based formats, I wonder if we should do a detector for them?

Looks like there are no required elements, if I have read the spec right, which 
makes it a bit tricky. But we probably could write a detector that parses the 
json, and if it sees the most common few keys then detects the type, similar to 
how the zip detector works on entry filenames

Otherwise [~olleolleolle] , any chance of a small example manifest file we can 
use for unit testing the detection?
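
As a very rough illustration of the "common keys" idea, the sketch below counts how many typical manifest keys show up. To stay self-contained it uses a naive regex instead of a real JSON parser, so it also matches nested keys and text inside string values; a real detector would want proper parsing. The class name, the key list and the threshold are all assumptions made for this example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: guesses whether some JSON text is a web manifest
// by counting commonly used keys, similar in spirit to matching zip entry names.
final class WebManifestSniffer {

    // Assumed "most common" keys; none of them are required by the spec.
    private static final List<String> COMMON_KEYS =
            Arrays.asList("name", "short_name", "start_url", "display", "icons");

    // Naive match for "key": pairs. This is not a JSON parser.
    private static final Pattern KEY = Pattern.compile("\"([A-Za-z_]+)\"\\s*:");

    static boolean looksLikeWebManifest(String json, int minimumHits) {
        Set<String> seen = new HashSet<>();
        Matcher matcher = KEY.matcher(json);
        while (matcher.find()) {
            String key = matcher.group(1);
            if (COMMON_KEYS.contains(key)) {
                seen.add(key);
            }
        }
        return seen.size() >= minimumHits;
    }
}
```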

> Add MIME type for .webmanifest files
> 
>
> Key: TIKA-3559
> URL: https://issues.apache.org/jira/browse/TIKA-3559
> Project: Tika
>  Issue Type: New Feature
>Reporter: Olle Jonsson
>Priority: Minor
>
> Hello!
> This issue is a proposal to add the MIME type for [Web 
> Manifest|https://www.w3.org/TR/appmanifest/].
> * [Description of the new MIME type in the W3 "Web Application Manifest" 
> editor's draft.|https://www.w3.org/TR/appmanifest/#media-type-registration]
> * [Example of usage in the MDN article on Web 
> Manifest|https://developer.mozilla.org/en-US/docs/Web/Manifest#deploying_a_manifest_with_the_link_tag]
> *Example implementation*: What it could look like the XML in the Tika 
> project: 
> {code:xml}
> 
>   <_comment>Web Application Manifest file
>   
>   
> 
>  
> {code}
> I was looking through 
> https://github.com/apache/tika/blob/main/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
>  and could not find any such mention.
>  




[jira] [Commented] (TIKA-3558) vulnerability detected in vorbis-tika-java

2021-09-21 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17418157#comment-17418157
 ] 

Nick Burch commented on TIKA-3558:
--

That seems to be a vulnerability in the libflac C code, so it shouldn't affect 
the library we use, as that's pure Java and a fresh implementation

In terms of the library not having any recent releases, generally the basics 
are all there and nicely stable, but there is still more that could be 
implemented if any volunteers wanted to assist!

There are improvements needed in how to map metadata from files with multiple 
substreams (eg video + multiple audio), in multi-stream detection using 
Ogg Skeleton / Annodex or CMML, in extracting song lyrics from Kate streams etc.

> vulnerability detected in vorbis-tika-java
> --
>
> Key: TIKA-3558
> URL: https://issues.apache.org/jira/browse/TIKA-3558
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.27
>Reporter: brent jackson
>Priority: Major
>
> we recently had a user report that a security scan on tika-app-1.25 
> discovered a vulnerability in vorbis-tika-java. specifically:
>  
> [https://nvd.nist.gov/vuln/detail/CVE-2017-6888]
> (detected on 
> tika-app-1.25.jar/META-INF/maven/org.gagravarr/vorbis-java-tika/pom.xml)
>  
> i checked 1.27 and the org.gagravarr classes have not been updated (they all 
> date from 2016).  has this vulnerability been addressed? or is it a false 
> positive? thanks.
>  




[jira] [Commented] (TIKA-3554) Detect plain text file as application/zip based on file ext wrong

2021-09-16 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416044#comment-17416044
 ] 

Nick Burch commented on TIKA-3554:
--

Just to emphasise what Tim has written, file type detection in Apache Tika is 
based on a "do the best we can, and if in doubt assume it is something" approach.

If you are dealing with potentially malicious files from untrusted submissions, 
and you only want to let certain things through, you likely need to use a 
different and more paranoid library. You don't want to find that, eg, someone 
slaps a Zip header then 5kb of nulls on the front of a PDF, Tika thinks it is a 
slightly broken Zip, but your helpful+vulnerable PDF reader skips the first 5kb 
to find the PDF header and then cheerily loads the file
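
To make that scenario concrete, here is a small stand-alone sketch that builds such a file in memory and compares a strict "signature at offset 0" check with a lenient scan. The class and method names are made up for illustration; nothing here is Tika API.

```java
import java.nio.charset.StandardCharsets;

public class PolyglotDemo {
    // ZIP local file header signature (50 4B 03 04), as listed in the issue description.
    public static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};
    // Standard PDF header prefix.
    public static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Strict check: the signature must sit at offset 0. */
    public static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    /** Lenient scan: offset of the first occurrence of the signature, or -1. */
    public static int indexOf(byte[] data, byte[] magic) {
        outer:
        for (int i = 0; i + magic.length <= data.length; i++) {
            for (int j = 0; j < magic.length; j++) {
                if (data[i + j] != magic[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    /** Builds the file described above: Zip header, 5kb of nulls, then a PDF header. */
    public static byte[] buildPolyglot() {
        byte[] pdf = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        int padding = 5 * 1024;
        byte[] out = new byte[ZIP_MAGIC.length + padding + pdf.length];
        System.arraycopy(ZIP_MAGIC, 0, out, 0, ZIP_MAGIC.length);
        System.arraycopy(pdf, 0, out, ZIP_MAGIC.length + padding, pdf.length);
        return out;
    }
}
```

A strict check sees only the Zip signature, while anything that scans ahead for `%PDF-` finds it a little over 5kb in. That gap between what the detector reports and what a forgiving reader will open is the risk.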


[jira] [Commented] (TIKA-3554) Detect plain text file as application/zip based on file ext wrong

2021-09-16 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416009#comment-17416009
 ] 

Nick Burch commented on TIKA-3554:
--

If possible, wrap your {{InputStream}} as a {{TikaInputStream}} before passing 
to Tika. If you actually have a File, wrap that as a {{TikaInputStream}} before 
passing

> Detect plain text file as application/zip based on file ext wrong
> -
>
> Key: TIKA-3554
> URL: https://issues.apache.org/jira/browse/TIKA-3554
> Project: Tika
>  Issue Type: Bug
>  Components: detector, metadata, mime
>Affects Versions: 1.26
>Reporter: Krisztián Gyula Tóth
>Priority: Major
>  Labels: mime-type
> Attachments: image-2021-09-15-10-33-33-560.png
>
>
> *Update:* Tika detect only gets 3400bytes peeked from the input stream (and 
> the file name) and not the entire file's byte array.
> 
> *Given* a simple plain text file with the file extension `.zip` and with 
> content `Hello World!`. Example file name: "hello.txt.zip"
> *When* calling the function `tika.detect()` with the file bytes from an 
> `InputStream` using `BufferedInputStream`
> {code:java}
> String detectedMimeType = tika.detect(bytes.get(), fileItem.getName());
> {code}
> *Then* it returns `application/zip` as for the detected MimeType. (Regardless 
> the file's content is in plain text (~12byte), only the file extension 
> contains the `.zip`.)
>  
> Note: The result is the same for a file with HTML content that also has 
> `.zip` as its file ext. It’s not a super rare file type that’s hard to 
> detect. So I’d say it’s a bug in Tika.
>  
> *Expected behavior*
> Tika should detect the provided file as a plain text file and return 
> `text/plain` for the detected mime type regardless of the file extension 
> being `.zip`.
>  
> *Suggested solution:*
> Check file signature further on the file extension in case the file ext is 
> `.zip`
> To ensure that the uploaded file is really a zip archive, it should have a 
> matching file signature with one of the following:
>  * 50 4B 03 04
>  * 50 4B 05 06 (empty archive)
>  * 50 4B 07 08 (spanned archive)
> See magic numbers at Wiki page for ZIP file format: 
> https://en.wikipedia.org/wiki/ZIP_(file_format)
>  
> *Background info:*
> We are using the `Tika.detect()` to detect the file's mime type on uploading 
> to the server in a Java servlet before saving it for further processing. To 
> ensure that the client-provided file has the expected mime type and accepts 
> only that type of file. In this context, we are working with `ZIP` archives. 
> Users are only allowed to upload zip archives. But it turned out that Tika 
> cannot detect plain text files and still recognizes them as ZIP archives if 
> the file extension is given as `.zip`.
> Although there are newer versions of Apache Tika than the 1.26 we are 
> currently using, this is still an issue in the newer versions.
>  
> *How do I investigate this:*
>  # A valid zip archive with filename `archive.zip.txt` where the file 
> extension is `.txt`
>  ** Expectation: Tika should detect the file mime type as `application/zip`
>  ** Result: Provides the expected result. A valid zip archive, but with 
> having the file `.txt` file extension in its name is still detected as 
> `application/zip` successfully.
>  # A valid zip archive with filename, but without the `.zip` file extension.
>  ** Expectation: Tika should detect the file mime type as `application/zip`
>  ** Result: Provides the expected result. A valid zip archive, but without 
> having the file `.zip` file extension in its name is still detected as 
> `application/zip` successfully.
>  #  A common GIF file, but with `.zip` file extension `something.gif.zip`
>  ** Expectation: Tika should detect the file mime type as `image/gif`
>  ** Result: Provides the expected result. A GIF image with the `.zip` file 
> extension can still be detected as `image/gif`
>  # Any plain text file (can be `HTML` doc or `TEXT`) with filename 
> `myText.zip` where the file extension is `.zip`
>  ** Expectation: Tika should detect the file mime type as 
> `application/octet-stream` in general or `text/plain` or `text/html` 
> depending on the file's content.
>  ** Result: Tika `detect()` **fails**! Detects it as `application/zip`.
>  # Any plain text file (can be `HTML` doc or plain `TEXT`) with filename, but 
> without the file extension.
>  ** Expectation: Tika should detect the file mime type as 
> `application/octet-stream` in general or `text/plain` or `text/html` 
> depending on the file's content.
>  ** Result: Provides the expected result. Detects it as 
> `application/octet-stream`. (So to say it's acceptable for a file without 
> file extension and text

[jira] [Commented] (TIKA-3555) Eset antivirus found threat in the GitHub repo after Git clone

2021-09-15 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17415782#comment-17415782
 ] 

Nick Burch commented on TIKA-3555:
--

Doesn't that make us look more dodgy, and more likely to trigger an on-access 
or in-memory virus scanner block?

I'd lean more towards putting these kinds of files in a dangerously named 
subdirectory with a little readme that says something like "we handle these 
properly, but other programs don't, so take care if opening with any tools 
other than Tika"

> Eset antivirus found threat in the GitHub repo after Git clone
> --
>
> Key: TIKA-3555
> URL: https://issues.apache.org/jira/browse/TIKA-3555
> Project: Tika
>  Issue Type: Bug
>Reporter: Krisztián Gyula Tóth
>Priority: Major
> Attachments: eset_tika_alert.png, tika-suspicious-file.png
>
>
> I've just cloned this GitHub repo  [https://github.com/apache/tika]  when I 
> saw the popup from ESET antivirus on my machine.
> {code:java}
> Real-time file system protection - Threat
> Alert triggered on computer:
> C:\Git\GitHub\tika\tika-parsers\tika-parsers-standard\tika-parsers-standard-modules\tika-parser-pkg-module\src\test\resources\test-documents\droste.zip
> contains Archbomb.ZIP trojan.
> {code}
> See the attached screenshots.
>  
> Is this a real threat in the repo or false alarm? Could you please do a 
> security scan?




[jira] [Commented] (TIKA-3554) Detect plain text file as application/zip based on file ext wrong

2021-09-15 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17415460#comment-17415460
 ] 

Nick Burch commented on TIKA-3554:
--

If you want Apache Tika to do detection only on the file contents (magic + 
detectors if present), call the detector without including the filename

Where the rough type is known, Apache Tika will use the filename to specialise 
the type

Where no specific type is known, Apache Tika will use the filename as the 
sole guide, which is what is happening with your simple short text file
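
The three rules above can be modelled in a few lines. This is a deliberately simplified illustration and not Tika's implementation: the class name, the tiny magic and extension tables, and the single parent relation (a jar treated as a specialised zip) are assumptions chosen to keep the example short.

```java
import java.util.Locale;
import java.util.Map;

public class DetectionOrderSketch {
    // Illustrative extension table only; not Tika's real registry.
    private static final Map<String, String> BY_EXTENSION = Map.of(
            "zip", "application/zip",
            "jar", "application/java-archive",
            "gif", "image/gif",
            "txt", "text/plain");

    // Child type -> parent type; used to decide whether a name may specialise a content type.
    private static final Map<String, String> PARENT = Map.of(
            "application/java-archive", "application/zip");

    /** Content-only detection; returns null when no specific type is known. */
    public static String byContent(byte[] head) {
        if (startsWith(head, 0x50, 0x4B, 0x03, 0x04)) {
            return "application/zip";
        }
        if (startsWith(head, 'G', 'I', 'F', '8')) {
            return "image/gif";
        }
        return null;
    }

    /** Filename-only detection; returns null for a missing name or unknown extension. */
    public static String byName(String fileName) {
        if (fileName == null) {
            return null;
        }
        int dot = fileName.lastIndexOf('.');
        if (dot < 0) {
            return null;
        }
        return BY_EXTENSION.get(fileName.substring(dot + 1).toLowerCase(Locale.ROOT));
    }

    public static String detect(byte[] head, String fileName) {
        String content = byContent(head);
        String name = byName(fileName);
        if (content == null) {
            // No specific type known from the bytes: the filename is the sole guide.
            return name != null ? name : "application/octet-stream";
        }
        // Rough type known: the filename may only specialise it, never override it.
        if (name != null && content.equals(PARENT.get(name))) {
            return name;
        }
        return content;
    }

    private static boolean startsWith(byte[] data, int... magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if ((data[i] & 0xFF) != magic[i]) {
                return false;
            }
        }
        return true;
    }
}
```

In this model a short text file named `hello.txt.zip` comes back as `application/zip` because the bytes say nothing specific, while a GIF named `something.gif.zip` stays `image/gif`. Passing `null` as the filename corresponds to detecting on contents only.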

> Detect plain text file as application/zip based on file ext wrong
> -
>
> Key: TIKA-3554
> URL: https://issues.apache.org/jira/browse/TIKA-3554
> Project: Tika
>  Issue Type: Bug
>  Components: detector, metadata, mime
>Affects Versions: 1.26
>Reporter: Krisztián Gyula Tóth
>Priority: Major
>  Labels: mime-type
> Attachments: image-2021-09-15-10-33-33-560.png
>
>
> *Given* a simple plain text file with the file extension `.zip` and with 
> content `Hello World!`. Example file name: "hello.txt.zip"
> *When* calling the function `tika.detect()` with the file bytes from an 
> `InputStream` using `BufferedInputStream`
> {code:java}
> String detectedMimeType = tika.detect(bytes.get(), fileItem.getName());
> {code}
> *Then* it returns `application/zip` as for the detected MimeType. (Regardless 
> the file's content is in plain text, only the file extension contains the 
> `.zip`.)
>  
> Note: The result is the same for a file with HTML content that also has 
> `.zip` as its file ext. It’s not a super rare file type that’s hard to 
> detect. So I’d say it’s a bug in Tika.
>  
> *Expected behavior*
> Tika should detect the provided file as a plain text file and return 
> `text/plain` for the detected mime type regardless of the file extension 
> being `.zip`.
>  
> *Suggested solution:*
> Check file signature further on the file extension in case the file ext is 
> `.zip`
> To ensure that the uploaded file is really a zip archive, it should have a 
> matching file signature with one of the following:
>  * 50 4B 03 04
>  * 50 4B 05 06 (empty archive)
>  * 50 4B 07 08 (spanned archive)
> See magic numbers at Wiki page for ZIP file format: 
> https://en.wikipedia.org/wiki/ZIP_(file_format)
>  
> *Background info:*
> We are using the `Tika.detect()` to detect the file's mime type on uploading 
> to the server in a Java servlet before saving it for further processing. To 
> ensure that the client-provided file has the expected mime type and accepts 
> only that type of file. In this context, we are working with `ZIP` archives. 
> Users are only allowed to upload zip archives. But it turned out that Tika 
> cannot detect plain text files and still recognizes them as ZIP archives if 
> the file extension is given as `.zip`.
> Although there are newer versions of Apache Tika than the 1.26 we are 
> currently using, this is still an issue in the newer versions.
>  
> *How do I investigate this:*
>  # A valid zip archive with filename `archive.zip.txt` where the file 
> extension is `.txt`
>  ** Expectation: Tika should detect the file mime type as `application/zip`
>  ** Result: Provides the expected result. A valid zip archive, but with 
> having the file `.txt` file extension in its name is still detected as 
> `application/zip` successfully.
>  # A valid zip archive with filename, but without the `.zip` file extension.
>  ** Expectation: Tika should detect the file mime type as `application/zip`
>  ** Result: Provides the expected result. A valid zip archive, but without 
> having the file `.zip` file extension in its name is still detected as 
> `application/zip` successfully.
>  #  A common GIF file, but with `.zip` file extension `something.gif.zip`
>  ** Expectation: Tika should detect the file mime type as `image/gif`
>  ** Result: Provides the expected result. A GIF image with the `.zip` file 
> extension can still be detected as `image/gif`
>  # Any plain text file (can be `HTML` doc or `TEXT`) with filename 
> `myText.zip` where the file extension is `.zip`
>  ** Expectation: Tika should detect the file mime type as 
> `application/octet-stream` in general or `text/plain` or `text/html` 
> depending on the file's content.
>  ** Result: Tika `detect()` **fails**! Detects it as `application/zip`.
>  # Any plain text file (can be `HTML` doc or plain `TEXT`) with filename, but 
> without the file extension.
>  ** Expectation: Tika should detect the file mime type as 
> `application/octet-stream` in general or `text/plain` or `text/html` 
> depending on the file's content.
>  ** Result: Provides the expected result. Detects it as 
> `application/octet-stream`. (So to

[jira] [Commented] (TIKA-3555) Eset antivirus found threat in the GitHub repo after Git clone

2021-09-15 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17415439#comment-17415439
 ] 

Nick Burch commented on TIKA-3555:
--

See TIKA-259

This file will make an underpowered computer unhappy if you try to unpack it, 
but this is a safe example of a class of problematic files used to test that 
Tika is correctly handling this kind of issue safely. It isn't a trojan





[jira] [Commented] (TIKA-3544) Extraction of long sequences of digits from Excel spreadsheets using Tika 1.20 doesn’t yield the expected results

2021-09-08 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17411814#comment-17411814
 ] 

Nick Burch commented on TIKA-3544:
--

Apache POI provides the DataFormatter class which attempts to turn the number 
into a string similar to the one shown in Excel, based on the formatting rules 
applied to the cell. That ought to be being used by Tika. Doesn't help 
completely if Excel has thrown away the last few digits though...
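
As a rough illustration of the difference between a raw number-to-string conversion and a format-aware one, here is a stdlib-only sketch. It does not call POI; the class name is hypothetical and the `DecimalFormat` pattern `"0"` merely stands in for a plain integer cell format.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

public class LongNumberFormatting {
    /** Raw conversion: Java uses scientific notation for doubles of magnitude 10^7 and above. */
    public static String raw(double cellValue) {
        return Double.toString(cellValue);
    }

    /** Format-aware rendering: apply a plain integer pattern to the same value. */
    public static String formatted(double cellValue) {
        DecimalFormat plain = new DecimalFormat("0", DecimalFormatSymbols.getInstance(Locale.ROOT));
        return plain.format(cellValue);
    }
}
```

For a 16-digit value such as 6011799905775830, `raw` yields an `E15` form while `formatted` yields the digits as typed. Neither can recover digits that were already lost when the number was stored.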

> Extraction of long sequences of digits from Excel spreadsheets using Tika 
> 1.20 doesn’t yield the expected results
> -
>
> Key: TIKA-3544
> URL: https://issues.apache.org/jira/browse/TIKA-3544
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.20
>Reporter: Jitin Jindal
>Priority: Major
> Attachments: Credit Card Numbers.xlsx
>
>
> If an Excel spreadsheet contains a long sequence of digits, such as a credit 
> card number, Tika 1.13 will emit the said sequence in scientific notation.
> For example, the credit card number “6011799905775830” is extracted from the 
> attached spreadsheet as 6.480195344642784E15, which clearly is not the 
> desired output.
> I think the impact of this issue is significant. There’s plenty of 
> information that can no longer be reliably extracted from spreadsheets. Think 
> credit card numbers, telephone numbers and product identifiers to name a few.




[jira] [Commented] (TIKA-3544) Extraction of long sequences of digits from Excel spreadsheets using Tika 1.20 doesn’t yield the expected results

2021-09-08 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17411774#comment-17411774
 ] 

Nick Burch commented on TIKA-3544:
--

You need to be aware that Excel itself only stores numbers-as-numbers with a 
certain amount of precision (~15 digits). Any very long number will always 
risk losing data and precision if stored as a number in Excel. You need to 
store those as strings (eg with a ' prefix) to avoid data loss

See 
[https://www.microsoft.com/en-us/microsoft-365/blog/2008/04/10/understanding-floating-point-precision-aka-why-does-excel-give-me-seemingly-wrong-answers/]
 for more info on this from Microsoft that you may wish to share with the 
people generating your spreadsheets with the risk of data loss
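
A small sketch of what a 15-significant-digit limit does to a long identifier. It models the limit with `BigDecimal`; the class name is hypothetical, and truncating (rather than rounding) the extra digits is an assumption about Excel's behaviour.

```java
import java.math.BigDecimal;
import java.math.MathContext;
import java.math.RoundingMode;

public class ExcelPrecisionSketch {
    // ~15 significant digits; RoundingMode.DOWN (truncation) is an assumption.
    private static final MathContext FIFTEEN_DIGITS = new MathContext(15, RoundingMode.DOWN);

    /** The digits that would remain if the typed value were kept as a number. */
    public static String asStoredNumber(String typedDigits) {
        return new BigDecimal(typedDigits).round(FIFTEEN_DIGITS).toPlainString();
    }

    /** True when nothing is lost by the 15-digit limit. */
    public static boolean survivesAsNumber(String typedDigits) {
        BigDecimal typed = new BigDecimal(typedDigits);
        return typed.compareTo(new BigDecimal(asStoredNumber(typedDigits))) == 0;
    }
}
```

A 16-digit value only survives when its last digit happens to be zero, which is why identifiers like card numbers are safer stored as text.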





[jira] [Commented] (TIKA-3534) Latest Android Studio will fail building Android project with Tika Core 2.0.0 included - issues with MethodHandle API usage

2021-08-22 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17402788#comment-17402788
 ] 

Nick Burch commented on TIKA-3534:
--

This class is used by the bits of Apache Tika (mostly parsers) that use Java 
NIO byte buffers, to work around limitations in the core Java library for 
releasing them when no longer required. It is required and will not be removed

However, if you aren't using the Parser classes on Android, you are fine to 
exclude the class when you package your application. If you are using the 
Parsers, you should look into how to replicate this functionality on Android

> Latest Android Studio will fail building Android project with Tika Core 2.0.0 
> included - issues with MethodHandle API usage
> ---
>
> Key: TIKA-3534
> URL: https://issues.apache.org/jira/browse/TIKA-3534
> Project: Tika
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.0.0
>Reporter: Andrei Dobrescu
>Priority: Major
>
> I use Tika Core on top of my Android projects in order to detect mime types 
> of files.
> Recently, build started to fail with this error:
> {code:java}
> com.android.tools.r8.internal.m1: MethodHandle.invoke and 
> MethodHandle.invokeExact are only supported starting with Android O 
> (--min-api 26)
> {code}
> MethodHandle API was included in Android Oreo. I could set minimum SDK 
> version to Oreo, but I still have users with Android L,M,N (API level 21 to 
> 25).
> By exploring Tika Core source code, I observed that the MethodHandle API is 
> only used in this class: 
> [org.apache.tika.io.MappedBufferCleaner|https://github.com/apache/tika/blob/main/tika-core/src/main/java/org/apache/tika/io/MappedBufferCleaner.java]
>  This class doesn't seem to be used anywhere else in the project. It also 
> contains some huge hacks, with sun.misc.Unsafe (which is also unavailable on 
> Android).
>  
> What is the purpose of this class? Why is it here?
> Why is not used anywhere else inside the project?
> Can you please remove this class on the next Tika release?
>  
> Dirty working workaround:
> In theory, one can use JarJar gradle plugin to modify contents of the 
> imported Tika Core jar dependency:
> {code:java}
> buildscript {
>     dependencies {
>         classpath 'org.anarres.jarjar:jarjar-gradle:1.0.1'
>     }
> }
> dependencies {
>     implementation jarjar.repackage {
>         from 'org.apache.tika:tika-core:2.0.0'
>         classDelete 'org.apache.tika.io.MappedBufferCleaner'
>     }
> }
> However, JarJar gradle plugin is a bit outdated and I couldn't make it work. 
> An alternative would be to tell gradle to download the jar file, create a 
> modified jar file that includes all of the original jar's contents, excluding 
> MappedBufferCleaner, then import the modified jar file:
> {code:java}
> task tikaAndroidJar(type: Zip) {
>     if (!buildDir.exists())
>         buildDir.mkdir()
>     def originalJarFile = new File("$buildDir/tika-core-original.jar")
>     def originalJarUrl = 'https://repo1.maven.org/maven2/org/apache/tika/tika-core/2.0.0/tika-core-2.0.0.jar'
>     new URL(originalJarUrl).withInputStream { i -> originalJarFile.withOutputStream { o -> o << i } }
>     from zipTree(originalJarFile)
>     include '**/*.class'
>     exclude 'org/apache/tika/io/MappedBufferCleaner.class'
>     exclude 'org/apache/tika/io/MappedBufferCleaner$BufferCleaner.class'
>     archiveName 'tika-core-modified.jar'
>     destinationDir(file("$buildDir/"))
> }
> dependencies {
>     implementation files("$buildDir/tika-core-modified.jar") {
>         builtBy "tikaAndroidJar"
>     }
> }
> Still, it's an ugly solution. The library works fine without 
> MappedBufferCleaner class, the project builds and at runtime it can detect 
> mime types of files.



