[jira] [Commented] (TIKA-1518) Docker with Tika Server

2018-03-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390381#comment-16390381
 ] 

Hudson commented on TIKA-1518:
--

UNSTABLE: Integrated in Jenkins build tika-2.x-windows #216 (See 
[https://builds.apache.org/job/tika-2.x-windows/216/])
TIKA-1518 -- turn dockerfile-maven-plugin back on.  Accidentally (tallison: rev 
ca19696657cca2ec83160f9a16cbb36bfc35cde6)
* (edit) tika-server/pom.xml


> Docker with Tika Server
> ---
>
> Key: TIKA-1518
> URL: https://issues.apache.org/jira/browse/TIKA-1518
> Project: Tika
> Issue Type: New Feature
> Reporter: Paul Ramirez
> Assignee: Dave Meikle
> Priority: Major
> Fix For: 2.0, 1.17
>
> Attachments: tika-server-docker-err-msg.txt
>
>
> This version should be able to demonstrate as many of Apache Tika's 
> capabilities as possible. For instance with GDAL, Tesseract, and FFmpeg to 
> show parsers which require installation of other dependencies. In addition, 
> this should help move TIKA-1301 forward and should leverage the suggestion 
> made by [~lewismc] of a script which can pull down the latest version of 
> Apache Tika.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TIKA-2594) Mail detected as application/xhtml+xml

2018-03-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390350#comment-16390350
 ] 

Hudson commented on TIKA-2594:
--

SUCCESS: Integrated in Jenkins build tika-branch-1x #6 (See 
[https://builds.apache.org/job/tika-branch-1x/6/])
TIKA-2594 improve eml detection via Luis Filipe Nassif (tallison: 
[https://github.com/apache/tika/commit/e12117c0e4792404eca825df0d2ae9925f0d5d18])
* (edit) tika-server/pom.xml
* (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml


> Mail detected as application/xhtml+xml
> --
>
> Key: TIKA-2594
> URL: https://issues.apache.org/jira/browse/TIKA-2594
> Project: Tika
> Issue Type: Bug
> Affects Versions: 2.0, 1.16, 1.17
> Reporter: Andreas Meier
> Priority: Major
> Fix For: 1.18, 2.0.0
>
> Attachments: TestMail_inline_xhtml_plus_image.eml
>
>
> The attached mail (message/rfc822) with inline xhtml is recognized as 
> application/xhtml+xml
> Regards
> Andreas





[jira] [Commented] (TIKA-1518) Docker with Tika Server

2018-03-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390351#comment-16390351
 ] 

Hudson commented on TIKA-1518:
--

SUCCESS: Integrated in Jenkins build tika-branch-1x #6 (See 
[https://builds.apache.org/job/tika-branch-1x/6/])
TIKA-1518: Detach docker file build from build phase in Maven execution (david: 
[https://github.com/apache/tika/commit/42aa774f1e1d232ee9f98b58ace9f0417231716b])
* (edit) tika-server/pom.xml


> Docker with Tika Server
> ---
>
> Key: TIKA-1518
> URL: https://issues.apache.org/jira/browse/TIKA-1518
> Project: Tika
> Issue Type: New Feature
> Reporter: Paul Ramirez
> Assignee: Dave Meikle
> Priority: Major
> Fix For: 2.0, 1.17
>
> Attachments: tika-server-docker-err-msg.txt
>
>
> This version should be able to demonstrate as many of Apache Tika's 
> capabilities as possible. For instance with GDAL, Tesseract, and FFmpeg to 
> show parsers which require installation of other dependencies. In addition, 
> this should help move TIKA-1301 forward and should leverage the suggestion 
> made by [~lewismc] of a script which can pull down the latest version of 
> Apache Tika.





[jira] [Commented] (TIKA-2590) ExcelExtractor: cannot choose listening to the selected records only

2018-03-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390348#comment-16390348
 ] 

Hudson commented on TIKA-2590:
--

SUCCESS: Integrated in Jenkins build tika-branch-1x #6 (See 
[https://builds.apache.org/job/tika-branch-1x/6/])
TIKA-2590 update Changes.txt (tallison: 
[https://github.com/apache/tika/commit/c566cc472a4c9daf1e99fb80de9df2390b342350])
* (edit) CHANGES.txt


> ExcelExtractor: cannot choose listening to the selected records only
> 
>
> Key: TIKA-2590
> URL: https://issues.apache.org/jira/browse/TIKA-2590
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.17
> Reporter: Grigoriy Alekseev
> Priority: Critical
> Fix For: 1.18, 2.0.0
>
>
> The listenForAllRecords argument is always reset to 'true', so the 
> 'else' branch is never reached. This may cause incorrect text extraction when 
> records of certain unsupported types (e.g. SharedFormula) are present in a 
> file.
> {code:java}
> public void processFile(DirectoryNode root, boolean listenForAllRecords)
>         throws IOException, SAXException, TikaException {
>     // Set up listener and register the records we want to process
>     HSSFRequest hssfRequest = new HSSFRequest();
>     listenForAllRecords = true;
>     if (listenForAllRecords) {
>         hssfRequest.addListenerForAllRecords(formatListener);
>     } else {
>         hssfRequest.addListener(formatListener, BOFRecord.sid);
>         hssfRequest.addListener(formatListener, EOFRecord.sid);
>         hssfRequest.addListener(formatListener, DateWindow1904Record.sid);
>         hssfRequest.addListener(formatListener, CountryRecord.sid);
>         hssfRequest.addListener(formatListener, BoundSheetRecord.sid);
>         hssfRequest.addListener(formatListener, SSTRecord.sid);
>         hssfRequest.addListener(formatListener, FormulaRecord.sid);
>         hssfRequest.addListener(formatListener, LabelRecord.sid);
>         hssfRequest.addListener(formatListener, LabelSSTRecord.sid);
>         hssfRequest.addListener(formatListener, NumberRecord.sid);
>         hssfRequest.addListener(formatListener, RKRecord.sid);
>         hssfRequest.addListener(formatListener, StringRecord.sid);
>         hssfRequest.addListener(formatListener, HyperlinkRecord.sid);
>         hssfRequest.addListener(formatListener, TextObjectRecord.sid);
>         hssfRequest.addListener(formatListener, SeriesTextRecord.sid);
>         hssfRequest.addListener(formatListener, FormatRecord.sid);
>         hssfRequest.addListener(formatListener, ExtendedFormatRecord.sid);
>         hssfRequest.addListener(formatListener, DrawingGroupRecord.sid);
>         if (extractor.officeParserConfig.getIncludeHeadersAndFooters()) {
>             hssfRequest.addListener(formatListener, HeaderRecord.sid);
>             hssfRequest.addListener(formatListener, FooterRecord.sid);
>         }
>     }
> {code}
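
The effect described in the report can be reproduced without POI. The following is a minimal sketch, not the Tika code: the class name ListenerSelectionDemo, the methods buggy() and fixed(), and the use of plain strings in place of record ids are all assumptions made for illustration. The fixed() variant simply honours the argument; it is one possible correction and is not taken from the actual commit.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for the listener registration in the quoted method.
// Hypothetical names throughout; strings replace the POI record ids.
class ListenerSelectionDemo {

    // Mirrors the reported behaviour: the parameter is overwritten before the check.
    static List<String> buggy(boolean listenForAllRecords) {
        List<String> registered = new ArrayList<>();
        listenForAllRecords = true; // the caller's choice is discarded here
        if (listenForAllRecords) {
            registered.add("ALL");
        } else {
            // never reached while the assignment above is present
            registered.add("BOFRecord");
            registered.add("EOFRecord");
        }
        return registered;
    }

    // Assumed correction: use the argument as passed by the caller.
    static List<String> fixed(boolean listenForAllRecords) {
        List<String> registered = new ArrayList<>();
        if (listenForAllRecords) {
            registered.add("ALL");
        } else {
            registered.add("BOFRecord");
            registered.add("EOFRecord");
        }
        return registered;
    }

    public static void main(String[] args) {
        System.out.println(buggy(false)); // prints [ALL]
        System.out.println(fixed(false)); // prints [BOFRecord, EOFRecord]
    }
}
```

With the reassignment in place, a caller asking for selected records only still gets the listen-to-everything path, which matches the symptom in the report.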





[jira] [Commented] (TIKA-2527) Typos in tika-mimetypes.xml

2018-03-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390349#comment-16390349
 ] 

Hudson commented on TIKA-2527:
--

SUCCESS: Integrated in Jenkins build tika-branch-1x #6 (See 
[https://builds.apache.org/job/tika-branch-1x/6/])
TIKA-2527 -- Various new mimes and typo fixes in tika-mimetypes.xml via 
(tallison: 
[https://github.com/apache/tika/commit/33f756fa4581ae3d1643ea7299121139a5c1bc6d])
* (edit) CHANGES.txt
* (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml


> Typos in tika-mimetypes.xml
> ---
>
> Key: TIKA-2527
> URL: https://issues.apache.org/jira/browse/TIKA-2527
> Project: Tika
> Issue Type: Bug
> Components: core
> Affects Versions: 2.0, 1.16, 1.17, 1.18
> Environment: ALL
> Reporter: Andreas Meier
> Priority: Minor
> Fix For: 1.18, 2.0.0
>
> Attachments: enhancement-for-TIKA2527-contributed-by-AMeier.patch, 
> fix-for-TIKA2527-contributed-by-AMeier-Fixed-adpcmmi.patch, 
> fix-for-binhexmatch-TIKA2527-contributed-by-AMeier.patch
>
>
> Are these mimetypes in tika-mimetypes.xml
> audio/x-adbcm instead of audio/x-adpcm
> {code:xml} {code}
> and
> audio/x-dec-adbcm instead of audio/x-dec-adpcm
> {code:xml} {code}
> intended?
> Couldn't find these mimetypes.
> Regards
> Andreas





[jira] [Commented] (TIKA-1518) Docker with Tika Server

2018-03-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390335#comment-16390335
 ] 

Hudson commented on TIKA-1518:
--

SUCCESS: Integrated in Jenkins build Tika-trunk #1454 (See 
[https://builds.apache.org/job/Tika-trunk/1454/])
TIKA-1518: Detach docker file build from build phase in Maven execution (david: 
[https://github.com/apache/tika/commit/deb9e96f29d3a322804016d4533bb76de7c40e2c])
* (edit) tika-server/pom.xml
TIKA-1518 -- turn dockerfile-maven-plugin back on.  Accidentally (tallison: 
[https://github.com/apache/tika/commit/ca19696657cca2ec83160f9a16cbb36bfc35cde6])
* (edit) tika-server/pom.xml


> Docker with Tika Server
> ---
>
> Key: TIKA-1518
> URL: https://issues.apache.org/jira/browse/TIKA-1518
> Project: Tika
> Issue Type: New Feature
> Reporter: Paul Ramirez
> Assignee: Dave Meikle
> Priority: Major
> Fix For: 2.0, 1.17
>
> Attachments: tika-server-docker-err-msg.txt
>
>
> This version should be able to demonstrate as many of Apache Tika's 
> capabilities as possible. For instance with GDAL, Tesseract, and FFmpeg to 
> show parsers which require installation of other dependencies. In addition, 
> this should help move TIKA-1301 forward and should leverage the suggestion 
> made by [~lewismc] of a script which can pull down the latest version of 
> Apache Tika.





[jira] [Commented] (TIKA-2527) Typos in tika-mimetypes.xml

2018-03-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390333#comment-16390333
 ] 

Hudson commented on TIKA-2527:
--

SUCCESS: Integrated in Jenkins build Tika-trunk #1454 (See 
[https://builds.apache.org/job/Tika-trunk/1454/])
TIKA-2527 -- Various new mimes and typo fixes in tika-mimetypes.xml via 
(tallison: 
[https://github.com/apache/tika/commit/9b7154cf37871f5ef0874e972ec9208538e15e44])
* (edit) CHANGES.txt
* (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml


> Typos in tika-mimetypes.xml
> ---
>
> Key: TIKA-2527
> URL: https://issues.apache.org/jira/browse/TIKA-2527
> Project: Tika
> Issue Type: Bug
> Components: core
> Affects Versions: 2.0, 1.16, 1.17, 1.18
> Environment: ALL
> Reporter: Andreas Meier
> Priority: Minor
> Fix For: 1.18, 2.0.0
>
> Attachments: enhancement-for-TIKA2527-contributed-by-AMeier.patch, 
> fix-for-TIKA2527-contributed-by-AMeier-Fixed-adpcmmi.patch, 
> fix-for-binhexmatch-TIKA2527-contributed-by-AMeier.patch
>
>
> Are these mimetypes in tika-mimetypes.xml
> audio/x-adbcm instead of audio/x-adpcm
> {code:xml} {code}
> and
> audio/x-dec-adbcm instead of audio/x-dec-adpcm
> {code:xml} {code}
> intended?
> Couldn't find these mimetypes.
> Regards
> Andreas





[jira] [Commented] (TIKA-2594) Mail detected as application/xhtml+xml

2018-03-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390334#comment-16390334
 ] 

Hudson commented on TIKA-2594:
--

SUCCESS: Integrated in Jenkins build Tika-trunk #1454 (See 
[https://builds.apache.org/job/Tika-trunk/1454/])
TIKA-2594 improve eml detection via Luis Filipe Nassif (tallison: 
[https://github.com/apache/tika/commit/9c0a822419797f20a09388ccd235c7e70db9])
* (edit) tika-server/pom.xml
* (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml


> Mail detected as application/xhtml+xml
> --
>
> Key: TIKA-2594
> URL: https://issues.apache.org/jira/browse/TIKA-2594
> Project: Tika
> Issue Type: Bug
> Affects Versions: 2.0, 1.16, 1.17
> Reporter: Andreas Meier
> Priority: Major
> Fix For: 1.18, 2.0.0
>
> Attachments: TestMail_inline_xhtml_plus_image.eml
>
>
> The attached mail (message/rfc822) with inline xhtml is recognized as 
> application/xhtml+xml
> Regards
> Andreas





[jira] [Commented] (TIKA-1518) Docker with Tika Server

2018-03-07 Thread Dave Meikle (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390317#comment-16390317
 ] 

Dave Meikle commented on TIKA-1518:
---

[~talli...@mitre.org] - ah it looks like the proxy settings aren't being passed 
into the Docker container.

Normally I've passed proxy settings via buildArgs to docker but I am not sure 
how this is handled by the Maven plugin.  I've not done docker behind a proxy 
for a while.

Can you try -X on the maven command to see what is being set?

> Docker with Tika Server
> ---
>
> Key: TIKA-1518
> URL: https://issues.apache.org/jira/browse/TIKA-1518
> Project: Tika
> Issue Type: New Feature
> Reporter: Paul Ramirez
> Assignee: Dave Meikle
> Priority: Major
> Fix For: 2.0, 1.17
>
> Attachments: tika-server-docker-err-msg.txt
>
>
> This version should be able to demonstrate as many of Apache Tika's 
> capabilities as possible. For instance with GDAL, Tesseract, and FFmpeg to 
> show parsers which require installation of other dependencies. In addition, 
> this should help move TIKA-1301 forward and should leverage the suggestion 
> made by [~lewismc] of a script which can pull down the latest version of 
> Apache Tika.





[jira] [Commented] (TIKA-2594) Mail detected as application/xhtml+xml

2018-03-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390308#comment-16390308
 ] 

Hudson commented on TIKA-2594:
--

UNSTABLE: Integrated in Jenkins build tika-2.x-windows #215 (See 
[https://builds.apache.org/job/tika-2.x-windows/215/])
TIKA-2594 improve eml detection via Luis Filipe Nassif (tallison: rev 
9c0a822419797f20a09388ccd235c7e70db9)
* (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
* (edit) tika-server/pom.xml


> Mail detected as application/xhtml+xml
> --
>
> Key: TIKA-2594
> URL: https://issues.apache.org/jira/browse/TIKA-2594
> Project: Tika
> Issue Type: Bug
> Affects Versions: 2.0, 1.16, 1.17
> Reporter: Andreas Meier
> Priority: Major
> Fix For: 1.18, 2.0.0
>
> Attachments: TestMail_inline_xhtml_plus_image.eml
>
>
> The attached mail (message/rfc822) with inline xhtml is recognized as 
> application/xhtml+xml
> Regards
> Andreas





Re: Tika 1.18?

2018-03-07 Thread Chris Mattmann
Sounds good to me, thanks Tim. Happy to line it up with PDFBox 2.0.9.


On 3/7/18, 1:16 PM, "Allison, Timothy B."  wrote:

All,

  I think I've made the updates that I wanted to make sure got into 1.18.  
It looks like PDFBox is going to start their release cycle shortly.  Should we 
wait for PDFBox 2.0.9?

  That may add a week or two to our release, although, frankly, it might 
not.  We can start running the regression tests March 9(ish) and see if 
anything dire appears...

  Cheers,

  Tim






[jira] [Commented] (TIKA-1518) Docker with Tika Server

2018-03-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390309#comment-16390309
 ] 

Hudson commented on TIKA-1518:
--

UNSTABLE: Integrated in Jenkins build tika-2.x-windows #215 (See 
[https://builds.apache.org/job/tika-2.x-windows/215/])
TIKA-1518: Detach docker file build from build phase in Maven execution (david: 
rev deb9e96f29d3a322804016d4533bb76de7c40e2c)
* (edit) tika-server/pom.xml


> Docker with Tika Server
> ---
>
> Key: TIKA-1518
> URL: https://issues.apache.org/jira/browse/TIKA-1518
> Project: Tika
> Issue Type: New Feature
> Reporter: Paul Ramirez
> Assignee: Dave Meikle
> Priority: Major
> Fix For: 2.0, 1.17
>
> Attachments: tika-server-docker-err-msg.txt
>
>
> This version should be able to demonstrate as many of Apache Tika's 
> capabilities as possible. For instance with GDAL, Tesseract, and FFmpeg to 
> show parsers which require installation of other dependencies. In addition, 
> this should help move TIKA-1301 forward and should leverage the suggestion 
> made by [~lewismc] of a script which can pull down the latest version of 
> Apache Tika.





[jira] [Commented] (TIKA-2527) Typos in tika-mimetypes.xml

2018-03-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390307#comment-16390307
 ] 

Hudson commented on TIKA-2527:
--

UNSTABLE: Integrated in Jenkins build tika-2.x-windows #215 (See 
[https://builds.apache.org/job/tika-2.x-windows/215/])
TIKA-2527 -- Various new mimes and typo fixes in tika-mimetypes.xml via 
(tallison: rev 9b7154cf37871f5ef0874e972ec9208538e15e44)
* (edit) CHANGES.txt
* (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml


> Typos in tika-mimetypes.xml
> ---
>
> Key: TIKA-2527
> URL: https://issues.apache.org/jira/browse/TIKA-2527
> Project: Tika
> Issue Type: Bug
> Components: core
> Affects Versions: 2.0, 1.16, 1.17, 1.18
> Environment: ALL
> Reporter: Andreas Meier
> Priority: Minor
> Fix For: 1.18, 2.0.0
>
> Attachments: enhancement-for-TIKA2527-contributed-by-AMeier.patch, 
> fix-for-TIKA2527-contributed-by-AMeier-Fixed-adpcmmi.patch, 
> fix-for-binhexmatch-TIKA2527-contributed-by-AMeier.patch
>
>
> Are these mimetypes in tika-mimetypes.xml
> audio/x-adbcm instead of audio/x-adpcm
> {code:xml} {code}
> and
> audio/x-dec-adbcm instead of audio/x-dec-adpcm
> {code:xml} {code}
> intended?
> Couldn't find these mimetypes.
> Regards
> Andreas





[jira] [Comment Edited] (TIKA-1518) Docker with Tika Server

2018-03-07 Thread Dave Meikle (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390284#comment-16390284
 ] 

Dave Meikle edited comment on TIKA-1518 at 3/7/18 9:41 PM:
---

It is a choice we have to make. There are three main routes to Docker 
packaging that I have used:
 # Automated builds that pull in pre-packaged artifacts, which then get bundled 
into an image on any change in the repository - like what we are doing in the 
docker-tikaserver approach, where it goes and downloads the signed JARs
 # Automated builds that compile the code in the image (e.g. using the maven 
Docker image) and then package them
 # Building a release image and then distributing that - which is what this 
does but requires us to decide when an official release is available and push 
it somewhere

The first and second are really good for leveraging things like Docker Hub to 
automatically build from your repository, whereas the third means you have to 
have Docker on your machine when you want to build an image.

I have never really liked number two, as it means the builds always recompile 
the code each time a change is triggered, so you can easily be packaging up 
different code as the same version without realising it.

The challenge with the approach in docker-tikaserver is maintenance when assets 
that are being pulled in move - i.e. when a release JAR is moved from 
dist.apache.org - but that could easily be solved by going to Nexus for the 
JARs based on the release packages.

I personally quite like the third approach as it means you explicitly create an 
image that has its own life, and I was thinking that we could potentially add 
this to the release process, pushing the image from the release build to Docker 
Hub/Nexus/another repo so it is an official build.  So just like when we do a 
mvn release, we can go to tika-server and do a mvn dockerfile:build and, if 
happy, mvn dockerfile:push (once we bottom out where).

Not sure what others think?


was (Author: davemeikle):
It is a choice we have to make. There are three main routes to Docker 
packaging that I have used:
 # Automated builds that pull in pre-packaged artifacts, which then get bundled 
into an image on any change in the repository - like what we are doing in the 
docker-tikaserver approach, where it goes and downloads the signed JARs
 # Automated builds that compile the code in the image (e.g. using the maven 
Docker image) and then package them
 # Building a release image and then distributing that - which is what this 
does but requires us to decide when an official release is available and push 
it somewhere

The first and second are really good for leveraging things like Docker Hub to 
automatically build from your repository, whereas the third means you have to 
have Docker on your machine when you want to build an image.

I have never really liked number two, as it means the builds always recompile 
the code each time a change is triggered, so you can easily be packaging up 
different code as the same version without realising it.

The challenge with the approach in docker-tikaserver is maintenance when assets 
that are being pulled in move - i.e. when a release JAR is moved from 
dist.apache.org - but that could easily be solved by going to Nexus for the 
JARs based on the release packages.

I personally quite like the third approach as it means you explicitly create an 
image that has its own life, and I was thinking that we could potentially add 
this to the release process, pushing the image from the release build to Docker 
Hub/Nexus/another repo so it is an official build.

Not sure what others think?

> Docker with Tika Server
> ---
>
> Key: TIKA-1518
> URL: https://issues.apache.org/jira/browse/TIKA-1518
> Project: Tika
> Issue Type: New Feature
> Reporter: Paul Ramirez
> Assignee: Dave Meikle
> Priority: Major
> Fix For: 2.0, 1.17
>
> Attachments: tika-server-docker-err-msg.txt
>
>
> This version should be able to demonstrate as many of Apache Tika's 
> capabilities as possible. For instance with GDAL, Tesseract, and FFmpeg to 
> show parsers which require installation of other dependencies. In addition, 
> this should help move TIKA-1301 forward and should leverage the suggestion 
> made by [~lewismc] of a script which can pull down the latest version of 
> Apache Tika.





[jira] [Commented] (TIKA-1518) Docker with Tika Server

2018-03-07 Thread Dave Meikle (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390284#comment-16390284
 ] 

Dave Meikle commented on TIKA-1518:
---

It is a choice we have to make. There are three main routes to Docker 
packaging that I have used:
 # Automated builds that pull in pre-packaged artifacts, which then get bundled 
into an image on any change in the repository - like what we are doing in the 
docker-tikaserver approach, where it goes and downloads the signed JARs
 # Automated builds that compile the code in the image (e.g. using the maven 
Docker image) and then package them
 # Building a release image and then distributing that - which is what this 
does but requires us to decide when an official release is available and push 
it somewhere

The first and second are really good for leveraging things like Docker Hub to 
automatically build from your repository, whereas the third means you have to 
have Docker on your machine when you want to build an image.

I have never really liked number two, as it means the builds always recompile 
the code each time a change is triggered, so you can easily be packaging up 
different code as the same version without realising it.

The challenge with the approach in docker-tikaserver is maintenance when assets 
that are being pulled in move - i.e. when a release JAR is moved from 
dist.apache.org - but that could easily be solved by going to Nexus for the 
JARs based on the release packages.

I personally quite like the third approach as it means you explicitly create an 
image that has its own life, and I was thinking that we could potentially add 
this to the release process, pushing the image from the release build to Docker 
Hub/Nexus/another repo so it is an official build.

Not sure what others think?

> Docker with Tika Server
> ---
>
> Key: TIKA-1518
> URL: https://issues.apache.org/jira/browse/TIKA-1518
> Project: Tika
> Issue Type: New Feature
> Reporter: Paul Ramirez
> Assignee: Dave Meikle
> Priority: Major
> Fix For: 2.0, 1.17
>
> Attachments: tika-server-docker-err-msg.txt
>
>
> This version should be able to demonstrate as many of Apache Tika's 
> capabilities as possible. For instance with GDAL, Tesseract, and FFmpeg to 
> show parsers which require installation of other dependencies. In addition, 
> this should help move TIKA-1301 forward and should leverage the suggestion 
> made by [~lewismc] of a script which can pull down the latest version of 
> Apache Tika.





[jira] [Commented] (TIKA-2592) HTML with charset unicode handled as utf-16 instead utf-8

2018-03-07 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390282#comment-16390282
 ] 

Tim Allison commented on TIKA-2592:
---

{quote}I already have a small testset I run tika against (~300k+ files), that 
is also the reason for the numerous tickets I created lately.
{quote}
Great, and thank you!
{quote}Too many people and nightly builds stressing one vm may be too much.
{quote}
As long as you aren't active during release cycles, we won't stress it much. :D

 

Finally, if you want to get involved with the tika-eval module and/or if you 
have any code you've found helpful in evaluating different runs or single runs, 
let us know! 

> HTML with charset unicode handled as utf-16 instead utf-8
> -
>
> Key: TIKA-2592
> URL: https://issues.apache.org/jira/browse/TIKA-2592
> Project: Tika
> Issue Type: Improvement
> Affects Versions: 2.0, 1.16, 1.17
> Reporter: Andreas Meier
> Priority: Minor
> Fix For: 1.18, 2.0.0
>
> Attachments: IANA Charset names.txt, 
> StandardCharsets_unsupported_by_IANA.txt, TestCharsetUnicodeHTML.html, 
> TestHTMLCharsetArabicCP1256.html, TestHTMLCharsetCP1256.html, 
> fix-for-TIKA2592-contributed-by-Andreas-Meier.patch
>
>
> HTML files are detected as utf-16 when meta content is set to "unicode".
> {code:XML}
> 
>  {code}
>  
> Shouldn't the default be utf-8?
> The attached sample file is shown correctly in:
> Chromium Version 55.0.2883.75
> Firefox 50.1.0
> IE 11
> I am aware that there is no charset "unicode" (available character encodings: 
> [http://www.iana.org/assignments/character-sets/character-sets.xhtml|http://www.iana.org/assignments/character-sets/character-sets.xhtml])
> Unfortunately there are many wrong encodings used out there.
> All unknown encodings should be validated or at least be set to default utf-8.
> Regards 
> Andreas
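
The fallback suggested in the description (treat unknown or unreliable charset labels as UTF-8) can be sketched with the JDK alone. This is an illustration under stated assumptions, not Tika's implementation: the class CharsetFallback and the method resolve() are hypothetical names, and mapping the label "unicode" to UTF-8 is simply the behaviour the reporter asks for. The UTF-16 result in the report may come from the JDK treating "unicode" as a UTF-16 alias, but that is an assumption and has not been checked against Tika's code.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;
import java.util.Locale;

// Hypothetical helper: resolve a declared charset label, defaulting to UTF-8.
class CharsetFallback {

    static Charset resolve(String label) {
        if (label == null) {
            return StandardCharsets.UTF_8;
        }
        String name = label.trim().toLowerCase(Locale.ROOT);
        // Assumed override, following the issue's suggestion for the "unicode" label.
        if (name.isEmpty() || name.equals("unicode")) {
            return StandardCharsets.UTF_8;
        }
        try {
            return Charset.forName(name);
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            // Malformed or unknown label: fall back instead of failing.
            return StandardCharsets.UTF_8;
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve("unicode"));       // prints UTF-8
        System.out.println(resolve("not-a-charset")); // prints UTF-8
        System.out.println(resolve("ISO-8859-1"));    // prints ISO-8859-1
    }
}
```

Labels the JDK does recognise pass through unchanged, so only the unknown and explicitly overridden cases are affected.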





[jira] [Comment Edited] (TIKA-1518) Docker with Tika Server

2018-03-07 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390275#comment-16390275
 ] 

Tim Allison edited comment on TIKA-1518 at 3/7/18 9:33 PM:
---

And sorry for letting the <\!-- --> slip through!!!


was (Author: talli...@mitre.org):
And sorry for letting the  slip through!!!

> Docker with Tika Server
> ---
>
> Key: TIKA-1518
> URL: https://issues.apache.org/jira/browse/TIKA-1518
> Project: Tika
> Issue Type: New Feature
> Reporter: Paul Ramirez
> Assignee: Dave Meikle
> Priority: Major
> Fix For: 2.0, 1.17
>
> Attachments: tika-server-docker-err-msg.txt
>
>
> This version should be able to demonstrate as many of Apache Tika's 
> capabilities as possible. For instance with GDAL, Tesseract, and FFmpeg to 
> show parsers which require installation of other dependencies. In addition, 
> this should help move TIKA-1301 forward and should leverage the suggestion 
> made by [~lewismc] of a script which can pull down the latest version of 
> Apache Tika.





[jira] [Commented] (TIKA-1518) Docker with Tika Server

2018-03-07 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390274#comment-16390274
 ] 

Tim Allison commented on TIKA-1518:
---

Your 
[commit|https://github.com/apache/tika/commit/deb9e96f29d3a322804016d4533bb76de7c40e2c#diff-332a9cfb880c4a30e2abc7af93035120]
 sure fixed it by turning it off.  :D  

> Docker with Tika Server
> ---
>
> Key: TIKA-1518
> URL: https://issues.apache.org/jira/browse/TIKA-1518
> Project: Tika
> Issue Type: New Feature
> Reporter: Paul Ramirez
> Assignee: Dave Meikle
> Priority: Major
> Fix For: 2.0, 1.17
>
> Attachments: tika-server-docker-err-msg.txt
>
>
> This version should be able to demonstrate as many of Apache Tika's 
> capabilities as possible. For instance with GDAL, Tesseract, and FFmpeg to 
> show parsers which require installation of other dependencies. In addition, 
> this should help move TIKA-1301 forward and should leverage the suggestion 
> made by [~lewismc] of a script which can pull down the latest version of 
> Apache Tika.





[jira] [Commented] (TIKA-1518) Docker with Tika Server

2018-03-07 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390275#comment-16390275
 ] 

Tim Allison commented on TIKA-1518:
---

And sorry for letting the <!-- --> slip through!!!

> Docker with Tika Server
> ---
>
> Key: TIKA-1518
> URL: https://issues.apache.org/jira/browse/TIKA-1518
> Project: Tika
> Issue Type: New Feature
> Reporter: Paul Ramirez
> Assignee: Dave Meikle
> Priority: Major
> Fix For: 2.0, 1.17
>
> Attachments: tika-server-docker-err-msg.txt
>
>
> This version should be able to demonstrate as many of Apache Tika's 
> capabilities as possible. For instance with GDAL, Tesseract, and FFmpeg to 
> show parsers which require installation of other dependencies. In addition, 
> this should help move TIKA-1301 forward and should leverage the suggestion 
> made by [~lewismc] of a script which can pull down the latest version of 
> Apache Tika.





[jira] [Commented] (TIKA-1518) Docker with Tika Server

2018-03-07 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390268#comment-16390268
 ] 

Tim Allison commented on TIKA-1518:
---

Not quite, different error this time (see attached file)...could be user error, 
I have no doubt!

OTOH, do we want to require Docker on devs' computers?

> Docker with Tika Server
> ---
>
> Key: TIKA-1518
> URL: https://issues.apache.org/jira/browse/TIKA-1518
> Project: Tika
> Issue Type: New Feature
> Reporter: Paul Ramirez
> Assignee: Dave Meikle
> Priority: Major
> Fix For: 2.0, 1.17
>
> Attachments: tika-server-docker-err-msg.txt
>
>
> This version should be able to demonstrate as many of Apache Tika's 
> capabilities as possible. For instance with GDAL, Tesseract, and FFmpeg to 
> show parsers which require installation of other dependencies. In addition, 
> this should help move TIKA-1301 forward and should leverage the suggestion 
> made by [~lewismc] of a script which can pull down the latest version of 
> Apache Tika.





[jira] [Updated] (TIKA-1518) Docker with Tika Server

2018-03-07 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1518:
--
Attachment: tika-server-docker-err-msg.txt

> Docker with Tika Server
> ---
>
> Key: TIKA-1518
> URL: https://issues.apache.org/jira/browse/TIKA-1518
> Project: Tika
>  Issue Type: New Feature
>Reporter: Paul Ramirez
>Assignee: Dave Meikle
>Priority: Major
> Fix For: 2.0, 1.17
>
> Attachments: tika-server-docker-err-msg.txt
>
>
> This version should be able to demonstrate as many of Apache Tika's 
> capabilities as possible. For instance with GDAL, Tesseract, and FFmpeg to 
> show parsers which require installation of other dependencies. In addition, 
> this should help move TIKA-1301 forward and should leverage the suggestion 
> made by [~lewismc] of a script which can pull down the latest version of 
> Apache Tika.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TIKA-2591) Some tiffs (Big Endian with fax compression) are showing up as x-tarr

2018-03-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390253#comment-16390253
 ] 

Hudson commented on TIKA-2591:
--

SUCCESS: Integrated in Jenkins build tika-branch-1x #5 (See 
[https://builds.apache.org/job/tika-branch-1x/5/])
TIKA-2591 -- Add workaround to identify TIFFs that might confuse (tallison: 
[https://github.com/apache/tika/commit/b4047eb2d92ee4ae8d8e02d12079232419775a73])
* (edit) CHANGES.txt
* (add) 
tika-parsers/src/test/java/org/apache/tika/parser/pkg/ZipContainerDetectorTest.java
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/pkg/ZipContainerDetector.java


> Some tiffs (Big Endian with fax compression) are showing up as x-tarr
> -
>
> Key: TIKA-2591
> URL: https://issues.apache.org/jira/browse/TIKA-2591
> Project: Tika
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.16
> Environment: Tika, running in a java application and a unit-test 
> (windows and mac environments)
>Reporter: daniel schmidt
>Priority: Major
>  Labels: newbie
> Fix For: 1.18, 2.0.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I have found that a certain tiff that we manage is now reporting 
> application/x-tar in Tika where it previously reported as a tiff 
> (image/tiff). 
> Observe this code in ArchiveStreamFactory, detect method.
>   // COMPRESS-117 - improve auto-recognition
>         if (signatureLength >= TAR_HEADER_SIZE) {
>             TarArchiveInputStream tais = null;
>             try {
>                 tais = new TarArchiveInputStream(new ByteArrayInputStream(tarHeader));
>                 // COMPRESS-191 - verify the header checksum
>                 if (tais.getNextTarEntry().isCheckSumOK()) {
>                     return TAR;
>                 }
>             } catch (final Exception e) { // NOPMD // NOSONAR
>                 // can generate IllegalArgumentException as well
>                 // as IOException
>                 // autodetection, simply not a TAR
>                 // ignored
>             } finally {
>                 IOUtils.closeQuietly(tais);
>             }
> What I find is that most TIFFs, when they get to tais.getNextTarEntry(), fail 
> with an exception (i.e., fall into the "simply not a TAR" case). However, this 
> tiff actually does NOT fail here. This somewhat makes sense, as a 
> fax-compressed tiff has a tar-like internal structure.
> Note, the CompositeDetector class eventually does recognize it as a proper 
> tiff as it loops through its detectors in its detect method. It is detected 
> as tiff in the MimeTypes class, which is one of the implementations of the 
> Detector interface.
>  
>     public MediaType detect(InputStream input, Metadata metadata)
>             throws IOException {
>         MediaType type = MediaType.OCTET_STREAM;
>         for (Detector detector : getDetectors()) {
>             //short circuit via OverrideDetector
>             //can't rely on ordering because subsequent detector may
>             //change Override's to a specialization of Override's
>             if (detector instanceof OverrideDetector &&
>                     metadata.get(TikaCoreProperties.CONTENT_TYPE_OVERRIDE) != null) {
>                 return detector.detect(input, metadata);
>             }
>             MediaType detected = detector.detect(input, metadata);
>             if (registry.isSpecializationOf(detected, type)) {
>                 type = detected;
>             }
>         }
>         return type;
> However, since image/tiff isn't a specialization of application/x-tar, it does 
> not replace the type with tiff.
> My fix was to add a  "" to the 
> definition for image/tiff in the tika-mimetypes.xml file
>  
>  
>  
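
The selection rule in the quoted detect() loop can be sketched in isolation. This is only a toy model under stated assumptions: the PARENT map is a hypothetical stand-in for Tika's media type registry, the class and method names are made up for illustration, and both types are assumed to have application/octet-stream as their only ancestor. It is not Tika's implementation.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SpecializationSketch {
    // Hypothetical hierarchy: child type -> declared parent type.
    static final Map<String, String> PARENT = new HashMap<>();
    static {
        PARENT.put("application/x-tar", "application/octet-stream");
        PARENT.put("image/tiff", "application/octet-stream");
    }

    // True if 'candidate' is a strict descendant of 'current' in the toy hierarchy.
    static boolean isSpecializationOf(String candidate, String current) {
        String parent = PARENT.get(candidate);
        while (parent != null) {
            if (parent.equals(current)) {
                return true;
            }
            parent = PARENT.get(parent);
        }
        return false;
    }

    // Same shape as the quoted loop: a later result replaces the current
    // type only when it specializes the current type.
    static String detect(List<String> detectorResults) {
        String type = "application/octet-stream";
        for (String detected : detectorResults) {
            if (isSpecializationOf(detected, type)) {
                type = detected;
            }
        }
        return type;
    }

    public static void main(String[] args) {
        // The TAR guess arrives first, so the later TIFF guess cannot replace it.
        System.out.println(detect(Arrays.asList("application/x-tar", "image/tiff")));
        // With the order reversed, the TIFF guess wins instead.
        System.out.println(detect(Arrays.asList("image/tiff", "application/x-tar")));
    }
}
```

In this model the outcome depends on which detector answers first, which matches the behaviour described in the report: once application/x-tar is selected, image/tiff is never a specialization of it.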



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TIKA-2594) Mail detected as application/xhtml+xml

2018-03-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390251#comment-16390251
 ] 

Hudson commented on TIKA-2594:
--

SUCCESS: Integrated in Jenkins build tika-branch-1x #5 (See 
[https://builds.apache.org/job/tika-branch-1x/5/])
TIKA-2594 -- improve eml detection for those starting with Subject: and 
(tallison: 
[https://github.com/apache/tika/commit/b9e9e5b150aca851465e99017da6328c202ba127])
* (add) 
tika-parsers/src/test/resources/test-documents/testEML_embedded_xhtml_and_img.eml
* (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
* (edit) tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java


> Mail detected as application/xhtml+xml
> --
>
> Key: TIKA-2594
> URL: https://issues.apache.org/jira/browse/TIKA-2594
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 2.0, 1.16, 1.17
>Reporter: Andreas Meier
>Priority: Major
> Fix For: 1.18, 2.0.0
>
> Attachments: TestMail_inline_xhtml_plus_image.eml
>
>
> The attached mail (message/rfc822) with inline xhtml is recognized as 
> application/xhtml+xml
> Regards
> Andreas



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TIKA-2592) HTML with charset unicode handled as utf-16 instead utf-8

2018-03-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390252#comment-16390252
 ] 

Hudson commented on TIKA-2592:
--

SUCCESS: Integrated in Jenkins build tika-branch-1x #5 (See 
[https://builds.apache.org/job/tika-branch-1x/5/])
TIKA-2592 -- ignore charsets not supported by IANA in html meta-headers 
(tallison: 
[https://github.com/apache/tika/commit/164c9286fc0933051e86ce0a209250aa51bee3bf])
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlEncodingDetector.java
* (edit) CHANGES.txt
* (add) 
tika-parsers/src/main/resources/org/apache/tika/parser/html/StandardCharsets_unsupported_by_IANA.txt
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/html/HtmlParserTest.java
* (add) 
tika-parsers/src/test/resources/test-documents/testHTML_charset_utf16le.html
* (add) 
tika-parsers/src/test/resources/test-documents/testHTML_charset_utf8.html


> HTML with charset unicode handled as utf-16 instead utf-8
> -
>
> Key: TIKA-2592
> URL: https://issues.apache.org/jira/browse/TIKA-2592
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 2.0, 1.16, 1.17
>Reporter: Andreas Meier
>Priority: Minor
> Fix For: 1.18, 2.0.0
>
> Attachments: IANA Charset names.txt, 
> StandardCharsets_unsupported_by_IANA.txt, TestCharsetUnicodeHTML.html, 
> TestHTMLCharsetArabicCP1256.html, TestHTMLCharsetCP1256.html, 
> fix-for-TIKA2592-contributed-by-Andreas-Meier.patch
>
>
> HTML files are detected as utf-16 when meta content is set to "unicode".
> {code:XML}
> 
>  {code}
>  
> Shouldn't the default be utf-8?
> The attached sample file is shown correctly in:
> Chromium Version 55.0.2883.75
> Firefox 50.1.0
> IE 11
> I am aware that there is no charset "unicode" (available character encodings: 
> http://www.iana.org/assignments/character-sets/character-sets.xhtml).
> Unfortunately there are many wrong encodings used out there.
> All unknown encodings should be validated or at least be set to default utf-8.
> Regards 
> Andreas
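
The fallback requested above can be sketched as follows. This is a minimal illustration, not the change made in HtmlEncodingDetector: the class name, the resolve method, and the contents of the deny-list are all hypothetical. It assumes that the JVM's charset lookup may accept labels that are not IANA-registered names, and that such labels, along with unknown or malformed ones, should fall back to UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class CharsetFallbackSketch {
    // Hypothetical deny-list of labels a JVM may resolve but that are not
    // IANA charset names. The real list would be far longer.
    private static final Set<String> NON_IANA_LABELS =
            Collections.unmodifiableSet(new HashSet<>(Collections.singletonList("unicode")));

    // Resolve a charset label taken from an HTML meta header, defaulting to UTF-8.
    static Charset resolve(String label) {
        if (label == null) {
            return StandardCharsets.UTF_8;
        }
        String name = label.trim().toLowerCase(Locale.ROOT);
        if (name.isEmpty() || NON_IANA_LABELS.contains(name)) {
            return StandardCharsets.UTF_8;
        }
        try {
            if (Charset.isSupported(name)) {
                return Charset.forName(name);
            }
        } catch (IllegalArgumentException e) {
            // The label contains characters that are illegal in a charset name.
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        System.out.println(resolve("unicode"));
        System.out.println(resolve("ISO-8859-1"));
    }
}
```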



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TIKA-2600) Don't use md5 checksum due to changes to the release distribuition policy

2018-03-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390250#comment-16390250
 ] 

Hudson commented on TIKA-2600:
--

SUCCESS: Integrated in Jenkins build tika-branch-1x #5 (See 
[https://builds.apache.org/job/tika-branch-1x/5/])
TIKA-2600 -- remove md5 checksum, and switch sha-1 to sha-512 for (tallison: 
[https://github.com/apache/tika/commit/32c19dee5bd4952f9f041f5fba218130fa02bdb5])
* (edit) pom.xml


> Don't use md5 checksum due to changes to the release distribuition policy
> -
>
> Key: TIKA-2600
> URL: https://issues.apache.org/jira/browse/TIKA-2600
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Blocker
> Fix For: 1.18, 2.0.0
>
>
> To plagiarize from PDFBOX-4142:
> The release distribution policy was changed with regard to the checksums to 
> be used:
> Old policy :
> MUST provide a MD5-file
> SHOULD provide a SHA-file [SHA-512 recommended]
> New policy :
> MUST provide a SHA- or MD5-file
> SHOULD provide a SHA-file
> SHOULD NOT provide a MD5-file
> see http://www.apache.org/dev/release-distribution for further details
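
For readers who want to check a downloaded artifact against a SHA-512 file, the digest can be computed with the JDK alone. This is only a sketch: the commit above changes the Maven build in pom.xml, not Java code, and the class and method names here are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Sha512Sketch {
    // Returns the SHA-512 digest of the input as a lowercase hex string.
    static String sha512Hex(byte[] data) {
        MessageDigest md;
        try {
            md = MessageDigest.getInstance("SHA-512");
        } catch (NoSuchAlgorithmException e) {
            // Not expected on a standard JDK; rethrown unchecked for brevity.
            throw new IllegalStateException(e);
        }
        byte[] digest = md.digest(data);
        StringBuilder hex = new StringBuilder(digest.length * 2);
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        // A SHA-512 digest is 64 bytes, so the hex form is 128 characters long.
        System.out.println(sha512Hex("tika".getBytes(StandardCharsets.UTF_8)).length());
    }
}
```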



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TIKA-2590) ExcelExtractor: cannot choose listening to the selected records only

2018-03-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390254#comment-16390254
 ] 

Hudson commented on TIKA-2590:
--

SUCCESS: Integrated in Jenkins build tika-branch-1x #5 (See 
[https://builds.apache.org/job/tika-branch-1x/5/])
TIKA-2590 -- revert listenForAllRecords = false thanks to Grigoriy (tallison: 
[https://github.com/apache/tika/commit/a9b4b3676f9476ae78246aa2f962006502243a24])
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ExcelExtractor.java


> ExcelExtractor: cannot choose listening to the selected records only
> 
>
> Key: TIKA-2590
> URL: https://issues.apache.org/jira/browse/TIKA-2590
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.17
>Reporter: Grigoriy Alekseev
>Priority: Critical
> Fix For: 1.18, 2.0.0
>
>
> The listenForAllRecords argument is always reset to 'true', so the 
> 'else' branch is never reached. This may cause incorrect text extraction when 
> records with certain unsupported types (e.g. SharedFormula) are present in a 
> file.
> {code:java}
> public void processFile(DirectoryNode root, boolean listenForAllRecords)
>         throws IOException, SAXException, TikaException {
>     // Set up listener and register the records we want to process
>     HSSFRequest hssfRequest = new HSSFRequest();
>     listenForAllRecords = true;
>     if (listenForAllRecords) {
>         hssfRequest.addListenerForAllRecords(formatListener);
>     } else {
>         hssfRequest.addListener(formatListener, BOFRecord.sid);
>         hssfRequest.addListener(formatListener, EOFRecord.sid);
>         hssfRequest.addListener(formatListener, DateWindow1904Record.sid);
>         hssfRequest.addListener(formatListener, CountryRecord.sid);
>         hssfRequest.addListener(formatListener, BoundSheetRecord.sid);
>         hssfRequest.addListener(formatListener, SSTRecord.sid);
>         hssfRequest.addListener(formatListener, FormulaRecord.sid);
>         hssfRequest.addListener(formatListener, LabelRecord.sid);
>         hssfRequest.addListener(formatListener, LabelSSTRecord.sid);
>         hssfRequest.addListener(formatListener, NumberRecord.sid);
>         hssfRequest.addListener(formatListener, RKRecord.sid);
>         hssfRequest.addListener(formatListener, StringRecord.sid);
>         hssfRequest.addListener(formatListener, HyperlinkRecord.sid);
>         hssfRequest.addListener(formatListener, TextObjectRecord.sid);
>         hssfRequest.addListener(formatListener, SeriesTextRecord.sid);
>         hssfRequest.addListener(formatListener, FormatRecord.sid);
>         hssfRequest.addListener(formatListener, ExtendedFormatRecord.sid);
>         hssfRequest.addListener(formatListener, DrawingGroupRecord.sid);
>         if (extractor.officeParserConfig.getIncludeHeadersAndFooters()) {
>             hssfRequest.addListener(formatListener, HeaderRecord.sid);
>             hssfRequest.addListener(formatListener, FooterRecord.sid);
>         }
>     }
> {code}
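
The effect of reassigning the parameter can be reproduced without POI. The sketch below is an illustration only: the class, the method names, and the string labels are hypothetical stand-ins for the HSSFRequest listener registration quoted above.

```java
import java.util.ArrayList;
import java.util.List;

public class ListenerFlagSketch {
    // Mirrors the reported bug: the caller's choice is overwritten before it is read.
    static List<String> registerBuggy(boolean listenForAllRecords) {
        listenForAllRecords = true;
        return register(listenForAllRecords);
    }

    // Mirrors the intended behaviour: the caller's choice is honoured.
    static List<String> registerFixed(boolean listenForAllRecords) {
        return register(listenForAllRecords);
    }

    // Returns labels for what would be registered (a stand-in for the addListener calls).
    private static List<String> register(boolean all) {
        List<String> registered = new ArrayList<>();
        if (all) {
            registered.add("ALL_RECORDS");
        } else {
            registered.add("BOFRecord");
            registered.add("EOFRecord");
        }
        return registered;
    }

    public static void main(String[] args) {
        System.out.println(registerBuggy(false));
        System.out.println(registerFixed(false));
    }
}
```

With the reassignment in place, passing false has no effect and the selective branch is dead code. Removing that one line restores the caller's control.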



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TIKA-2591) Some tiffs (Big Endian with fax compression) are showing up as x-tarr

2018-03-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390247#comment-16390247
 ] 

Hudson commented on TIKA-2591:
--

SUCCESS: Integrated in Jenkins build Tika-trunk #1453 (See 
[https://builds.apache.org/job/Tika-trunk/1453/])
TIKA-2591 -- Add workaround to identify TIFFs that might confuse (tallison: 
[https://github.com/apache/tika/commit/462ee4744fd426cfdb12539435627b25e789c912])
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/pkg/ZipContainerDetector.java
* (edit) CHANGES.txt
* (add) 
tika-parsers/src/test/java/org/apache/tika/parser/pkg/ZipContainerDetectorTest.java


> Some tiffs (Big Endian with fax compression) are showing up as x-tarr
> -
>
> Key: TIKA-2591
> URL: https://issues.apache.org/jira/browse/TIKA-2591
> Project: Tika
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.16
> Environment: Tika, running in a java application and a unit-test 
> (windows and mac environments)
>Reporter: daniel schmidt
>Priority: Major
>  Labels: newbie
> Fix For: 1.18, 2.0.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I have found that a certain tiff that we manage is now reporting 
> application/x-tar in Tika where it previously reported as a tiff 
> (image/tiff). 
> Observe this code in ArchiveStreamFactory, detect method.
>   // COMPRESS-117 - improve auto-recognition
>         if (signatureLength >= TAR_HEADER_SIZE) {
>             TarArchiveInputStream tais = null;
>             try {
>                 tais = new TarArchiveInputStream(new ByteArrayInputStream(tarHeader));
>                 // COMPRESS-191 - verify the header checksum
>                 if (tais.getNextTarEntry().isCheckSumOK()) {
>                     return TAR;
>                 }
>             } catch (final Exception e) { // NOPMD // NOSONAR
>                 // can generate IllegalArgumentException as well
>                 // as IOException
>                 // autodetection, simply not a TAR
>                 // ignored
>             } finally {
>                 IOUtils.closeQuietly(tais);
>             }
> What I find is that most TIFFs, when they get to tais.getNextTarEntry(), fail 
> with an exception (i.e., fall into the "simply not a TAR" case). However, this 
> tiff actually does NOT fail here. This somewhat makes sense, as a 
> fax-compressed tiff has a tar-like internal structure.
> Note, the CompositeDetector class eventually does recognize it as a proper 
> tiff as it loops through its detectors in its detect method. It is detected 
> as tiff in the MimeTypes class, which is one of the implementations of the 
> Detector interface.
>  
>     public MediaType detect(InputStream input, Metadata metadata)
>             throws IOException {
>         MediaType type = MediaType.OCTET_STREAM;
>         for (Detector detector : getDetectors()) {
>             //short circuit via OverrideDetector
>             //can't rely on ordering because subsequent detector may
>             //change Override's to a specialization of Override's
>             if (detector instanceof OverrideDetector &&
>                     metadata.get(TikaCoreProperties.CONTENT_TYPE_OVERRIDE) != null) {
>                 return detector.detect(input, metadata);
>             }
>             MediaType detected = detector.detect(input, metadata);
>             if (registry.isSpecializationOf(detected, type)) {
>                 type = detected;
>             }
>         }
>         return type;
> However, since image/tiff isn't a specialization of application/x-tar, it does 
> not replace the type with tiff.
> My fix was to add a  "" to the 
> definition for image/tiff in the tika-mimetypes.xml file
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TIKA-2590) ExcelExtractor: cannot choose listening to the selected records only

2018-03-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390246#comment-16390246
 ] 

Hudson commented on TIKA-2590:
--

SUCCESS: Integrated in Jenkins build Tika-trunk #1453 (See 
[https://builds.apache.org/job/Tika-trunk/1453/])
TIKA-2590: restore the client's ability to choose what Excel file (g.alekseev: 
[https://github.com/apache/tika/commit/c56c7c41a6c51e4cd4dac78b693bd883f1329264])
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ExcelExtractor.java
TIKA-2590 update Changes.txt (tallison: 
[https://github.com/apache/tika/commit/947334cbf40bc6efef1cb488749213724bedb171])
* (edit) CHANGES.txt


> ExcelExtractor: cannot choose listening to the selected records only
> 
>
> Key: TIKA-2590
> URL: https://issues.apache.org/jira/browse/TIKA-2590
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.17
>Reporter: Grigoriy Alekseev
>Priority: Critical
> Fix For: 1.18, 2.0.0
>
>
> The listenForAllRecords argument is always reset to 'true', so the 
> 'else' branch is never reached. This may cause incorrect text extraction when 
> records with certain unsupported types (e.g. SharedFormula) are present in a 
> file.
> {code:java}
> public void processFile(DirectoryNode root, boolean listenForAllRecords)
>         throws IOException, SAXException, TikaException {
>     // Set up listener and register the records we want to process
>     HSSFRequest hssfRequest = new HSSFRequest();
>     listenForAllRecords = true;
>     if (listenForAllRecords) {
>         hssfRequest.addListenerForAllRecords(formatListener);
>     } else {
>         hssfRequest.addListener(formatListener, BOFRecord.sid);
>         hssfRequest.addListener(formatListener, EOFRecord.sid);
>         hssfRequest.addListener(formatListener, DateWindow1904Record.sid);
>         hssfRequest.addListener(formatListener, CountryRecord.sid);
>         hssfRequest.addListener(formatListener, BoundSheetRecord.sid);
>         hssfRequest.addListener(formatListener, SSTRecord.sid);
>         hssfRequest.addListener(formatListener, FormulaRecord.sid);
>         hssfRequest.addListener(formatListener, LabelRecord.sid);
>         hssfRequest.addListener(formatListener, LabelSSTRecord.sid);
>         hssfRequest.addListener(formatListener, NumberRecord.sid);
>         hssfRequest.addListener(formatListener, RKRecord.sid);
>         hssfRequest.addListener(formatListener, StringRecord.sid);
>         hssfRequest.addListener(formatListener, HyperlinkRecord.sid);
>         hssfRequest.addListener(formatListener, TextObjectRecord.sid);
>         hssfRequest.addListener(formatListener, SeriesTextRecord.sid);
>         hssfRequest.addListener(formatListener, FormatRecord.sid);
>         hssfRequest.addListener(formatListener, ExtendedFormatRecord.sid);
>         hssfRequest.addListener(formatListener, DrawingGroupRecord.sid);
>         if (extractor.officeParserConfig.getIncludeHeadersAndFooters()) {
>             hssfRequest.addListener(formatListener, HeaderRecord.sid);
>             hssfRequest.addListener(formatListener, FooterRecord.sid);
>         }
>     }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TIKA-1518) Docker with Tika Server

2018-03-07 Thread Dave Meikle (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390241#comment-16390241
 ] 

Dave Meikle commented on TIKA-1518:
---

{quote}I do have Docker installed, [0] but it is Windows, and I've noticed 
some, um, areas for improvement in Docker on Windows.
{quote}
I've found on Windows I have had to enable the "Expose daemon on 
tcp://localhost:2375 without TLS" in Docker for Windows to talk to it with many 
of the clients out there. Does this solve it for you?

> Docker with Tika Server
> ---
>
> Key: TIKA-1518
> URL: https://issues.apache.org/jira/browse/TIKA-1518
> Project: Tika
>  Issue Type: New Feature
>Reporter: Paul Ramirez
>Assignee: Dave Meikle
>Priority: Major
> Fix For: 2.0, 1.17
>
>
> This version should be able to demonstrate as many of Apache Tika's 
> capabilities as possible. For instance with GDAL, Tesseract, and FFmpeg to 
> show parsers which require installation of other dependencies. In addition, 
> this should help move TIKA-1301 forward and should leverage the suggestion 
> made by [~lewismc] of a script which can pull down the latest version of 
> Apache Tika.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


RE: Tika 1.18?

2018-03-07 Thread Allison, Timothy B.

All,

  I think I've made the updates that I wanted to make sure got into 1.18.  It 
looks like PDFBox is going to start their release cycle shortly.  Should we 
wait for PDFBox 2.0.9?

  That may add a week or two to our release, although, frankly, it might not.  
We can start running the regression tests March 9(ish) and see if anything dire 
appears...

  Cheers,

  Tim



[jira] [Comment Edited] (TIKA-1518) Docker with Tika Server

2018-03-07 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390216#comment-16390216
 ] 

Tim Allison edited comment on TIKA-1518 at 3/7/18 8:58 PM:
---

bq.  this is me getting too excited

?!
 
I do have Docker installed, [0] but it is Windows, and I've noticed some, um, 
areas for improvement in Docker on Windows.

Thank you!

[0]
{noformat}
C:\stuff>docker -v
Docker version 17.12.0-ce, build c97c6d6
{noformat}


was (Author: talli...@mitre.org):
bq.  this is me getting too excited

?!
 
I do have Docker installed, [0] but it is Windows, and I've noticed some, um, 
areas for improvement in Docker on Windows.

Thank you!

[0]
{noformat}
C:\stuff>docker -v
Docker version 17.12.0-ce, build c97c6d6
{nformat}

> Docker with Tika Server
> ---
>
> Key: TIKA-1518
> URL: https://issues.apache.org/jira/browse/TIKA-1518
> Project: Tika
>  Issue Type: New Feature
>Reporter: Paul Ramirez
>Assignee: Dave Meikle
>Priority: Major
> Fix For: 2.0, 1.17
>
>
> This version should be able to demonstrate as many of Apache Tika's 
> capabilities as possible. For instance with GDAL, Tesseract, and FFmpeg to 
> show parsers which require installation of other dependencies. In addition, 
> this should help move TIKA-1301 forward and should leverage the suggestion 
> made by [~lewismc] of a script which can pull down the latest version of 
> Apache Tika.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (TIKA-1518) Docker with Tika Server

2018-03-07 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390216#comment-16390216
 ] 

Tim Allison edited comment on TIKA-1518 at 3/7/18 8:58 PM:
---

bq.  this is me getting too excited

?!
 
I do have Docker installed, [0] but it is Windows, and I've noticed some, um, 
areas for improvement in Docker on Windows.

Thank you!

[0]
{noformat}
C:\stuff>docker -v
Docker version 17.12.0-ce, build c97c6d6
{nformat}


was (Author: talli...@mitre.org):
bq.  this is me getting too excited

?!
 
I do have Docker installed, but it is Windows, and I've noticed some, um, areas 
for improvement in Docker on Windows.

Thank you!

> Docker with Tika Server
> ---
>
> Key: TIKA-1518
> URL: https://issues.apache.org/jira/browse/TIKA-1518
> Project: Tika
>  Issue Type: New Feature
>Reporter: Paul Ramirez
>Assignee: Dave Meikle
>Priority: Major
> Fix For: 2.0, 1.17
>
>
> This version should be able to demonstrate as many of Apache Tika's 
> capabilities as possible. For instance with GDAL, Tesseract, and FFmpeg to 
> show parsers which require installation of other dependencies. In addition, 
> this should help move TIKA-1301 forward and should leverage the suggestion 
> made by [~lewismc] of a script which can pull down the latest version of 
> Apache Tika.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TIKA-1518) Docker with Tika Server

2018-03-07 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390216#comment-16390216
 ] 

Tim Allison commented on TIKA-1518:
---

bq.  this is me getting too excited

?!
 
I do have Docker installed, but it is Windows, and I've noticed some, um, areas 
for improvement in Docker on Windows.

Thank you!

> Docker with Tika Server
> ---
>
> Key: TIKA-1518
> URL: https://issues.apache.org/jira/browse/TIKA-1518
> Project: Tika
>  Issue Type: New Feature
>Reporter: Paul Ramirez
>Assignee: Dave Meikle
>Priority: Major
> Fix For: 2.0, 1.17
>
>
> This version should be able to demonstrate as many of Apache Tika's 
> capabilities as possible. For instance with GDAL, Tesseract, and FFmpeg to 
> show parsers which require installation of other dependencies. In addition, 
> this should help move TIKA-1301 forward and should leverage the suggestion 
> made by [~lewismc] of a script which can pull down the latest version of 
> Apache Tika.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (TIKA-1518) Docker with Tika Server

2018-03-07 Thread Dave Meikle (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390202#comment-16390202
 ] 

Dave Meikle edited comment on TIKA-1518 at 3/7/18 8:51 PM:
---

Sorry [~talli...@mitre.org] - this is me getting too excited. I'll need to 
remove it from being hooked on the "build" phase so those without Docker can 
build without this!

Will do this just now.


was (Author: davemeikle):
Sorry [~talli...@mitre.org] - this is me getting too excited. I'll need to 
remove it from being hooked on the "build" phase so those without Docker can 
build without this!

Will do this just now.

 

 

 

 

> Docker with Tika Server
> ---
>
> Key: TIKA-1518
> URL: https://issues.apache.org/jira/browse/TIKA-1518
> Project: Tika
>  Issue Type: New Feature
>Reporter: Paul Ramirez
>Assignee: Dave Meikle
>Priority: Major
> Fix For: 2.0, 1.17
>
>
> This version should be able to demonstrate as many of Apache Tika's 
> capabilities as possible. For instance with GDAL, Tesseract, and FFmpeg to 
> show parsers which require installation of other dependencies. In addition, 
> this should help move TIKA-1301 forward and should leverage the suggestion 
> made by [~lewismc] of a script which can pull down the latest version of 
> Apache Tika.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TIKA-1518) Docker with Tika Server

2018-03-07 Thread Dave Meikle (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390202#comment-16390202
 ] 

Dave Meikle commented on TIKA-1518:
---

Sorry [~talli...@mitre.org] - this is me getting too excited. I'll need to 
remove it from being hooked on the "build" phase so those without Docker can 
build without this!

Will do this just now.

> Docker with Tika Server
> ---
>
> Key: TIKA-1518
> URL: https://issues.apache.org/jira/browse/TIKA-1518
> Project: Tika
>  Issue Type: New Feature
>Reporter: Paul Ramirez
>Assignee: Dave Meikle
>Priority: Major
> Fix For: 2.0, 1.17
>
>
> This version should be able to demonstrate as many of Apache Tika's 
> capabilities as possible. For instance with GDAL, Tesseract, and FFmpeg to 
> show parsers which require installation of other dependencies. In addition, 
> this should help move TIKA-1301 forward and should leverage the suggestion 
> made by [~lewismc] of a script which can pull down the latest version of 
> Apache Tika.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TIKA-2594) Mail detected as application/xhtml+xml

2018-03-07 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390197#comment-16390197
 ] 

Tim Allison commented on TIKA-2594:
---

[~lfcnassif], I added the mime defs you suggested above just now to both 2.0.0 
and 1.18.  Thank you!

> Mail detected as application/xhtml+xml
> --
>
> Key: TIKA-2594
> URL: https://issues.apache.org/jira/browse/TIKA-2594
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 2.0, 1.16, 1.17
>Reporter: Andreas Meier
>Priority: Major
> Fix For: 1.18, 2.0.0
>
> Attachments: TestMail_inline_xhtml_plus_image.eml
>
>
> The attached mail (message/rfc822) with inline xhtml is recognized as 
> application/xhtml+xml
> Regards
> Andreas



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TIKA-2591) Some tiffs (Big Endian with fax compression) are showing up as x-tarr

2018-03-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390182#comment-16390182
 ] 

Hudson commented on TIKA-2591:
--

FAILURE: Integrated in Jenkins build tika-2.x-windows #214 (See 
[https://builds.apache.org/job/tika-2.x-windows/214/])
TIKA-2591 -- Add workaround to identify TIFFs that might confuse (tallison: rev 
462ee4744fd426cfdb12539435627b25e789c912)
* (edit) CHANGES.txt
* (add) 
tika-parsers/src/test/java/org/apache/tika/parser/pkg/ZipContainerDetectorTest.java
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/pkg/ZipContainerDetector.java


> Some tiffs (Big Endian with fax compression) are showing up as x-tarr
> -
>
> Key: TIKA-2591
> URL: https://issues.apache.org/jira/browse/TIKA-2591
> Project: Tika
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.16
> Environment: Tika, running in a java application and a unit-test 
> (windows and mac environments)
>Reporter: daniel schmidt
>Priority: Major
>  Labels: newbie
> Fix For: 1.18, 2.0.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I have found that a certain tiff that we manage is now reporting 
> application/x-tar in Tika where it previously reported as a tiff 
> (image/tiff). 
> Observe this code in ArchiveStreamFactory, detect method.
>   // COMPRESS-117 - improve auto-recognition
>         if (signatureLength >= TAR_HEADER_SIZE) {
>             TarArchiveInputStream tais = null;
>             try {
>                 tais = new TarArchiveInputStream(new ByteArrayInputStream(tarHeader));
>                 // COMPRESS-191 - verify the header checksum
>                 if (tais.getNextTarEntry().isCheckSumOK()) {
>                     return TAR;
>                 }
>             } catch (final Exception e) { // NOPMD // NOSONAR
>                 // can generate IllegalArgumentException as well
>                 // as IOException
>                 // autodetection, simply not a TAR
>                 // ignored
>             } finally {
>                 IOUtils.closeQuietly(tais);
>             }
> What I find is that most TIFFs, when they get to tais.getNextTarEntry(), fail 
> with an exception (i.e., fall into the "simply not a TAR" case). However, this 
> tiff actually does NOT fail here. This somewhat makes sense, as a 
> fax-compressed tiff has a tar-like internal structure.
> Note, the CompositeDetector class eventually does recognize it as a proper 
> tiff as it loops through its detectors in its detect method. It is detected 
> as tiff in the MimeTypes class, which is one of the implementations of the 
> Detector interface.
>  
>     public MediaType detect(InputStream input, Metadata metadata)
>             throws IOException {
>         MediaType type = MediaType.OCTET_STREAM;
>         for (Detector detector : getDetectors()) {
>             //short circuit via OverrideDetector
>             //can't rely on ordering because subsequent detector may
>             //change Override's to a specialization of Override's
>             if (detector instanceof OverrideDetector &&        
> metadata.get(TikaCoreProperties.CONTENT_TYPE_OVERRIDE) != null) {
>                 return detector.detect(input, metadata);
>             }
>             MediaType detected = detector.detect(input, metadata);
>             if (registry.isSpecializationOf(detected, type)) {
>                 type = detected;
>             }
>         }
>         return type;
> However, since image/tiff isn't a specialization of application/x-tar, the 
> loop does not replace the type with image/tiff.
> My fix was to add a  "" to the 
> definition for image/tiff in the tika-mimetypes.xml file
>  
>  
>  
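
The effect of the detect loop quoted above can be reproduced with a small stand-alone sketch. This is not Tika code: DetectionSketch, addSuperType and the string-based types are hypothetical stand-ins for the registry and detectors, and the only assumption is the rule visible in the quoted loop (a later result replaces the current one only if it is a specialization of it).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a media type registry; not the Tika API.
class DetectionSketch {
    static final String OCTET_STREAM = "application/octet-stream";

    // Maps a type to its declared parent type.
    private final Map<String, String> parents = new HashMap<>();

    void addSuperType(String type, String parent) {
        parents.put(type, parent);
    }

    // True if 'type' descends from 'ancestor'. Every other type is treated
    // as a specialization of application/octet-stream.
    boolean isSpecializationOf(String type, String ancestor) {
        if (type.equals(ancestor)) {
            return false;
        }
        if (OCTET_STREAM.equals(ancestor)) {
            return true;
        }
        String parent = parents.get(type);
        while (parent != null) {
            if (parent.equals(ancestor)) {
                return true;
            }
            parent = parents.get(parent);
        }
        return false;
    }

    // Same shape as the quoted loop: keep the first answer unless a later
    // detector returns something more specific.
    String detect(List<String> detectorResults) {
        String type = OCTET_STREAM;
        for (String detected : detectorResults) {
            if (isSpecializationOf(detected, type)) {
                type = detected;
            }
        }
        return type;
    }

    public static void main(String[] args) {
        DetectionSketch sketch = new DetectionSketch();
        List<String> results = Arrays.asList("application/x-tar", "image/tiff");
        System.out.println(sketch.detect(results));
        // Declaring a parent here only shows how the loop reacts to such a
        // relationship; it is not a statement of what the actual fix added.
        sketch.addSuperType("image/tiff", "application/x-tar");
        System.out.println(sketch.detect(results));
    }
}
```

With no relationship between the two types, whichever detector answers first wins, which matches the behaviour described in the report.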





[jira] [Commented] (TIKA-2590) ExcelExtractor: cannot choose listening to the selected records only

2018-03-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390181#comment-16390181
 ] 

Hudson commented on TIKA-2590:
--

FAILURE: Integrated in Jenkins build tika-2.x-windows #214 (See 
[https://builds.apache.org/job/tika-2.x-windows/214/])
TIKA-2590: restore the client's ability to choose what Excel file (g.alekseev: 
rev c56c7c41a6c51e4cd4dac78b693bd883f1329264)
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ExcelExtractor.java
TIKA-2590 update Changes.txt (tallison: rev 
947334cbf40bc6efef1cb488749213724bedb171)
* (edit) CHANGES.txt


> ExcelExtractor: cannot choose listening to the selected records only
> 
>
> Key: TIKA-2590
> URL: https://issues.apache.org/jira/browse/TIKA-2590
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.17
>Reporter: Grigoriy Alekseev
>Priority: Critical
> Fix For: 1.18, 2.0.0
>
>
> The listenForAllRecords argument is always reset to 'true', so the 
> 'else' branch is never reached. This may cause incorrect text extraction when 
> records with certain unsupported types (e.g. SharedFormula) are present in a 
> file.
> {code:java}
> public void processFile(DirectoryNode root, boolean 
> listenForAllRecords)
> throws IOException, SAXException, TikaException {
> // Set up listener and register the records we want to process
> HSSFRequest hssfRequest = new HSSFRequest();
> listenForAllRecords = true;
> if (listenForAllRecords) {
> hssfRequest.addListenerForAllRecords(formatListener);
> } else {
> hssfRequest.addListener(formatListener, BOFRecord.sid);
> hssfRequest.addListener(formatListener, EOFRecord.sid);
> hssfRequest.addListener(formatListener, 
> DateWindow1904Record.sid);
> hssfRequest.addListener(formatListener, CountryRecord.sid);
> hssfRequest.addListener(formatListener, BoundSheetRecord.sid);
> hssfRequest.addListener(formatListener, SSTRecord.sid);
> hssfRequest.addListener(formatListener, FormulaRecord.sid);
> hssfRequest.addListener(formatListener, LabelRecord.sid);
> hssfRequest.addListener(formatListener, LabelSSTRecord.sid);
> hssfRequest.addListener(formatListener, NumberRecord.sid);
> hssfRequest.addListener(formatListener, RKRecord.sid);
> hssfRequest.addListener(formatListener, StringRecord.sid);
> hssfRequest.addListener(formatListener, HyperlinkRecord.sid);
> hssfRequest.addListener(formatListener, TextObjectRecord.sid);
> hssfRequest.addListener(formatListener, SeriesTextRecord.sid);
> hssfRequest.addListener(formatListener, FormatRecord.sid);
> hssfRequest.addListener(formatListener, 
> ExtendedFormatRecord.sid);
> hssfRequest.addListener(formatListener, 
> DrawingGroupRecord.sid);
> if 
> (extractor.officeParserConfig.getIncludeHeadersAndFooters()) {
> hssfRequest.addListener(formatListener, HeaderRecord.sid);
> hssfRequest.addListener(formatListener, FooterRecord.sid);
> }
> }
> {code}
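
The problem described above comes down to a parameter being overwritten before it is read. A minimal sketch, independent of POI and Tika (the class, the method names and the string labels are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: strings stand in for the record types that the real
// code registers on an HSSFRequest.
class ListenerSelectionSketch {

    // Mirrors the reported behaviour: the caller's choice is discarded.
    static List<String> registerBuggy(boolean listenForAllRecords) {
        listenForAllRecords = true; // the assignment the issue complains about
        return register(listenForAllRecords);
    }

    // Same method with the assignment removed, as in the quoted diff.
    static List<String> registerFixed(boolean listenForAllRecords) {
        return register(listenForAllRecords);
    }

    private static List<String> register(boolean all) {
        List<String> registered = new ArrayList<>();
        if (all) {
            registered.add("ALL");
        } else {
            registered.add("BOFRecord");
            registered.add("EOFRecord");
        }
        return registered;
    }

    public static void main(String[] args) {
        System.out.println(registerBuggy(false));
        System.out.println(registerFixed(false));
    }
}
```

With the assignment in place, passing false has no effect and the selective branch is dead code.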





[jira] [Resolved] (TIKA-2527) Typos in tika-mimetypes.xml

2018-03-07 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2527.
---
   Resolution: Fixed
Fix Version/s: 2.0.0
   1.18

Thank you, again, [~AndreasMeier]!

> Typos in tika-mimetypes.xml
> ---
>
> Key: TIKA-2527
> URL: https://issues.apache.org/jira/browse/TIKA-2527
> Project: Tika
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.0, 1.16, 1.17, 1.18
> Environment: ALL
>Reporter: Andreas Meier
>Priority: Minor
> Fix For: 1.18, 2.0.0
>
> Attachments: enhancement-for-TIKA2527-contributed-by-AMeier.patch, 
> fix-for-TIKA2527-contributed-by-AMeier-Fixed-adpcmmi.patch, 
> fix-for-binhexmatch-TIKA2527-contributed-by-AMeier.patch
>
>
> Are these mimetypes in tika-mimetypes.xml intended?
> audio/x-adbcm instead of audio/x-adpcm
> {code:xml} {code}
> and
> audio/x-dec-adbcm instead of audio/x-dec-adpcm
> {code:xml} {code}
> I couldn't find these mimetypes.
> Regards
> Andreas





[jira] [Commented] (TIKA-2592) HTML with charset unicode handled as utf-16 instead utf-8

2018-03-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390150#comment-16390150
 ] 

Hudson commented on TIKA-2592:
--

SUCCESS: Integrated in Jenkins build Tika-trunk #1452 (See 
[https://builds.apache.org/job/Tika-trunk/1452/])
TIKA-2592 -- ignore charsets not supported by IANA in html meta-headers 
(tallison: 
[https://github.com/apache/tika/commit/7e2b1e7534268b40c8b4ef3ee20ed708bf2e383c])
* (add) 
tika-parsers/src/test/resources/test-documents/testHTML_charset_utf8.html
* (add) 
tika-parsers/src/main/resources/org/apache/tika/parser/html/StandardCharsets_unsupported_by_IANA.txt
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlEncodingDetector.java
* (add) 
tika-parsers/src/test/resources/test-documents/testHTML_charset_utf16le.html
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/html/HtmlParserTest.java
* (edit) CHANGES.txt


> HTML with charset unicode handled as utf-16 instead utf-8
> -
>
> Key: TIKA-2592
> URL: https://issues.apache.org/jira/browse/TIKA-2592
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 2.0, 1.16, 1.17
>Reporter: Andreas Meier
>Priority: Minor
> Fix For: 1.18, 2.0.0
>
> Attachments: IANA Charset names.txt, 
> StandardCharsets_unsupported_by_IANA.txt, TestCharsetUnicodeHTML.html, 
> TestHTMLCharsetArabicCP1256.html, TestHTMLCharsetCP1256.html, 
> fix-for-TIKA2592-contributed-by-Andreas-Meier.patch
>
>
> HTML files are detected as utf-16 when meta content is set to "unicode".
> {code:XML}
> 
>  {code}
>  
> Shouldn't the default be utf-8?
> The attached sample file is shown correctly in:
> Chromium Version 55.0.2883.75
> Firefox 50.1.0
> IE 11
> I am aware that there is no charset "unicode" (available character encodings: 
> [http://www.iana.org/assignments/character-sets/character-sets.xhtml|http://www.iana.org/assignments/character-sets/character-sets.xhtml])
> Unfortunately there are many wrong encodings used out there.
> All unknown encodings should be validated or at least be set to default utf-8.
> Regards 
> Andreas
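
One way to picture the behaviour requested above is a lookup that refuses to trust certain meta labels and lets the caller fall back to a default. The sketch below is not the committed change: MetaCharsetSketch, fromMetaLabel and the one-entry IGNORED set are assumptions for illustration. Only java.nio.charset.Charset and its two documented exceptions are real API.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Optional;
import java.util.Set;

class MetaCharsetSketch {
    // Hypothetical ignore list; the real list lives in a resource file
    // whose contents are not shown in this thread.
    private static final Set<String> IGNORED =
            new HashSet<>(Arrays.asList("unicode"));

    // Returns a charset only when the label is not ignored and the JVM
    // knows it; otherwise the caller decides what to do next.
    static Optional<Charset> fromMetaLabel(String label) {
        if (label == null) {
            return Optional.empty();
        }
        String name = label.trim().toLowerCase(Locale.ROOT);
        if (name.isEmpty() || IGNORED.contains(name)) {
            return Optional.empty();
        }
        try {
            return Optional.of(Charset.forName(name));
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // Falling back to UTF-8 here is the reporter's suggestion, shown as
        // one possible policy.
        Charset cs = fromMetaLabel("unicode").orElse(StandardCharsets.UTF_8);
        System.out.println(cs.name());
    }
}
```

Returning an empty Optional rather than a hard-coded default keeps the decision with the caller, which could also hand over to another encoding detector.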





[jira] [Resolved] (TIKA-2590) ExcelExtractor: cannot choose listening to the selected records only

2018-03-07 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2590.
---
   Resolution: Fixed
Fix Version/s: 1.18

> ExcelExtractor: cannot choose listening to the selected records only
> 
>
> Key: TIKA-2590
> URL: https://issues.apache.org/jira/browse/TIKA-2590
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.17
>Reporter: Grigoriy Alekseev
>Priority: Critical
> Fix For: 1.18, 2.0.0
>
>
> The listenForAllRecords argument is always reset to 'true', so the 
> 'else' branch is never reached. This may cause incorrect text extraction when 
> records with certain unsupported types (e.g. SharedFormula) are present in a 
> file.
> {code:java}
> public void processFile(DirectoryNode root, boolean 
> listenForAllRecords)
> throws IOException, SAXException, TikaException {
> // Set up listener and register the records we want to process
> HSSFRequest hssfRequest = new HSSFRequest();
> listenForAllRecords = true;
> if (listenForAllRecords) {
> hssfRequest.addListenerForAllRecords(formatListener);
> } else {
> hssfRequest.addListener(formatListener, BOFRecord.sid);
> hssfRequest.addListener(formatListener, EOFRecord.sid);
> hssfRequest.addListener(formatListener, 
> DateWindow1904Record.sid);
> hssfRequest.addListener(formatListener, CountryRecord.sid);
> hssfRequest.addListener(formatListener, BoundSheetRecord.sid);
> hssfRequest.addListener(formatListener, SSTRecord.sid);
> hssfRequest.addListener(formatListener, FormulaRecord.sid);
> hssfRequest.addListener(formatListener, LabelRecord.sid);
> hssfRequest.addListener(formatListener, LabelSSTRecord.sid);
> hssfRequest.addListener(formatListener, NumberRecord.sid);
> hssfRequest.addListener(formatListener, RKRecord.sid);
> hssfRequest.addListener(formatListener, StringRecord.sid);
> hssfRequest.addListener(formatListener, HyperlinkRecord.sid);
> hssfRequest.addListener(formatListener, TextObjectRecord.sid);
> hssfRequest.addListener(formatListener, SeriesTextRecord.sid);
> hssfRequest.addListener(formatListener, FormatRecord.sid);
> hssfRequest.addListener(formatListener, 
> ExtendedFormatRecord.sid);
> hssfRequest.addListener(formatListener, 
> DrawingGroupRecord.sid);
> if 
> (extractor.officeParserConfig.getIncludeHeadersAndFooters()) {
> hssfRequest.addListener(formatListener, HeaderRecord.sid);
> hssfRequest.addListener(formatListener, FooterRecord.sid);
> }
> }
> {code}





[jira] [Resolved] (TIKA-2591) Some tiffs (Big Endian with fax compression) are showing up as x-tarr

2018-03-07 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2591.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Thank you [~schmiddc] and [~gagravarr]!

> Some tiffs (Big Endian with fax compression) are showing up as x-tarr
> -
>
> Key: TIKA-2591
> URL: https://issues.apache.org/jira/browse/TIKA-2591
> Project: Tika
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.16
> Environment: Tika, running in a java application and a unit-test 
> (windows and mac environments)
>Reporter: daniel schmidt
>Priority: Major
>  Labels: newbie
> Fix For: 1.18, 2.0.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I have found that a certain tiff that we manage is now reporting 
> application/x-tar in Tika where it previously reported as a tiff 
> (image/tiff). 
> Observe this code in ArchiveStreamFactory, detect method.
>   // COMPRESS-117 - improve auto-recognition
>         if (signatureLength >= TAR_HEADER_SIZE) {
>             TarArchiveInputStream tais = null;
>             try {
>                 tais = new TarArchiveInputStream(new 
> ByteArrayInputStream(tarHeader));
>                 // COMPRESS-191 - verify the header checksum
>                 if (tais.getNextTarEntry().isCheckSumOK()) {
>                     return TAR;
>                 }
>             } catch (final Exception e) { // NOPMD // NOSONAR
>                 // can generate IllegalArgumentException as well
>                 // as IOException
>                 // autodetection, simply not a TAR
>                 // ignored
>             } finally {
>                 IOUtils.closeQuietly(tais);
>             }
> What I find is that most TIFFs, when they get to tais.getNextTarEntry(), fail 
> with an exception (i.e. fall into the "simply not a TAR" case). However, this 
> TIFF actually does NOT fail here. This somewhat makes sense, as a 
> fax-compressed TIFF has a tar-like internal structure.
> Note, the CompositeDetector class eventually does recognize it as a proper 
> TIFF as it loops through its detectors in its detect method. It is detected 
> as TIFF in the MimeTypes class, which is one of the implementations of the 
> Detector interface.
>  
>     public MediaType detect(InputStream input, Metadata metadata)
>             throws IOException {
>         MediaType type = MediaType.OCTET_STREAM;
>         for (Detector detector : getDetectors()) {
>             //short circuit via OverrideDetector
>             //can't rely on ordering because subsequent detector may
>             //change Override's to a specialization of Override's
>             if (detector instanceof OverrideDetector &&        
> metadata.get(TikaCoreProperties.CONTENT_TYPE_OVERRIDE) != null) {
>                 return detector.detect(input, metadata);
>             }
>             MediaType detected = detector.detect(input, metadata);
>             if (registry.isSpecializationOf(detected, type)) {
>                 type = detected;
>             }
>         }
>         return type;
> However, since image/tiff isn't a specialization of application/x-tar, the 
> loop does not replace the type with image/tiff.
> My fix was to add a  "" to the 
> definition for image/tiff in the tika-mimetypes.xml file
>  
>  
>  





[jira] [Resolved] (TIKA-2592) HTML with charset unicode handled as utf-16 instead utf-8

2018-03-07 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2592.
---
   Resolution: Fixed
Fix Version/s: 2.0.0
   1.18

Thank you [~AndreasMeier] and [~kkrugler]!

> HTML with charset unicode handled as utf-16 instead utf-8
> -
>
> Key: TIKA-2592
> URL: https://issues.apache.org/jira/browse/TIKA-2592
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 2.0, 1.16, 1.17
>Reporter: Andreas Meier
>Priority: Minor
> Fix For: 1.18, 2.0.0
>
> Attachments: IANA Charset names.txt, 
> StandardCharsets_unsupported_by_IANA.txt, TestCharsetUnicodeHTML.html, 
> TestHTMLCharsetArabicCP1256.html, TestHTMLCharsetCP1256.html, 
> fix-for-TIKA2592-contributed-by-Andreas-Meier.patch
>
>
> HTML files are detected as utf-16 when meta content is set to "unicode".
> {code:XML}
> 
>  {code}
>  
> Shouldn't the default be utf-8?
> The attached sample file is shown correctly in:
> Chromium Version 55.0.2883.75
> Firefox 50.1.0
> IE 11
> I am aware that there is no charset "unicode" (available character encodings: 
> [http://www.iana.org/assignments/character-sets/character-sets.xhtml|http://www.iana.org/assignments/character-sets/character-sets.xhtml])
> Unfortunately there are many wrong encodings used out there.
> All unknown encodings should be validated or at least be set to default utf-8.
> Regards 
> Andreas





[jira] [Resolved] (TIKA-2594) Mail detected as application/xhtml+xml

2018-03-07 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2594.
---
   Resolution: Fixed
Fix Version/s: 2.0.0
   1.18

Thank you, [~AndreasMeier]!

> Mail detected as application/xhtml+xml
> --
>
> Key: TIKA-2594
> URL: https://issues.apache.org/jira/browse/TIKA-2594
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 2.0, 1.16, 1.17
>Reporter: Andreas Meier
>Priority: Major
> Fix For: 1.18, 2.0.0
>
> Attachments: TestMail_inline_xhtml_plus_image.eml
>
>
> The attached mail (message/rfc822) with inline xhtml is recognized as 
> application/xhtml+xml
> Regards
> Andreas





[jira] [Commented] (TIKA-2590) ExcelExtractor: cannot choose listening to the selected records only

2018-03-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390107#comment-16390107
 ] 

ASF GitHub Bot commented on TIKA-2590:
--

tballison closed pull request #225: TIKA-2590: restore the client's ability to 
choose what Excel file rec…
URL: https://github.com/apache/tika/pull/225
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git 
a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ExcelExtractor.java
 
b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ExcelExtractor.java
index 9146b8c7b..4ea8068de 100644
--- 
a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ExcelExtractor.java
+++ 
b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ExcelExtractor.java
@@ -284,7 +284,6 @@ public void processFile(DirectoryNode root, boolean 
listenForAllRecords)
 
 // Set up listener and register the records we want to process
 HSSFRequest hssfRequest = new HSSFRequest();
-listenForAllRecords = true;
 if (listenForAllRecords) {
 hssfRequest.addListenerForAllRecords(formatListener);
 } else {


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> ExcelExtractor: cannot choose listening to the selected records only
> 
>
> Key: TIKA-2590
> URL: https://issues.apache.org/jira/browse/TIKA-2590
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.17
>Reporter: Grigoriy Alekseev
>Priority: Critical
> Fix For: 2.0.0
>
>
> The listenForAllRecords argument is always reset to 'true', so the 
> 'else' branch is never reached. This may cause incorrect text extraction when 
> records with certain unsupported types (e.g. SharedFormula) are present in a 
> file.
> {code:java}
> public void processFile(DirectoryNode root, boolean 
> listenForAllRecords)
> throws IOException, SAXException, TikaException {
> // Set up listener and register the records we want to process
> HSSFRequest hssfRequest = new HSSFRequest();
> listenForAllRecords = true;
> if (listenForAllRecords) {
> hssfRequest.addListenerForAllRecords(formatListener);
> } else {
> hssfRequest.addListener(formatListener, BOFRecord.sid);
> hssfRequest.addListener(formatListener, EOFRecord.sid);
> hssfRequest.addListener(formatListener, 
> DateWindow1904Record.sid);
> hssfRequest.addListener(formatListener, CountryRecord.sid);
> hssfRequest.addListener(formatListener, BoundSheetRecord.sid);
> hssfRequest.addListener(formatListener, SSTRecord.sid);
> hssfRequest.addListener(formatListener, FormulaRecord.sid);
> hssfRequest.addListener(formatListener, LabelRecord.sid);
> hssfRequest.addListener(formatListener, LabelSSTRecord.sid);
> hssfRequest.addListener(formatListener, NumberRecord.sid);
> hssfRequest.addListener(formatListener, RKRecord.sid);
> hssfRequest.addListener(formatListener, StringRecord.sid);
> hssfRequest.addListener(formatListener, HyperlinkRecord.sid);
> hssfRequest.addListener(formatListener, TextObjectRecord.sid);
> hssfRequest.addListener(formatListener, SeriesTextRecord.sid);
> hssfRequest.addListener(formatListener, FormatRecord.sid);
> hssfRequest.addListener(formatListener, 
> ExtendedFormatRecord.sid);
> hssfRequest.addListener(formatListener, 
> DrawingGroupRecord.sid);
> if 
> (extractor.officeParserConfig.getIncludeHeadersAndFooters()) {
> hssfRequest.addListener(formatListener, HeaderRecord.sid);
> hssfRequest.addListener(formatListener, FooterRecord.sid);
> }
> }
> {code}





[jira] [Commented] (TIKA-2592) HTML with charset unicode handled as utf-16 instead utf-8

2018-03-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390087#comment-16390087
 ] 

Hudson commented on TIKA-2592:
--

FAILURE: Integrated in Jenkins build tika-2.x-windows #213 (See 
[https://builds.apache.org/job/tika-2.x-windows/213/])
TIKA-2592 -- ignore charsets not supported by IANA in html meta-headers 
(tallison: rev 7e2b1e7534268b40c8b4ef3ee20ed708bf2e383c)
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlEncodingDetector.java
* (add) 
tika-parsers/src/test/resources/test-documents/testHTML_charset_utf8.html
* (edit) CHANGES.txt
* (add) 
tika-parsers/src/main/resources/org/apache/tika/parser/html/StandardCharsets_unsupported_by_IANA.txt
* (add) 
tika-parsers/src/test/resources/test-documents/testHTML_charset_utf16le.html
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/html/HtmlParserTest.java


> HTML with charset unicode handled as utf-16 instead utf-8
> -
>
> Key: TIKA-2592
> URL: https://issues.apache.org/jira/browse/TIKA-2592
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 2.0, 1.16, 1.17
>Reporter: Andreas Meier
>Priority: Minor
> Attachments: IANA Charset names.txt, 
> StandardCharsets_unsupported_by_IANA.txt, TestCharsetUnicodeHTML.html, 
> TestHTMLCharsetArabicCP1256.html, TestHTMLCharsetCP1256.html, 
> fix-for-TIKA2592-contributed-by-Andreas-Meier.patch
>
>
> HTML files are detected as utf-16 when meta content is set to "unicode".
> {code:XML}
> 
>  {code}
>  
> Shouldn't the default be utf-8?
> The attached sample file is shown correctly in:
> Chromium Version 55.0.2883.75
> Firefox 50.1.0
> IE 11
> I am aware that there is no charset "unicode" (available character encodings: 
> [http://www.iana.org/assignments/character-sets/character-sets.xhtml|http://www.iana.org/assignments/character-sets/character-sets.xhtml])
> Unfortunately there are many wrong encodings used out there.
> All unknown encodings should be validated or at least be set to default utf-8.
> Regards 
> Andreas





tika-2.x-windows - Build # 213 - Still Failing

2018-03-07 Thread Apache Jenkins Server
The Apache Jenkins build system has built tika-2.x-windows (build #213)

Status: Still Failing

Check console output at https://builds.apache.org/job/tika-2.x-windows/213/ to 
view the results.

[jira] [Commented] (TIKA-2594) Mail detected as application/xhtml+xml

2018-03-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390059#comment-16390059
 ] 

Hudson commented on TIKA-2594:
--

SUCCESS: Integrated in Jenkins build Tika-trunk #1451 (See 
[https://builds.apache.org/job/Tika-trunk/1451/])
TIKA-2594 -- improve eml detection for those starting with Subject: and 
(tallison: 
[https://github.com/apache/tika/commit/09031046e5bece75ed22d9ee9b184ec49a14f99a])
* (add) 
tika-parsers/src/test/resources/test-documents/testEML_embedded_xhtml_and_img.eml
* (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
* (edit) tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java


> Mail detected as application/xhtml+xml
> --
>
> Key: TIKA-2594
> URL: https://issues.apache.org/jira/browse/TIKA-2594
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 2.0, 1.16, 1.17
>Reporter: Andreas Meier
>Priority: Major
> Attachments: TestMail_inline_xhtml_plus_image.eml
>
>
> The attached mail (message/rfc822) with inline xhtml is recognized as 
> application/xhtml+xml
> Regards
> Andreas





[jira] [Commented] (TIKA-2594) Mail detected as application/xhtml+xml

2018-03-07 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390028#comment-16390028
 ] 

Luis Filipe Nassif commented on TIKA-2594:
--

We have used that magic restricted to 0:1000 for a long time, with very few 
false positives, along with:

{code}

 
 
 
{code}

> Mail detected as application/xhtml+xml
> --
>
> Key: TIKA-2594
> URL: https://issues.apache.org/jira/browse/TIKA-2594
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 2.0, 1.16, 1.17
>Reporter: Andreas Meier
>Priority: Major
> Attachments: TestMail_inline_xhtml_plus_image.eml
>
>
> The attached mail (message/rfc822) with inline xhtml is recognized as 
> application/xhtml+xml
> Regards
> Andreas





[jira] [Commented] (TIKA-2594) Mail detected as application/xhtml+xml

2018-03-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389944#comment-16389944
 ] 

Hudson commented on TIKA-2594:
--

FAILURE: Integrated in Jenkins build tika-2.x-windows #212 (See 
[https://builds.apache.org/job/tika-2.x-windows/212/])
TIKA-2594 -- improve eml detection for those starting with Subject: and 
(tallison: rev 09031046e5bece75ed22d9ee9b184ec49a14f99a)
* (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
* (edit) tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java
* (add) 
tika-parsers/src/test/resources/test-documents/testEML_embedded_xhtml_and_img.eml


> Mail detected as application/xhtml+xml
> --
>
> Key: TIKA-2594
> URL: https://issues.apache.org/jira/browse/TIKA-2594
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 2.0, 1.16, 1.17
>Reporter: Andreas Meier
>Priority: Major
> Attachments: TestMail_inline_xhtml_plus_image.eml
>
>
> The attached mail (message/rfc822) with inline xhtml is recognized as 
> application/xhtml+xml
> Regards
> Andreas





tika-2.x-windows - Build # 212 - Still Failing

2018-03-07 Thread Apache Jenkins Server
The Apache Jenkins build system has built tika-2.x-windows (build #212)

Status: Still Failing

Check console output at https://builds.apache.org/job/tika-2.x-windows/212/ to 
view the results.

[jira] [Commented] (TIKA-1466) Enable overriding of mimetype glob pattern definitions

2018-03-07 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389938#comment-16389938
 ] 

Luis Filipe Nassif commented on TIKA-1466:
--

I thought about logging every custom-mimetype override that is applied, so the 
user is warned about it. We could additionally add a specific attribute to the 
mimetype definition XML to indicate that the custom definition must override 
the default one instead of aborting. As for multiple conflicting custom mimes 
from different (external) projects, Tika currently aborts, which is already a 
problem today.
 
So I think this needs additional discussion and should not be addressed in the 
next release. I will copy/paste this discussion into the JIRA issue.
 
I would still like to see the detection of MTS videos fixed, but it conflicts 
with another existing mime glob. Is there any workaround for this specific 
case? If so, I can open a separate ticket.

> Enable overriding of mimetype glob pattern definitions
> --
>
> Key: TIKA-1466
> URL: https://issues.apache.org/jira/browse/TIKA-1466
> Project: Tika
>  Issue Type: Improvement
>  Components: mime
>Affects Versions: 1.6
>Reporter: Luis Filipe Nassif
>Priority: Major
>
> I think it is important to enable an overriding of the default 
> tika-mimetypes.xml glob pattern definitions within a custom-mimetypes.xml. 
> Currently, you can not define in a custom mimetype an already used glob 
> pattern, even if you redefine in custom-mimetypes.xml the first mimetype 
> using the conflicting glob pattern. The same extension can be used by 
> different applications in different domains or datasets. 





[jira] [Created] (TIKA-2601) Invalid XHTML output for some WORD documents

2018-03-07 Thread Filip (JIRA)
Filip created TIKA-2601:
---

 Summary: Invalid XHTML output for some WORD documents
 Key: TIKA-2601
 URL: https://issues.apache.org/jira/browse/TIKA-2601
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.17
 Environment: Linked is a sample document with its corresponding output.
Reporter: Filip
 Attachments: Test.doc, test.html

In some WORD (.doc, .docx) documents the XHTML elements are not closed 
properly. This usually happens when there are link elements () as well as 
italic or bold elements ().

 

Fix should be done in 
[https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java]





[jira] [Comment Edited] (TIKA-1020) Excel 2010 parser missing cell values are not reported resulting in missing columns values

2018-03-07 Thread Radim Rehurek (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389884#comment-16389884
 ] 

Radim Rehurek edited comment on TIKA-1020 at 3/7/18 5:57 PM:
-

We just hit this bug too.

I say "bug" because Excel spreadsheets are really structured tables, just like 
[~arodkin] explained 5 years ago. Interpreting them as unorganized cells makes 
little sense.

[~tpalsulich] IMO empty rows could be reported too, but in our use-case, the 
critical thing is not to have jumbled records (caused by missing cells in a 
single row).


was (Author: piskvorky):
We just hit this bug too.

I say "bug" because Excel spreadsheets are really structured tables, just like 
[~arodkin] explained 5 years ago. Interpreting them as unorganized cells makes 
little sense.

[~tpalsulich] empty rows could be reported too, but in our use-case, the 
critical thing is not to have jumbled records (caused by missing cells in a 
single row).

> Excel 2010 parser missing cell values are not reported resulting in missing 
> columns values
> --
>
> Key: TIKA-1020
> URL: https://issues.apache.org/jira/browse/TIKA-1020
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.2
> Environment: java 1.6 & 1.7 
>Reporter: Neil Blue
>Priority: Major
>  Labels: newbie, patch
>
> When parsing an Excel 2010 table, if a worksheet has a missing value, then it 
> is not reported in the SAX handler. As a result, a missing value can result in 
> misaligned data.
> For example, given the table:
> {code:title=Bar.java|borderStyle=solid}
> A B C
> 1 2 3
> 4   6
> 7 8 9
> {code}
> the returned sax handler reports elements
> {code:title=Bar.java|borderStyle=solid}
> ABC
> 123
> 46
> 789
> {code}
> As a result, the handler can detect that the third row has incomplete cell 
> values, but it is ambiguous which columns have missing data.
> As a possible fix for this, Excel 2010 XML data contains the cell reference 
> value, which could be returned to the SAX handler as an attribute. 
> {code:title=Bar.java|borderStyle=solid}
> *** XSSFExcelExtractorDecorator.java2012-11-08 10:51:55.881207100 +
> --- XSSFExcelExtractorDecorator.java.1  2012-11-08 10:59:02.972223700 +
> ***
> *** 200,206 
>   
>  public void cell(String cellRef, String formattedValue) {
> try {
> !  xhtml.startElement("td");
>   
>// Main cell contents
>xhtml.characters(formattedValue);
> --- 200,208 
>   
>  public void cell(String cellRef, String formattedValue) {
> try {
> !  AttributesImpl attributes = new AttributesImpl();
> !  attributes.addAttribute(null, "cellRef", "cellRef", null, 
> cellRef) ;
> !  xhtml.startElement("td",attributes);
>   
>// Main cell contents
>xhtml.characters(formattedValue);
> {code} 
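
The idea in the quoted description, using the cell reference to work out which columns are missing, can be sketched as follows. This is only an illustration under stated assumptions: the class and method names (CellRefRows, columnIndex, denseRow) are hypothetical and not part of Tika or POI, references are assumed to be A1-style strings such as "C3", and cells are assumed to arrive in column order within a row.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not Tika API): rebuilds a dense row from cell references. */
final class CellRefRows {

    /** Converts the column letters of an A1-style reference to a 0-based index. */
    static int columnIndex(String cellRef) {
        int col = 0;
        int i = 0;
        while (i < cellRef.length() && Character.isLetter(cellRef.charAt(i))) {
            // Column letters form a bijective base-26 number: A=1 ... Z=26, AA=27.
            col = col * 26 + (Character.toUpperCase(cellRef.charAt(i)) - 'A' + 1);
            i++;
        }
        if (i == 0) {
            throw new IllegalArgumentException("No column letters in: " + cellRef);
        }
        return col - 1;
    }

    /**
     * Places each value at the column named by its reference and pads skipped
     * columns with empty strings. Assumes cells are given in column order.
     */
    static List<String> denseRow(List<String> cellRefs, List<String> values) {
        List<String> row = new ArrayList<>();
        for (int i = 0; i < cellRefs.size(); i++) {
            int col = columnIndex(cellRefs.get(i));
            while (row.size() < col) {
                row.add(""); // placeholder for a cell the parser never reported
            }
            row.add(values.get(i));
        }
        return row;
    }

    public static void main(String[] args) {
        // Third row of the example table: only A3 and C3 are reported.
        List<String> row = denseRow(Arrays.asList("A3", "C3"), Arrays.asList("4", "6"));
        System.out.println(row); // prints [4, , 6]
    }
}
```

A consumer of the XHTML output could apply the same padding if the cell reference were exposed as an attribute on each td element, as the patch in the description proposes.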



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (TIKA-1020) Excel 2010 parser missing cell values are not reported resulting in missing columns values

2018-03-07 Thread Radim Rehurek (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389884#comment-16389884
 ] 

Radim Rehurek edited comment on TIKA-1020 at 3/7/18 5:57 PM:
-

We just hit this bug too.

I say "bug" because Excel spreadsheets are really structured tables, just like 
[~arodkin] explained 5 years ago. Interpreting them as unorganized cells makes 
little sense.

[~tpalsulich] IMO empty rows could be reported too, but in our use-case, the 
critical thing is not to have jumbled records caused by empty cells in a single 
row.


was (Author: piskvorky):
We just hit this bug too.

I say "bug" because Excel spreadsheets are really structured tables, just like 
[~arodkin] explained 5 years ago. Interpreting them as unorganized cells makes 
little sense.

[~tpalsulich] IMO empty rows could be reported too, but in our use-case, the 
critical thing is not to have jumbled records (caused by missing cells in a 
single row).

> Excel 2010 parser missing cell values are not reported resulting in missing 
> columns values
> --
>
> Key: TIKA-1020
> URL: https://issues.apache.org/jira/browse/TIKA-1020
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.2
> Environment: java 1.6 & 1.7 
>Reporter: Neil Blue
>Priority: Major
>  Labels: newbie, patch
>
> When parsing an Excel 2010 table, if a worksheet has a missing value, then it 
> is not reported in the SAX handler. As a result, a missing value can produce 
> misaligned data.
> For example given the table:
> {code:title=Bar.java|borderStyle=solid}
> A B C
> 1 2 3
> 4   6
> 7 8 9
> {code}
> the returned sax handler reports elements
> {code:title=Bar.java|borderStyle=solid}
> ABC
> 123
> 46
> 789
> {code}
> As a result the handler can detect that the third row has incomplete cell 
> values, but it is ambiguous which columns have missing data.
> As a possible fix, the Excel 2010 XML data contains the cell reference 
> value, which could be passed to the SAX handler as an attribute. 
> {code:title=Bar.java|borderStyle=solid}
> *** XSSFExcelExtractorDecorator.java2012-11-08 10:51:55.881207100 +
> --- XSSFExcelExtractorDecorator.java.1  2012-11-08 10:59:02.972223700 +
> ***
> *** 200,206 
>   
>  public void cell(String cellRef, String formattedValue) {
> try {
> !  xhtml.startElement("td");
>   
>// Main cell contents
>xhtml.characters(formattedValue);
> --- 200,208 
>   
>  public void cell(String cellRef, String formattedValue) {
> try {
> !  AttributesImpl attributes = new AttributesImpl();
> !  attributes.addAttribute(null, "cellRef", "cellRef", null, 
> cellRef) ;
> !  xhtml.startElement("td",attributes);
>   
>// Main cell contents
>xhtml.characters(formattedValue);
> {code} 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (TIKA-1020) Excel 2010 parser missing cell values are not reported resulting in missing columns values

2018-03-07 Thread Radim Rehurek (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389884#comment-16389884
 ] 

Radim Rehurek edited comment on TIKA-1020 at 3/7/18 5:56 PM:
-

We just hit this bug too.

I say "bug" because Excel spreadsheets are really structured tables, just like 
[~arodkin] explained 5 years ago. Interpreting them as unorganized cells makes 
little sense.

[~tpalsulich] empty rows could be reported too, but in our use-case, the 
critical thing is not to have jumbled records (caused by missing cells in a 
single row).


was (Author: piskvorky):
We just hit this bug too.

I say "bug" because Excel spreadsheets are really tables with rows, just like 
[~arodkin] explained 5 years ago. Interpreting them as unorganized cells makes 
little sense.

[~tpalsulich] empty rows could be reported too, but in our use-case, the 
critical thing is not to have jumbled records (caused by missing cells in a 
single row).

> Excel 2010 parser missing cell values are not reported resulting in missing 
> columns values
> --
>
> Key: TIKA-1020
> URL: https://issues.apache.org/jira/browse/TIKA-1020
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.2
> Environment: java 1.6 & 1.7 
>Reporter: Neil Blue
>Priority: Major
>  Labels: newbie, patch
>
> When parsing an Excel 2010 table, if a worksheet has a missing value, then it 
> is not reported in the SAX handler. As a result, a missing value can produce 
> misaligned data.
> For example given the table:
> {code:title=Bar.java|borderStyle=solid}
> A B C
> 1 2 3
> 4   6
> 7 8 9
> {code}
> the returned sax handler reports elements
> {code:title=Bar.java|borderStyle=solid}
> ABC
> 123
> 46
> 789
> {code}
> As a result the handler can detect that the third row has incomplete cell 
> values, but it is ambiguous which columns have missing data.
> As a possible fix, the Excel 2010 XML data contains the cell reference 
> value, which could be passed to the SAX handler as an attribute. 
> {code:title=Bar.java|borderStyle=solid}
> *** XSSFExcelExtractorDecorator.java2012-11-08 10:51:55.881207100 +
> --- XSSFExcelExtractorDecorator.java.1  2012-11-08 10:59:02.972223700 +
> ***
> *** 200,206 
>   
>  public void cell(String cellRef, String formattedValue) {
> try {
> !  xhtml.startElement("td");
>   
>// Main cell contents
>xhtml.characters(formattedValue);
> --- 200,208 
>   
>  public void cell(String cellRef, String formattedValue) {
> try {
> !  AttributesImpl attributes = new AttributesImpl();
> !  attributes.addAttribute(null, "cellRef", "cellRef", null, 
> cellRef) ;
> !  xhtml.startElement("td",attributes);
>   
>// Main cell contents
>xhtml.characters(formattedValue);
> {code} 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TIKA-1020) Excel 2010 parser missing cell values are not reported resulting in missing columns values

2018-03-07 Thread Radim Rehurek (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389884#comment-16389884
 ] 

Radim Rehurek commented on TIKA-1020:
-

We just hit this bug too.

I say "bug" because Excel spreadsheets are really tables with rows, just like 
[~arodkin] explained 5 years ago. Interpreting them as unorganized cells makes 
little sense.

[~tpalsulich] empty rows could be reported too, but in our use-case, the 
critical thing is not to have jumbled records (caused by missing cells in a 
single row).

> Excel 2010 parser missing cell values are not reported resulting in missing 
> columns values
> --
>
> Key: TIKA-1020
> URL: https://issues.apache.org/jira/browse/TIKA-1020
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.2
> Environment: java 1.6 & 1.7 
>Reporter: Neil Blue
>Priority: Major
>  Labels: newbie, patch
>
> When parsing an Excel 2010 table, if a worksheet has a missing value, then it 
> is not reported in the SAX handler. As a result, a missing value can produce 
> misaligned data.
> For example given the table:
> {code:title=Bar.java|borderStyle=solid}
> A B C
> 1 2 3
> 4   6
> 7 8 9
> {code}
> the returned sax handler reports elements
> {code:title=Bar.java|borderStyle=solid}
> ABC
> 123
> 46
> 789
> {code}
> As a result the handler can detect that the third row has incomplete cell 
> values, but it is ambiguous which columns have missing data.
> As a possible fix, the Excel 2010 XML data contains the cell reference 
> value, which could be passed to the SAX handler as an attribute. 
> {code:title=Bar.java|borderStyle=solid}
> *** XSSFExcelExtractorDecorator.java2012-11-08 10:51:55.881207100 +
> --- XSSFExcelExtractorDecorator.java.1  2012-11-08 10:59:02.972223700 +
> ***
> *** 200,206 
>   
>  public void cell(String cellRef, String formattedValue) {
> try {
> !  xhtml.startElement("td");
>   
>// Main cell contents
>xhtml.characters(formattedValue);
> --- 200,208 
>   
>  public void cell(String cellRef, String formattedValue) {
> try {
> !  AttributesImpl attributes = new AttributesImpl();
> !  attributes.addAttribute(null, "cellRef", "cellRef", null, 
> cellRef) ;
> !  xhtml.startElement("td",attributes);
>   
>// Main cell contents
>xhtml.characters(formattedValue);
> {code} 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Tika 1.18?

2018-03-07 Thread Luís Filipe Nassif
I thought about logging every custom-mimetype override that is applied, so the
user is warned about it. We could additionally add a specific attribute to the
mimetype definition XML to indicate that an entry must override the default one
instead of aborting. As for multiple conflicting custom mimes from different
(external) projects, Tika currently aborts, so that is already a problem today.

So I think this needs additional discussion and should not be addressed in
the next release. I will copy this discussion into the JIRA issue.

That said, I would like to see the detection of MTS videos fixed, but it
conflicts with another existing mime glob. Is there any workaround for this
specific case? If so, I can open a separate ticket.



On 2 Mar 2018 at 18:23, "Nick Burch" wrote:

On Fri, 2 Mar 2018, Luís Filipe Nassif wrote:

> If I make no progress on TIKA-1466 by 3/9, you can start the release
> process without it. But do you devs agree with the proposed change: allow
> overriding of glob patterns in custom-mimetypes.xml?
>

What happens if you have two different custom files which both claim the
same glob?

We have historically been a bit stricter about built-in types overriding,
in part to avoid people doing silly things by mistake, and in part to push
people a bit more towards contributing fixes/enhancements for built-in
types. I think the latter is less of a thing today, as we've a lot more
covered as standard, so it's just the former we need to worry about.

How do we help people know when they have conflicting overrides (possibly
from different projects), help them sensibly merge or turn off Tika
provided magic+definitions, and to alert them to when their copied +
customised version probably wants updating following a tika upgrade giving
a newer definition? If we do a better job of those than we currently do, then
I'm very happy to +1 it :)

Nick


[jira] [Commented] (TIKA-2600) Don't use md5 checksum due to changes to the release distribuition policy

2018-03-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389743#comment-16389743
 ] 

Hudson commented on TIKA-2600:
--

SUCCESS: Integrated in Jenkins build Tika-trunk #1450 (See 
[https://builds.apache.org/job/Tika-trunk/1450/])
TIKA-2600 -- remove md5 checksum, and switch sha-1 to sha-512 for (tallison: 
[https://github.com/apache/tika/commit/19017c91b245ebd72fefe005cd67d3da68037cc5])
* (edit) pom.xml


> Don't use md5 checksum due to changes to the release distribuition policy
> -
>
> Key: TIKA-2600
> URL: https://issues.apache.org/jira/browse/TIKA-2600
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Blocker
> Fix For: 1.18, 2.0.0
>
>
> To plagiarize from PDFBOX-4142:
> The release distribution policy was changed with regard to the checksums to 
> be used:
> Old policy :
> MUST provide a MD5-file
> SHOULD provide a SHA-file [SHA-512 recommended]
> New policy :
> MUST provide a SHA- or MD5-file
> SHOULD provide a SHA-file
> SHOULD NOT provide a MD5-file
> see http://www.apache.org/dev/release-distribution for further details



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TIKA-2600) Don't use md5 checksum due to changes to the release distribuition policy

2018-03-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389694#comment-16389694
 ] 

Hudson commented on TIKA-2600:
--

FAILURE: Integrated in Jenkins build tika-2.x-windows #211 (See 
[https://builds.apache.org/job/tika-2.x-windows/211/])
TIKA-2600 -- remove md5 checksum, and switch sha-1 to sha-512 for (tallison: 
rev 19017c91b245ebd72fefe005cd67d3da68037cc5)
* (edit) pom.xml


> Don't use md5 checksum due to changes to the release distribuition policy
> -
>
> Key: TIKA-2600
> URL: https://issues.apache.org/jira/browse/TIKA-2600
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Blocker
> Fix For: 1.18, 2.0.0
>
>
> To plagiarize from PDFBOX-4142:
> The release distribution policy was changed with regard to the checksums to 
> be used:
> Old policy :
> MUST provide a MD5-file
> SHOULD provide a SHA-file [SHA-512 recommended]
> New policy :
> MUST provide a SHA- or MD5-file
> SHOULD provide a SHA-file
> SHOULD NOT provide a MD5-file
> see http://www.apache.org/dev/release-distribution for further details



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


tika-2.x-windows - Build # 211 - Still Failing

2018-03-07 Thread Apache Jenkins Server
The Apache Jenkins build system has built tika-2.x-windows (build #211)

Status: Still Failing

Check console output at https://builds.apache.org/job/tika-2.x-windows/211/ to 
view the results.

[jira] [Commented] (TIKA-2579) Update to PDFBox 2.0.9 when available

2018-03-07 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389677#comment-16389677
 ] 

Tim Allison commented on TIKA-2579:
---

Release cycle for PDFBox 2.0.9 is just getting under way.

https://lists.apache.org/thread.html/63f4f538de8ba684a18c9514a64ebfb8fa30053dfb885e459ccd6741@%3Cdev.pdfbox.apache.org%3E

> Update to PDFBox 2.0.9 when available
> -
>
> Key: TIKA-2579
> URL: https://issues.apache.org/jira/browse/TIKA-2579
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.17
>Reporter: David Pilato
>Assignee: Tim Allison
>Priority: Major
>
> Hey team
>  
> We got this report in elasticsearch ingest attachment project: 
> [https://github.com/elastic/elasticsearch/issues/27198]
> Basically when a font is not available PDFBox is throwing an exception like
> {{2017/10/31 00:01:13.348 [WARN ] [elasticsearch[test][bulk][T#3]] [FontManager] Font not found: TimesNewRomanPS-BoldMT
> 2017/10/31 00:01:13.413 [ERROR] [elasticsearch[test][bulk][T#3]] [TrueTypeFont] An error occured when reading table cmap
> java.io.IOException: CMap subtype 14 not yet implemented
>  at org.apache.fontbox.ttf.CMAPEncodingEntry.processSubtype14(CMAPEncodingEntry.java:304)
>  at org.apache.fontbox.ttf.CMAPEncodingEntry.initSubtable(CMAPEncodingEntry.java:114)
>  at org.apache.fontbox.ttf.CMAPTable.initData(CMAPTable.java:100)
>  at org.apache.fontbox.ttf.TrueTypeFont.initializeTable(TrueTypeFont.java:280)
>  at org.apache.fontbox.ttf.AbstractTTFParser.parseTables(AbstractTTFParser.java:128)
>  at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:80)
>  at org.apache.fontbox.ttf.AbstractTTFParser.parseTTF(AbstractTTFParser.java:109)
>  at org.apache.fontbox.ttf.TTFParser.parseTTF(TTFParser.java:25)
>  at org.apache.fontbox.ttf.AbstractTTFParser.parseTTF(AbstractTTFParser.java:84)
>  at org.apache.fontbox.ttf.TTFParser.parseTTF(TTFParser.java:25)
>  at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.getTTFFont(PDTrueTypeFont.java:632)
>  at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.getFontWidth(PDTrueTypeFont.java:673)
>  at org.apache.pdfbox.pdmodel.font.PDSimpleFont.getFontWidth(PDSimpleFont.java:231)
>  at org.apache.pdfbox.pdmodel.font.PDSimpleFont.getSpaceWidth(PDSimpleFont.java:533)
>  at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:355)
>  at org.apache.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:62)
>  at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:557)
>  at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
>  at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
>  at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
>  at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:458)
>  at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:383)
>  at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:342)
>  at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:148)
>  at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:148)
>  at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>  at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>  at org.apache.tika.Tika.parseToString(Tika.java:537)}}
> This might have been solved by PDFParser with 
> https://issues.apache.org/jira/browse/PDFBOX-3997 which is available in 
> PDFBox 2.0.9 but Tika 1.17 is still using 2.0.8. See related issue 
> https://issues.apache.org/jira/browse/PDFBOX-3985. Unclear if that will 
> actually fix the problem reported but FWIW upgrading to 2.0.9 of PDFBox could 
> be useful.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TIKA-2594) Mail detected as application/xhtml+xml

2018-03-07 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389661#comment-16389661
 ] 

Tim Allison commented on TIKA-2594:
---

add the following or is this too lenient?

{noformat}
  
{noformat}

> Mail detected as application/xhtml+xml
> --
>
> Key: TIKA-2594
> URL: https://issues.apache.org/jira/browse/TIKA-2594
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 2.0, 1.16, 1.17
>Reporter: Andreas Meier
>Priority: Major
> Attachments: TestMail_inline_xhtml_plus_image.eml
>
>
> The attached mail (message/rfc822) with inline xhtml is recognized as 
> application/xhtml+xml
> Regards
> Andreas



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (TIKA-2600) Don't use md5 checksum due to changes to the release distribuition policy

2018-03-07 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2600.
---
   Resolution: Fixed
 Assignee: Tim Allison
Fix Version/s: 2.0.0
   1.18

> Don't use md5 checksum due to changes to the release distribuition policy
> -
>
> Key: TIKA-2600
> URL: https://issues.apache.org/jira/browse/TIKA-2600
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Blocker
> Fix For: 1.18, 2.0.0
>
>
> To plagiarize from PDFBOX-4142:
> The release distribution policy was changed with regard to the checksums to 
> be used:
> Old policy :
> MUST provide a MD5-file
> SHOULD provide a SHA-file [SHA-512 recommended]
> New policy :
> MUST provide a SHA- or MD5-file
> SHOULD provide a SHA-file
> SHOULD NOT provide a MD5-file
> see http://www.apache.org/dev/release-distribution for further details



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TIKA-2598) Fix dependency convergence

2018-03-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389631#comment-16389631
 ] 

Hudson commented on TIKA-2598:
--

FAILURE: Integrated in Jenkins build tika-2.x-windows #210 (See 
[https://builds.apache.org/job/tika-2.x-windows/210/])
TIKA-2598 -- unbreak the build (sorry, again!), fix missing javacpp (tallison: 
rev 474122bef3d906f81b91729a970a6ad7b5639a5c)
* (edit) tika-dl/pom.xml


> Fix dependency convergence
> --
>
> Key: TIKA-2598
> URL: https://issues.apache.org/jira/browse/TIKA-2598
> Project: Tika
>  Issue Type: Improvement
>  Components: packaging
>Affects Versions: 1.17
>Reporter: Guillaume Smet
>Assignee: Tim Allison
>Priority: Blocker
> Fix For: 2.0, 1.18
>
>
> Hi,
> We tried to upgrade Tika to 1.17 in Hibernate Search and we had some 
> dependency convergence issues:
> {code}
> Dependency convergence error for 
> com.healthmarketscience.jackcess:jackcess:2.1.8 paths to dependency are:
> +-org.hibernate:hibernate-search-engine:5.10.0-SNAPSHOT
>     +-org.apache.tika:tika-parsers:1.17
>          +-com.healthmarketscience.jackcess:jackcess:2.1.8
> and
> +-org.hibernate:hibernate-search-engine:5.10.0-SNAPSHOT
>      +-org.apache.tika:tika-parsers:1.17
>          +-com.healthmarketscience.jackcess:jackcess-encrypt:2.1.2
>              +-com.healthmarketscience.jackcess:jackcess:2.1.0
> {code}
> We could fix them downstream in Hibernate Search but I thought it would be 
> better if Tika could ensure the convergence of its dependencies using the 
> Maven enforcer plugin so that all the downstream projects can benefit from it.
> Thanks.
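
To make the convergence problem in the quoted report concrete, the check can be illustrated with a toy sketch. This is not how the Maven enforcer plugin is implemented; the class and method names (ConvergenceCheck, conflicts) are hypothetical, and the input is simply the list of groupId:artifactId:version coordinates taken from the two dependency paths shown above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical illustration: finds artifacts that appear with more than one version. */
final class ConvergenceCheck {

    /** Maps "groupId:artifactId" to its versions, keeping only entries with 2+ versions. */
    static Map<String, Set<String>> conflicts(List<String> coordinates) {
        Map<String, Set<String>> versions = new TreeMap<>();
        for (String gav : coordinates) {
            int lastColon = gav.lastIndexOf(':');
            String ga = gav.substring(0, lastColon);        // groupId:artifactId
            String version = gav.substring(lastColon + 1);  // version
            versions.computeIfAbsent(ga, k -> new TreeSet<>()).add(version);
        }
        // An artifact converges when every path resolves it to a single version.
        versions.values().removeIf(v -> v.size() < 2);
        return versions;
    }

    public static void main(String[] args) {
        List<String> seen = Arrays.asList(
                "org.apache.tika:tika-parsers:1.17",
                "com.healthmarketscience.jackcess:jackcess:2.1.8",
                "com.healthmarketscience.jackcess:jackcess-encrypt:2.1.2",
                "com.healthmarketscience.jackcess:jackcess:2.1.0");
        // prints {com.healthmarketscience.jackcess:jackcess=[2.1.0, 2.1.8]}
        System.out.println(conflicts(seen));
    }
}
```

Here jackcess is reached directly at 2.1.8 and transitively through jackcess-encrypt at 2.1.0, which is the mismatch the report describes.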



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


tika-2.x-windows - Build # 210 - Still Failing

2018-03-07 Thread Apache Jenkins Server
The Apache Jenkins build system has built tika-2.x-windows (build #210)

Status: Still Failing

Check console output at https://builds.apache.org/job/tika-2.x-windows/210/ to 
view the results.

[jira] [Commented] (TIKA-2598) Fix dependency convergence

2018-03-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389599#comment-16389599
 ] 

Hudson commented on TIKA-2598:
--

SUCCESS: Integrated in Jenkins build tika-branch-1x #4 (See 
[https://builds.apache.org/job/tika-branch-1x/4/])
TIKA-2598 -- unbreak the build (sorry, again!), fix missing javacpp (tallison: 
[https://github.com/apache/tika/commit/8163b598a73733554a8a87bde10a562291e4ec79])
* (edit) tika-dl/pom.xml


> Fix dependency convergence
> --
>
> Key: TIKA-2598
> URL: https://issues.apache.org/jira/browse/TIKA-2598
> Project: Tika
>  Issue Type: Improvement
>  Components: packaging
>Affects Versions: 1.17
>Reporter: Guillaume Smet
>Assignee: Tim Allison
>Priority: Blocker
> Fix For: 2.0, 1.18
>
>
> Hi,
> We tried to upgrade Tika to 1.17 in Hibernate Search and we had some 
> dependency convergence issues:
> {code}
> Dependency convergence error for 
> com.healthmarketscience.jackcess:jackcess:2.1.8 paths to dependency are:
> +-org.hibernate:hibernate-search-engine:5.10.0-SNAPSHOT
>     +-org.apache.tika:tika-parsers:1.17
>          +-com.healthmarketscience.jackcess:jackcess:2.1.8
> and
> +-org.hibernate:hibernate-search-engine:5.10.0-SNAPSHOT
>      +-org.apache.tika:tika-parsers:1.17
>          +-com.healthmarketscience.jackcess:jackcess-encrypt:2.1.2
>              +-com.healthmarketscience.jackcess:jackcess:2.1.0
> {code}
> We could fix them downstream in Hibernate Search but I thought it would be 
> better if Tika could ensure the convergence of its dependencies using the 
> Maven enforcer plugin so that all the downstream projects can benefit from it.
> Thanks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TIKA-2598) Fix dependency convergence

2018-03-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389595#comment-16389595
 ] 

Hudson commented on TIKA-2598:
--

SUCCESS: Integrated in Jenkins build Tika-trunk #1448 (See 
[https://builds.apache.org/job/Tika-trunk/1448/])
TIKA-2598 -- unbreak the build (sorry, again!), fix missing javacpp (tallison: 
[https://github.com/apache/tika/commit/474122bef3d906f81b91729a970a6ad7b5639a5c])
* (edit) tika-dl/pom.xml


> Fix dependency convergence
> --
>
> Key: TIKA-2598
> URL: https://issues.apache.org/jira/browse/TIKA-2598
> Project: Tika
>  Issue Type: Improvement
>  Components: packaging
>Affects Versions: 1.17
>Reporter: Guillaume Smet
>Assignee: Tim Allison
>Priority: Blocker
> Fix For: 2.0, 1.18
>
>
> Hi,
> We tried to upgrade Tika to 1.17 in Hibernate Search and we had some 
> dependency convergence issues:
> {code}
> Dependency convergence error for 
> com.healthmarketscience.jackcess:jackcess:2.1.8 paths to dependency are:
> +-org.hibernate:hibernate-search-engine:5.10.0-SNAPSHOT
>     +-org.apache.tika:tika-parsers:1.17
>          +-com.healthmarketscience.jackcess:jackcess:2.1.8
> and
> +-org.hibernate:hibernate-search-engine:5.10.0-SNAPSHOT
>      +-org.apache.tika:tika-parsers:1.17
>          +-com.healthmarketscience.jackcess:jackcess-encrypt:2.1.2
>              +-com.healthmarketscience.jackcess:jackcess:2.1.0
> {code}
> We could fix them downstream in Hibernate Search but I thought it would be 
> better if Tika could ensure the convergence of its dependencies using the 
> Maven enforcer plugin so that all the downstream projects can benefit from it.
> Thanks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (TIKA-2600) Don't use md5 checksum due to changes to the release distribuition policy

2018-03-07 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389556#comment-16389556
 ] 

Tim Allison edited comment on TIKA-2600 at 3/7/18 1:47 PM:
---

I'm pretty sure we discussed this at some point, but I can't quickly find what 
we decided.  Apologies if this is a duplicate issue...

Should we stop including md5 and swap SHA1 (with file ext: .sha) for SHA-512 
(with file ext: .sha512)?


was (Author: talli...@mitre.org):
I'm pretty sure we discussed this at some point, but I can't quickly find what 
we decided.  Apologies if this is a duplicate issue...

Should we stop including md5 and swap SHA1 (with file ext: .sha) for SHA512 
(with file ext: .sha512)?

> Don't use md5 checksum due to changes to the release distribuition policy
> -
>
> Key: TIKA-2600
> URL: https://issues.apache.org/jira/browse/TIKA-2600
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Blocker
>
> To plagiarize from PDFBOX-4142:
> The release distribution policy was changed with regard to the checksums to 
> be used:
> Old policy :
> MUST provide a MD5-file
> SHOULD provide a SHA-file [SHA-512 recommended]
> New policy :
> MUST provide a SHA- or MD5-file
> SHOULD provide a SHA-file
> SHOULD NOT provide a MD5-file
> see http://www.apache.org/dev/release-distribution for further details



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TIKA-2600) Don't use md5 checksum due to changes to the release distribuition policy

2018-03-07 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389556#comment-16389556
 ] 

Tim Allison commented on TIKA-2600:
---

I'm pretty sure we discussed this at some point, but I can't quickly find what 
we decided.  Apologies if this is a duplicate issue...

Should we stop including md5 and swap SHA1 (with file ext: .sha) for SHA512 
(with file ext: .sha512)?
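
For reference, a SHA-512 checksum of the kind proposed above can be computed with the JDK alone. The sketch below is illustrative only: the class name Sha512Checksum is hypothetical, it is not the mechanism used by the Tika build, and writing the digest next to the artifact with a .sha512 suffix simply mirrors the file extension mentioned in the comment.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper: computes SHA-512 checksums as lowercase hex strings. */
final class Sha512Checksum {

    private static MessageDigest newDigest() {
        try {
            return MessageDigest.getInstance("SHA-512");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-512 not available in this JRE", e);
        }
    }

    private static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    /** Digest of an in-memory byte array. */
    static String sha512Hex(byte[] data) {
        return hex(newDigest().digest(data));
    }

    /** Digest of a file, streamed in chunks so large artifacts are not loaded whole. */
    static String sha512Hex(Path file) throws IOException {
        MessageDigest md = newDigest();
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        return hex(md.digest());
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java Sha512Checksum <artifact>");
            return;
        }
        Path artifact = Paths.get(args[0]);
        String digest = sha512Hex(artifact);
        // Write the checksum next to the artifact, e.g. foo.jar.sha512.
        Files.write(Paths.get(args[0] + ".sha512"), digest.getBytes(StandardCharsets.US_ASCII));
        System.out.println(digest + "  " + artifact.getFileName());
    }
}
```

A SHA-512 digest is 64 bytes, so the hex string is always 128 characters long.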

> Don't use md5 checksum due to changes to the release distribuition policy
> -
>
> Key: TIKA-2600
> URL: https://issues.apache.org/jira/browse/TIKA-2600
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Blocker
>
> To plagiarize from PDFBOX-4142:
> The release distribution policy was changed with regard to the checksums to 
> be used:
> Old policy :
> MUST provide a MD5-file
> SHOULD provide a SHA-file [SHA-512 recommended]
> New policy :
> MUST provide a SHA- or MD5-file
> SHOULD provide a SHA-file
> SHOULD NOT provide a MD5-file
> see http://www.apache.org/dev/release-distribution for further details



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (TIKA-2600) Don't use md5 checksum due to changes to the release distribuition policy

2018-03-07 Thread Tim Allison (JIRA)
Tim Allison created TIKA-2600:
-

 Summary: Don't use md5 checksum due to changes to the release 
distribuition policy
 Key: TIKA-2600
 URL: https://issues.apache.org/jira/browse/TIKA-2600
 Project: Tika
  Issue Type: Task
Reporter: Tim Allison


To plagiarize from PDFBOX-4142:
The release distribution policy was changed with regard to the checksums to be 
used:

Old policy :

MUST provide a MD5-file
SHOULD provide a SHA-file [SHA-512 recommended]
New policy :

MUST provide a SHA- or MD5-file
SHOULD provide a SHA-file
SHOULD NOT provide a MD5-file
see http://www.apache.org/dev/release-distribution for further details



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TIKA-1518) Docker with Tika Server

2018-03-07 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389532#comment-16389532
 ] 

Tim Allison commented on TIKA-1518:
---

Hi [~davemeikle], with the new dockerfile-maven-plugin, I'm getting the 
following.  I'm behind a proxy, and I'm on windows, but you'd think localhost 
would work?!  Any recommendations?  Thank you!

{noformat}
[INFO] --- dockerfile-maven-plugin:1.3.7:build (default) @ tika-server ---
[INFO] Building Docker context C:\Users\tallison\Idea Projects\tika-asf2-git-2.x\tika-server
[INFO]
[INFO] Image will be built as apache/tika-server:2.0.0-SNAPSHOT
[INFO]
[WARNING] An attempt failed, will retry 1 more times
org.apache.maven.plugin.MojoExecutionException: Could not build image
    at com.spotify.plugin.dockerfile.BuildMojo.buildImage(BuildMojo.java:185)
    at com.spotify.plugin.dockerfile.BuildMojo.execute(BuildMojo.java:105)
    at com.spotify.plugin.dockerfile.AbstractDockerMojo.tryExecute(AbstractDockerMojo.java:246)
    at com.spotify.plugin.dockerfile.AbstractDockerMojo.execute(AbstractDockerMojo.java:235)
    at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:134)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:208)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:154)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:146)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:117)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:81)
    at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51)
    at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
    at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:309)
    at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:194)
    at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:107)
    at org.apache.maven.cli.MavenCli.execute(MavenCli.java:993)
    at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:345)
    at org.apache.maven.cli.MavenCli.main(MavenCli.java:191)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
    at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
    at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
Caused by: com.spotify.docker.client.exceptions.DockerException: java.util.concurrent.ExecutionException: com.spotify.docker.client.shaded.javax.ws.rs.ProcessingException: org.apache.http.conn.HttpHostConnectException: Connect to localhost:2375 [localhost/127.0.0.1, localhost/0:0:0:0:0:0:0:1] failed: Connection refused: connect
    at com.spotify.docker.client.DefaultDockerClient.propagate(DefaultDockerClient.java:2512)
    at com.spotify.docker.client.DefaultDockerClient.request(DefaultDockerClient.java:2443)
    at com.spotify.docker.client.DefaultDockerClient.version(DefaultDockerClient.java:501)
    at com.spotify.docker.client.DefaultDockerClient.authRegistryHeader(DefaultDockerClient.java:2555)
    at com.spotify.docker.client.DefaultDockerClient.build(DefaultDockerClient.java:1396)
    at com.spotify.docker.client.DefaultDockerClient.build(DefaultDockerClient.java:1365)
    at com.spotify.plugin.dockerfile.BuildMojo.buildImage(BuildMojo.java:178)
    ... 25 more
Caused by: java.util.concurrent.ExecutionException: com.spotify.docker.client.shaded.javax.ws.rs.ProcessingException: org.apache.http.conn.HttpHostConnectException: Connect to localhost:2375 [localhost/127.0.0.1, localhost/0:0:0:0:0:0:0:1] failed: Connection refused: connect
    at jersey.repackaged.com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299)
    at jersey.repackaged.com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286)
    at jersey.repackaged.com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
    at com.spotify.docker.client.DefaultDockerClient.request(DefaultDockerClient.java:2441)
    ... 30 more
Caused by: com.spotify.docker.client.shaded.javax.ws.rs.ProcessingException: 

[jira] [Commented] (TIKA-2598) Fix dependency convergence

2018-03-07 Thread Guillaume Smet (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389357#comment-16389357
 ] 

Guillaume Smet commented on TIKA-2598:
--

Hi [~talli...@mitre.org],

Sorry for the delay. So to fix the issue, you can use exclusions as you did. 
The drawback of this approach is that, if a new dependency adds a component 
with yet another version, you need to add new exclusions.
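
As a rough illustration of the exclusion approach, a fragment along these lines would drop the older transitive jackcess from jackcess-encrypt. The coordinates are taken from the convergence error quoted below; the exact shape of the actual patch is an assumption, not a copy of it.

{code:xml}
<!-- Sketch only: exclude the transitive jackcess 2.1.0 pulled in by
     jackcess-encrypt, so that only the directly declared 2.1.8 remains. -->
<dependency>
  <groupId>com.healthmarketscience.jackcess</groupId>
  <artifactId>jackcess-encrypt</artifactId>
  <version>2.1.2</version>
  <exclusions>
    <exclusion>
      <groupId>com.healthmarketscience.jackcess</groupId>
      <artifactId>jackcess</artifactId>
    </exclusion>
  </exclusions>
</dependency>
{code}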

The other option is to use a {{<dependencyManagement>}} section in your parent 
pom. Every dependency declared in this section is pinned to the version 
you define, and Maven enforces that version on transitive dependencies as well.
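
A minimal sketch of that approach in the parent pom, using the jackcess conflict from the issue description as the example. Pinning to 2.1.8 is an assumption based on the version tika-parsers already declares directly.

{code:xml}
<!-- Sketch only: pin jackcess to a single version for the whole build,
     including wherever it arrives transitively (e.g. via jackcess-encrypt). -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>com.healthmarketscience.jackcess</groupId>
      <artifactId>jackcess</artifactId>
      <version>2.1.8</version>
    </dependency>
  </dependencies>
</dependencyManagement>
{code}

With this in place, no per-dependency exclusion is needed when another component brings in a different jackcess version.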

It's usually the recommended approach, but seeing your patch, it looks like 
using exclusions is not that bad in your case.

Thanks for the quick action on this!

> Fix dependency convergence
> --
>
> Key: TIKA-2598
> URL: https://issues.apache.org/jira/browse/TIKA-2598
> Project: Tika
>  Issue Type: Improvement
>  Components: packaging
>Affects Versions: 1.17
>Reporter: Guillaume Smet
>Assignee: Tim Allison
>Priority: Blocker
> Fix For: 2.0, 1.18
>
>
> Hi,
> We tried to upgrade Tika to 1.17 in Hibernate Search and we had some 
> dependency convergence issues:
> {code}
> Dependency convergence error for 
> com.healthmarketscience.jackcess:jackcess:2.1.8 paths to dependency are:
> +-org.hibernate:hibernate-search-engine:5.10.0-SNAPSHOT
>     +-org.apache.tika:tika-parsers:1.17
>          +-com.healthmarketscience.jackcess:jackcess:2.1.8
> and
> +-org.hibernate:hibernate-search-engine:5.10.0-SNAPSHOT
>      +-org.apache.tika:tika-parsers:1.17
>          +-com.healthmarketscience.jackcess:jackcess-encrypt:2.1.2
>              +-com.healthmarketscience.jackcess:jackcess:2.1.0
> {code}
> We could fix them downstream in Hibernate Search but I thought it would be 
> better if Tika could ensure the convergence of its dependencies using the 
> Maven enforcer plugin so that all the downstream projects can benefit from it.
> Thanks.


