[jira] [Commented] (TIKA-1518) Docker with Tika Server
[ https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390381#comment-16390381 ] Hudson commented on TIKA-1518: -- UNSTABLE: Integrated in Jenkins build tika-2.x-windows #216 (See [https://builds.apache.org/job/tika-2.x-windows/216/]) TIKA-1518 -- turn dockerfile-maven-plugin back on. Accidentally (tallison: rev ca19696657cca2ec83160f9a16cbb36bfc35cde6) * (edit) tika-server/pom.xml > Docker with Tika Server > --- > > Key: TIKA-1518 > URL: https://issues.apache.org/jira/browse/TIKA-1518 > Project: Tika > Issue Type: New Feature >Reporter: Paul Ramirez >Assignee: Dave Meikle >Priority: Major > Fix For: 2.0, 1.17 > > Attachments: tika-server-docker-err-msg.txt > > > This version should be able to demonstrate as many of Apache Tika's > capabilities as possible. For instance with GDAL, Tesseract, and FFmpeg to > show parsers which require installation of other dependencies. In addition, > this should help move TIKA-1301 forward and should leverage the suggestion > made by [~lewismc] of a script which can pull down the latest version of > Apache Tika. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2594) Mail detected as application/xhtml+xml
[ https://issues.apache.org/jira/browse/TIKA-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390350#comment-16390350 ] Hudson commented on TIKA-2594: -- SUCCESS: Integrated in Jenkins build tika-branch-1x #6 (See [https://builds.apache.org/job/tika-branch-1x/6/]) TIKA-2594 improve eml detection via Luis Filipe Nassif (tallison: [https://github.com/apache/tika/commit/e12117c0e4792404eca825df0d2ae9925f0d5d18]) * (edit) tika-server/pom.xml * (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml > Mail detected as application/xhtml+xml > -- > > Key: TIKA-2594 > URL: https://issues.apache.org/jira/browse/TIKA-2594 > Project: Tika > Issue Type: Bug >Affects Versions: 2.0, 1.16, 1.17 >Reporter: Andreas Meier >Priority: Major > Fix For: 1.18, 2.0.0 > > Attachments: TestMail_inline_xhtml_plus_image.eml > > > The attached mail (message/rfc822) with inline xhtml is recognized as > application/xhtml+xml > Regards > Andreas -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-1518) Docker with Tika Server
[ https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390351#comment-16390351 ] Hudson commented on TIKA-1518: -- SUCCESS: Integrated in Jenkins build tika-branch-1x #6 (See [https://builds.apache.org/job/tika-branch-1x/6/]) TIKA-1518: Detach docker file build from build phase in Maven execution (david: [https://github.com/apache/tika/commit/42aa774f1e1d232ee9f98b58ace9f0417231716b]) * (edit) tika-server/pom.xml > Docker with Tika Server > --- > > Key: TIKA-1518 > URL: https://issues.apache.org/jira/browse/TIKA-1518 > Project: Tika > Issue Type: New Feature >Reporter: Paul Ramirez >Assignee: Dave Meikle >Priority: Major > Fix For: 2.0, 1.17 > > Attachments: tika-server-docker-err-msg.txt > > > This version should be able to demonstrate as many of Apache Tika's > capabilities as possible. For instance with GDAL, Tesseract, and FFmpeg to > show parsers which require installation of other dependencies. In addition, > this should help move TIKA-1301 forward and should leverage the suggestion > made by [~lewismc] of a script which can pull down the latest version of > Apache Tika. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2590) ExcelExtractor: cannot choose listening to the selected records only
[ https://issues.apache.org/jira/browse/TIKA-2590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390348#comment-16390348 ]

Hudson commented on TIKA-2590:

SUCCESS: Integrated in Jenkins build tika-branch-1x #6 (See [https://builds.apache.org/job/tika-branch-1x/6/])
TIKA-2590 update Changes.txt (tallison: [https://github.com/apache/tika/commit/c566cc472a4c9daf1e99fb80de9df2390b342350])
* (edit) CHANGES.txt

> ExcelExtractor: cannot choose listening to the selected records only
>
> Key: TIKA-2590
> URL: https://issues.apache.org/jira/browse/TIKA-2590
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.17
> Reporter: Grigoriy Alekseev
> Priority: Critical
> Fix For: 1.18, 2.0.0
>
> The listenForAllRecords argument is always reset to 'true', so the
> 'else' branch is never reached. This may cause incorrect text extraction when
> records with certain unsupported types (e.g. SharedFormula) are present in a
> file.
> {code:java}
> public void processFile(DirectoryNode root, boolean listenForAllRecords)
>         throws IOException, SAXException, TikaException {
>     // Set up listener and register the records we want to process
>     HSSFRequest hssfRequest = new HSSFRequest();
>     listenForAllRecords = true;
>     if (listenForAllRecords) {
>         hssfRequest.addListenerForAllRecords(formatListener);
>     } else {
>         hssfRequest.addListener(formatListener, BOFRecord.sid);
>         hssfRequest.addListener(formatListener, EOFRecord.sid);
>         hssfRequest.addListener(formatListener, DateWindow1904Record.sid);
>         hssfRequest.addListener(formatListener, CountryRecord.sid);
>         hssfRequest.addListener(formatListener, BoundSheetRecord.sid);
>         hssfRequest.addListener(formatListener, SSTRecord.sid);
>         hssfRequest.addListener(formatListener, FormulaRecord.sid);
>         hssfRequest.addListener(formatListener, LabelRecord.sid);
>         hssfRequest.addListener(formatListener, LabelSSTRecord.sid);
>         hssfRequest.addListener(formatListener, NumberRecord.sid);
>         hssfRequest.addListener(formatListener, RKRecord.sid);
>         hssfRequest.addListener(formatListener, StringRecord.sid);
>         hssfRequest.addListener(formatListener, HyperlinkRecord.sid);
>         hssfRequest.addListener(formatListener, TextObjectRecord.sid);
>         hssfRequest.addListener(formatListener, SeriesTextRecord.sid);
>         hssfRequest.addListener(formatListener, FormatRecord.sid);
>         hssfRequest.addListener(formatListener, ExtendedFormatRecord.sid);
>         hssfRequest.addListener(formatListener, DrawingGroupRecord.sid);
>         if (extractor.officeParserConfig.getIncludeHeadersAndFooters()) {
>             hssfRequest.addListener(formatListener, HeaderRecord.sid);
>             hssfRequest.addListener(formatListener, FooterRecord.sid);
>         }
>     }
> {code}
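The effect reported in TIKA-2590 can be reproduced without POI. The following is only a minimal sketch: the class ListenerFlagSketch, both method names, and the string labels standing in for record types are hypothetical, and the second method shows one possible shape of a correction (simply not overwriting the parameter), which is an assumption and not necessarily the change that was committed.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy reproduction of the reported control flow. Not Tika or POI code. */
final class ListenerFlagSketch {

    /** Mirrors the reported shape: the parameter is overwritten before it is read. */
    static List<String> registerAsReported(boolean listenForAllRecords) {
        List<String> registered = new ArrayList<>();
        listenForAllRecords = true; // the reported line: the caller's choice is discarded
        if (listenForAllRecords) {
            registered.add("ALL");
        } else {
            // Never reached, whatever the caller passed in.
            registered.add("BOF");
            registered.add("EOF");
        }
        return registered;
    }

    /** Hypothetical correction: leave the parameter alone so both branches are reachable. */
    static List<String> registerHonouringFlag(boolean listenForAllRecords) {
        List<String> registered = new ArrayList<>();
        if (listenForAllRecords) {
            registered.add("ALL");
        } else {
            registered.add("BOF");
            registered.add("EOF");
        }
        return registered;
    }

    public static void main(String[] args) {
        System.out.println(registerAsReported(false));
        System.out.println(registerHonouringFlag(false));
    }
}
```

Calling both methods with false shows the difference: the first still registers for all records, while the second registers only the selected ones.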
[jira] [Commented] (TIKA-2527) Typos in tika-mimetypes.xml
[ https://issues.apache.org/jira/browse/TIKA-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390349#comment-16390349 ] Hudson commented on TIKA-2527: -- SUCCESS: Integrated in Jenkins build tika-branch-1x #6 (See [https://builds.apache.org/job/tika-branch-1x/6/]) TIKA-2527 -- Various new mimes and typo fixes in tika-mimetypes.xml via (tallison: [https://github.com/apache/tika/commit/33f756fa4581ae3d1643ea7299121139a5c1bc6d]) * (edit) CHANGES.txt * (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml > Typos in tika-mimetypes.xml > --- > > Key: TIKA-2527 > URL: https://issues.apache.org/jira/browse/TIKA-2527 > Project: Tika > Issue Type: Bug > Components: core >Affects Versions: 2.0, 1.16, 1.17, 1.18 > Environment: ALL >Reporter: Andreas Meier >Priority: Minor > Fix For: 1.18, 2.0.0 > > Attachments: enhancement-for-TIKA2527-contributed-by-AMeier.patch, > fix-for-TIKA2527-contributed-by-AMeier-Fixed-adpcmmi.patch, > fix-for-binhexmatch-TIKA2527-contributed-by-AMeier.patch > > > Are these mimetypes in tika-mimetypes.xml > audio/x-adbcm instead audio/x-adpcm > {code:xml} {code} > and > audio/x-dec-adbcm instead audio/x-dec-adpcm > {code:xml} {code} > intended? > Couldn't find these mimetypes. > Regards > Andreas -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-1518) Docker with Tika Server
[ https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390335#comment-16390335 ] Hudson commented on TIKA-1518: -- SUCCESS: Integrated in Jenkins build Tika-trunk #1454 (See [https://builds.apache.org/job/Tika-trunk/1454/]) TIKA-1518: Detach docker file build from build phase in Maven execution (david: [https://github.com/apache/tika/commit/deb9e96f29d3a322804016d4533bb76de7c40e2c]) * (edit) tika-server/pom.xml TIKA-1518 -- turn dockerfile-maven-plugin back on. Accidentally (tallison: [https://github.com/apache/tika/commit/ca19696657cca2ec83160f9a16cbb36bfc35cde6]) * (edit) tika-server/pom.xml > Docker with Tika Server > --- > > Key: TIKA-1518 > URL: https://issues.apache.org/jira/browse/TIKA-1518 > Project: Tika > Issue Type: New Feature >Reporter: Paul Ramirez >Assignee: Dave Meikle >Priority: Major > Fix For: 2.0, 1.17 > > Attachments: tika-server-docker-err-msg.txt > > > This version should be able to demonstrate as many of Apache Tika's > capabilities as possible. For instance with GDAL, Tesseract, and FFmpeg to > show parsers which require installation of other dependencies. In addition, > this should help move TIKA-1301 forward and should leverage the suggestion > made by [~lewismc] of a script which can pull down the latest version of > Apache Tika. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2527) Typos in tika-mimetypes.xml
[ https://issues.apache.org/jira/browse/TIKA-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390333#comment-16390333 ] Hudson commented on TIKA-2527: -- SUCCESS: Integrated in Jenkins build Tika-trunk #1454 (See [https://builds.apache.org/job/Tika-trunk/1454/]) TIKA-2527 -- Various new mimes and typo fixes in tika-mimetypes.xml via (tallison: [https://github.com/apache/tika/commit/9b7154cf37871f5ef0874e972ec9208538e15e44]) * (edit) CHANGES.txt * (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml > Typos in tika-mimetypes.xml > --- > > Key: TIKA-2527 > URL: https://issues.apache.org/jira/browse/TIKA-2527 > Project: Tika > Issue Type: Bug > Components: core >Affects Versions: 2.0, 1.16, 1.17, 1.18 > Environment: ALL >Reporter: Andreas Meier >Priority: Minor > Fix For: 1.18, 2.0.0 > > Attachments: enhancement-for-TIKA2527-contributed-by-AMeier.patch, > fix-for-TIKA2527-contributed-by-AMeier-Fixed-adpcmmi.patch, > fix-for-binhexmatch-TIKA2527-contributed-by-AMeier.patch > > > Are these mimetypes in tika-mimetypes.xml > audio/x-adbcm instead audio/x-adpcm > {code:xml} {code} > and > audio/x-dec-adbcm instead audio/x-dec-adpcm > {code:xml} {code} > intended? > Couldn't find these mimetypes. > Regards > Andreas -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2594) Mail detected as application/xhtml+xml
[ https://issues.apache.org/jira/browse/TIKA-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390334#comment-16390334 ] Hudson commented on TIKA-2594: -- SUCCESS: Integrated in Jenkins build Tika-trunk #1454 (See [https://builds.apache.org/job/Tika-trunk/1454/]) TIKA-2594 improve eml detection via Luis Filipe Nassif (tallison: [https://github.com/apache/tika/commit/9c0a822419797f20a09388ccd235c7e70db9]) * (edit) tika-server/pom.xml * (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml > Mail detected as application/xhtml+xml > -- > > Key: TIKA-2594 > URL: https://issues.apache.org/jira/browse/TIKA-2594 > Project: Tika > Issue Type: Bug >Affects Versions: 2.0, 1.16, 1.17 >Reporter: Andreas Meier >Priority: Major > Fix For: 1.18, 2.0.0 > > Attachments: TestMail_inline_xhtml_plus_image.eml > > > The attached mail (message/rfc822) with inline xhtml is recognized as > application/xhtml+xml > Regards > Andreas -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-1518) Docker with Tika Server
[ https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390317#comment-16390317 ] Dave Meikle commented on TIKA-1518: --- [~talli...@mitre.org] - ah it looks like the proxy settings aren't being passed into the Docker container. Normally I've passed proxy settings via buildArgs to docker but I am not sure how this is handled by the Maven plugin. I've not done docker behind a proxy for a while. Can you try -X on the maven command to see what is being set? > Docker with Tika Server > --- > > Key: TIKA-1518 > URL: https://issues.apache.org/jira/browse/TIKA-1518 > Project: Tika > Issue Type: New Feature >Reporter: Paul Ramirez >Assignee: Dave Meikle >Priority: Major > Fix For: 2.0, 1.17 > > Attachments: tika-server-docker-err-msg.txt > > > This version should be able to demonstrate as many of Apache Tika's > capabilities as possible. For instance with GDAL, Tesseract, and FFmpeg to > show parsers which require installation of other dependencies. In addition, > this should help move TIKA-1301 forward and should leverage the suggestion > made by [~lewismc] of a script which can pull down the latest version of > Apache Tika. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2594) Mail detected as application/xhtml+xml
[ https://issues.apache.org/jira/browse/TIKA-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390308#comment-16390308 ] Hudson commented on TIKA-2594: -- UNSTABLE: Integrated in Jenkins build tika-2.x-windows #215 (See [https://builds.apache.org/job/tika-2.x-windows/215/]) TIKA-2594 improve eml detection via Luis Filipe Nassif (tallison: rev 9c0a822419797f20a09388ccd235c7e70db9) * (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml * (edit) tika-server/pom.xml > Mail detected as application/xhtml+xml > -- > > Key: TIKA-2594 > URL: https://issues.apache.org/jira/browse/TIKA-2594 > Project: Tika > Issue Type: Bug >Affects Versions: 2.0, 1.16, 1.17 >Reporter: Andreas Meier >Priority: Major > Fix For: 1.18, 2.0.0 > > Attachments: TestMail_inline_xhtml_plus_image.eml > > > The attached mail (message/rfc822) with inline xhtml is recognized as > application/xhtml+xml > Regards > Andreas -- This message was sent by Atlassian JIRA (v7.6.3#76005)
Re: Tika 1.18?
Sounds good to me, thanks Tim. Happy to line it up with PDFBox 2.0.9.

On 3/7/18, 1:16 PM, "Allison, Timothy B." wrote:

    All,

    I think I've made the updates that I wanted to make sure got in to 1.18. It looks like PDFBox is going to start their release cycle shortly. Should we wait for PDFBox 2.0.9? That may add a week or two to our release, although, frankly, it might not.

    We can start running the regression tests March 9(ish) and see if anything dire appears...

    Cheers,

    Tim
[jira] [Commented] (TIKA-1518) Docker with Tika Server
[ https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390309#comment-16390309 ] Hudson commented on TIKA-1518: -- UNSTABLE: Integrated in Jenkins build tika-2.x-windows #215 (See [https://builds.apache.org/job/tika-2.x-windows/215/]) TIKA-1518: Detach docker file build from build phase in Maven execution (david: rev deb9e96f29d3a322804016d4533bb76de7c40e2c) * (edit) tika-server/pom.xml > Docker with Tika Server > --- > > Key: TIKA-1518 > URL: https://issues.apache.org/jira/browse/TIKA-1518 > Project: Tika > Issue Type: New Feature >Reporter: Paul Ramirez >Assignee: Dave Meikle >Priority: Major > Fix For: 2.0, 1.17 > > Attachments: tika-server-docker-err-msg.txt > > > This version should be able to demonstrate as many of Apache Tika's > capabilities as possible. For instance with GDAL, Tesseract, and FFmpeg to > show parsers which require installation of other dependencies. In addition, > this should help move TIKA-1301 forward and should leverage the suggestion > made by [~lewismc] of a script which can pull down the latest version of > Apache Tika. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2527) Typos in tika-mimetypes.xml
[ https://issues.apache.org/jira/browse/TIKA-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390307#comment-16390307 ] Hudson commented on TIKA-2527: -- UNSTABLE: Integrated in Jenkins build tika-2.x-windows #215 (See [https://builds.apache.org/job/tika-2.x-windows/215/]) TIKA-2527 -- Various new mimes and typo fixes in tika-mimetypes.xml via (tallison: rev 9b7154cf37871f5ef0874e972ec9208538e15e44) * (edit) CHANGES.txt * (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml > Typos in tika-mimetypes.xml > --- > > Key: TIKA-2527 > URL: https://issues.apache.org/jira/browse/TIKA-2527 > Project: Tika > Issue Type: Bug > Components: core >Affects Versions: 2.0, 1.16, 1.17, 1.18 > Environment: ALL >Reporter: Andreas Meier >Priority: Minor > Fix For: 1.18, 2.0.0 > > Attachments: enhancement-for-TIKA2527-contributed-by-AMeier.patch, > fix-for-TIKA2527-contributed-by-AMeier-Fixed-adpcmmi.patch, > fix-for-binhexmatch-TIKA2527-contributed-by-AMeier.patch > > > Are these mimetypes in tika-mimetypes.xml > audio/x-adbcm instead audio/x-adpcm > {code:xml} {code} > and > audio/x-dec-adbcm instead audio/x-dec-adpcm > {code:xml} {code} > intended? > Couldn't find these mimetypes. > Regards > Andreas -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (TIKA-1518) Docker with Tika Server
[ https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390284#comment-16390284 ]

Dave Meikle edited comment on TIKA-1518 at 3/7/18 9:41 PM:

It is a choice we have to make. There are three main routes to Docker packaging that I have used:
# Automated builds that pull in pre-packaged artifacts, which then get bundled into an image on any change in a repository - like what we are doing in the docker-tikaserver approach, where it goes and downloads the signed JARs
# Automated builds that compile the code in the image (e.g. using the maven Docker image) and then package them
# Building a release image and then distributing that - which is what this does, but it requires us to decide when an official release is available and push it somewhere

The first and second are really good for leveraging things like Docker Hub to automatically build from your repository, whereas the third means you have to have Docker on your machine when you want to build an image.

I have never really liked number two, as it means the builds are always recompiles of the code each time a change is triggered, so you can easily be packaging up different code as the same version without realising it. The challenge with the approach in docker-tikaserver is maintenance when assets that are being pulled in move - i.e. when a release JAR is moved from dist.apache.org - but that could easily be solved by going to Nexus for the JARs based on the release packages.

I personally quite like the third approach, as it means you explicitly create an image that has its own life, and I was thinking that we could potentially add this to the release process, pushing the image from the release build to Docker Hub/Nexus/another repo so it is an official build. So just like when we do a mvn release, we can go to tika-server and do a mvn dockerfile:build and, if happy, a mvn dockerfile:push (once we bottom out where).

Not sure what others think?
> Docker with Tika Server > --- > > Key: TIKA-1518 > URL: https://issues.apache.org/jira/browse/TIKA-1518 > Project: Tika > Issue Type: New Feature >Reporter: Paul Ramirez >Assignee: Dave Meikle >Priority: Major > Fix For: 2.0, 1.17 > > Attachments: tika-server-docker-err-msg.txt > > > This version should be able to demonstrate as many of Apache Tika's > capabilities as possible. For instance with GDAL, Tesseract, and FFmpeg to > show parsers which require installation of other dependencies. In addition, > this should help move TIKA-1301 forward and should leverage the suggestion > made by [~lewismc] of a script which can pull down the latest version of > Apache Tika. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-1518) Docker with Tika Server
[ https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390284#comment-16390284 ]

Dave Meikle commented on TIKA-1518:

It is a choice we have to make. There are three main routes to Docker packaging that I have used:
# Automated builds that pull in pre-packaged artifacts, which then get bundled into an image on any change in a repository - like what we are doing in the docker-tikaserver approach, where it goes and downloads the signed JARs
# Automated builds that compile the code in the image (e.g. using the maven Docker image) and then package them
# Building a release image and then distributing that - which is what this does, but it requires us to decide when an official release is available and push it somewhere

The first and second are really good for leveraging things like Docker Hub to automatically build from your repository, whereas the third means you have to have Docker on your machine when you want to build an image.

I have never really liked number two, as it means the builds are always recompiles of the code each time a change is triggered, so you can easily be packaging up different code as the same version without realising it. The challenge with the approach in docker-tikaserver is maintenance when assets that are being pulled in move - i.e. when a release JAR is moved from dist.apache.org - but that could easily be solved by going to Nexus for the JARs based on the release packages.

I personally quite like the third approach, as it means you explicitly create an image that has its own life, and I was thinking that we could potentially add this to the release process, pushing the image from the release build to Docker Hub/Nexus/another repo so it is an official build.

Not sure what others think?
> Docker with Tika Server > --- > > Key: TIKA-1518 > URL: https://issues.apache.org/jira/browse/TIKA-1518 > Project: Tika > Issue Type: New Feature >Reporter: Paul Ramirez >Assignee: Dave Meikle >Priority: Major > Fix For: 2.0, 1.17 > > Attachments: tika-server-docker-err-msg.txt > > > This version should be able to demonstrate as many of Apache Tika's > capabilities as possible. For instance with GDAL, Tesseract, and FFmpeg to > show parsers which require installation of other dependencies. In addition, > this should help move TIKA-1301 forward and should leverage the suggestion > made by [~lewismc] of a script which can pull down the latest version of > Apache Tika. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2592) HTML with charset unicode handled as utf-16 instead utf-8
[ https://issues.apache.org/jira/browse/TIKA-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390282#comment-16390282 ]

Tim Allison commented on TIKA-2592:

{quote}I already have a small testset I run tika against (~300k+ files), that is also the reason for the numerous tickets I created lately.
{quote}
Great, and thank you!
{quote}Too many people and nightly builds stressing one vm may be too much.
{quote}
As long as you aren't active during release cycles, we won't stress it much. :D

Finally, if you want to get involved with the tika-eval module and/or if you have any code you've found helpful in evaluating different runs or single runs, let us know!

> HTML with charset unicode handled as utf-16 instead utf-8
>
> Key: TIKA-2592
> URL: https://issues.apache.org/jira/browse/TIKA-2592
> Project: Tika
> Issue Type: Improvement
> Affects Versions: 2.0, 1.16, 1.17
> Reporter: Andreas Meier
> Priority: Minor
> Fix For: 1.18, 2.0.0
>
> Attachments: IANA Charset names.txt,
> StandardCharsets_unsupported_by_IANA.txt, TestCharsetUnicodeHTML.html,
> TestHTMLCharsetArabicCP1256.html, TestHTMLCharsetCP1256.html,
> fix-for-TIKA2592-contributed-by-Andreas-Meier.patch
>
> HTML files are detected as utf-16 when meta content is set to "unicode".
> {code:XML}
>
> {code}
>
> Shouldn't the default be utf-8?
> The attached sample file is shown correctly in:
> Chromium Version 55.0.2883.75
> Firefox 50.1.0
> IE 11
> I am aware that there is no charset "unicode" (available character encodings:
> [http://www.iana.org/assignments/character-sets/character-sets.xhtml|http://www.iana.org/assignments/character-sets/character-sets.xhtml])
> Unfortunately there are many wrong encodings used out there.
> All unknown encodings should be validated or at least be set to default utf-8.
> Regards
> Andreas
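The fallback that the TIKA-2592 reporter asks for (treat the label "unicode", and any unknown label, as UTF-8) could look roughly like the sketch below. It is an illustration only: the class CharsetLabelSketch, the resolve method, and the override table are hypothetical and are not taken from Tika or from the attached patch. The override table is consulted first because a runtime may register "unicode" as an alias for another charset, in which case a plain lookup would never reach the default.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Toy charset-label resolver with a UTF-8 default. Not Tika code. */
final class CharsetLabelSketch {

    // Hypothetical table of labels whose meaning is forced, regardless of runtime aliases.
    private static final Map<String, Charset> OVERRIDES = new HashMap<>();
    static {
        OVERRIDES.put("unicode", StandardCharsets.UTF_8);
    }

    /** Resolves a declared charset label, falling back to UTF-8 when it is unusable. */
    static Charset resolve(String label) {
        if (label == null) {
            return StandardCharsets.UTF_8;
        }
        String key = label.trim().toLowerCase(Locale.ROOT);
        Charset forced = OVERRIDES.get(key);
        if (forced != null) {
            return forced;
        }
        try {
            if (Charset.isSupported(key)) {
                return Charset.forName(key);
            }
        } catch (IllegalCharsetNameException e) {
            // Syntactically invalid label (e.g. empty or containing spaces): use the default.
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        System.out.println(resolve("unicode"));
        System.out.println(resolve("ISO-8859-1"));
        System.out.println(resolve("no-such-charset-xyz"));
    }
}
```

Whether a forced mapping or validation against the document bytes is the better policy is a design choice the thread leaves open; the sketch only shows the "default to utf-8" option.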
[jira] [Comment Edited] (TIKA-1518) Docker with Tika Server
[ https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390275#comment-16390275 ] Tim Allison edited comment on TIKA-1518 at 3/7/18 9:33 PM: --- And sorry for letting the <\!-- --> slip through!!! was (Author: talli...@mitre.org): And sorry for letting the slip through!!! > Docker with Tika Server > --- > > Key: TIKA-1518 > URL: https://issues.apache.org/jira/browse/TIKA-1518 > Project: Tika > Issue Type: New Feature >Reporter: Paul Ramirez >Assignee: Dave Meikle >Priority: Major > Fix For: 2.0, 1.17 > > Attachments: tika-server-docker-err-msg.txt > > > This version should be able to demonstrate as many of Apache Tika's > capabilities as possible. For instance with GDAL, Tesseract, and FFmpeg to > show parsers which require installation of other dependencies. In addition, > this should help move TIKA-1301 forward and should leverage the suggestion > made by [~lewismc] of a script which can pull down the latest version of > Apache Tika. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-1518) Docker with Tika Server
[ https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390274#comment-16390274 ] Tim Allison commented on TIKA-1518: --- Your [commit|https://github.com/apache/tika/commit/deb9e96f29d3a322804016d4533bb76de7c40e2c#diff-332a9cfb880c4a30e2abc7af93035120] sure fixed it by turning it off. :D > Docker with Tika Server > --- > > Key: TIKA-1518 > URL: https://issues.apache.org/jira/browse/TIKA-1518 > Project: Tika > Issue Type: New Feature >Reporter: Paul Ramirez >Assignee: Dave Meikle >Priority: Major > Fix For: 2.0, 1.17 > > Attachments: tika-server-docker-err-msg.txt > > > This version should be able to demonstrate as many of Apache Tika's > capabilities as possible. For instance with GDAL, Tesseract, and FFmpeg to > show parsers which require installation of other dependencies. In addition, > this should help move TIKA-1301 forward and should leverage the suggestion > made by [~lewismc] of a script which can pull down the latest version of > Apache Tika. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-1518) Docker with Tika Server
[ https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390275#comment-16390275 ] Tim Allison commented on TIKA-1518: --- And sorry for letting the slip through!!! > Docker with Tika Server > --- > > Key: TIKA-1518 > URL: https://issues.apache.org/jira/browse/TIKA-1518 > Project: Tika > Issue Type: New Feature >Reporter: Paul Ramirez >Assignee: Dave Meikle >Priority: Major > Fix For: 2.0, 1.17 > > Attachments: tika-server-docker-err-msg.txt > > > This version should be able to demonstrate as many of Apache Tika's > capabilities as possible. For instance with GDAL, Tesseract, and FFmpeg to > show parsers which require installation of other dependencies. In addition, > this should help move TIKA-1301 forward and should leverage the suggestion > made by [~lewismc] of a script which can pull down the latest version of > Apache Tika. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-1518) Docker with Tika Server
[ https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390268#comment-16390268 ] Tim Allison commented on TIKA-1518: --- Not quite, different error this time (see attached file)...could be user error, I have no doubt! OTOH, do we want to require Docker on devs' computers? > Docker with Tika Server > --- > > Key: TIKA-1518 > URL: https://issues.apache.org/jira/browse/TIKA-1518 > Project: Tika > Issue Type: New Feature >Reporter: Paul Ramirez >Assignee: Dave Meikle >Priority: Major > Fix For: 2.0, 1.17 > > Attachments: tika-server-docker-err-msg.txt > > > This version should be able to demonstrate as many of Apache Tika's > capabilities as possible. For instance with GDAL, Tesseract, and FFmpeg to > show parsers which require installation of other dependencies. In addition, > this should help move TIKA-1301 forward and should leverage the suggestion > made by [~lewismc] of a script which can pull down the latest version of > Apache Tika. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (TIKA-1518) Docker with Tika Server
[ https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1518: -- Attachment: tika-server-docker-err-msg.txt > Docker with Tika Server > --- > > Key: TIKA-1518 > URL: https://issues.apache.org/jira/browse/TIKA-1518 > Project: Tika > Issue Type: New Feature >Reporter: Paul Ramirez >Assignee: Dave Meikle >Priority: Major > Fix For: 2.0, 1.17 > > Attachments: tika-server-docker-err-msg.txt > > > This version should be able to demonstrate as many of Apache Tika's > capabilities as possible. For instance with GDAL, Tesseract, and FFmpeg to > show parsers which require installation of other dependencies. In addition, > this should help move TIKA-1301 forward and should leverage the suggestion > made by [~lewismc] of a script which can pull down the latest version of > Apache Tika. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2591) Some tiffs (Big Endian with fax compression) are showing up as x-tarr
[ https://issues.apache.org/jira/browse/TIKA-2591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390253#comment-16390253 ] Hudson commented on TIKA-2591: -- SUCCESS: Integrated in Jenkins build tika-branch-1x #5 (See [https://builds.apache.org/job/tika-branch-1x/5/]) TIKA-2591 -- Add workaround to identify TIFFs that might confuse (tallison: [https://github.com/apache/tika/commit/b4047eb2d92ee4ae8d8e02d12079232419775a73]) * (edit) CHANGES.txt * (add) tika-parsers/src/test/java/org/apache/tika/parser/pkg/ZipContainerDetectorTest.java * (edit) tika-parsers/src/main/java/org/apache/tika/parser/pkg/ZipContainerDetector.java > Some tiffs (Big Endian with fax compression) are showing up as x-tarr > - > > Key: TIKA-2591 > URL: https://issues.apache.org/jira/browse/TIKA-2591 > Project: Tika > Issue Type: Bug > Components: core >Affects Versions: 1.16 > Environment: Tika, running in a java application and a unit-test > (windows and mac environments) >Reporter: daniel schmidt >Priority: Major > Labels: newbie > Fix For: 1.18, 2.0.0 > > Original Estimate: 24h > Remaining Estimate: 24h > > I have found that a certain tiff that we manage is now reporting > application/x-tar in Tika where it previously reported as a tiff > (image/tiff). > Observe this code in ArchiveStreamFactory, detect method. 
> // COMPRESS-117 - improve auto-recognition
> if (signatureLength >= TAR_HEADER_SIZE) {
>     TarArchiveInputStream tais = null;
>     try {
>         tais = new TarArchiveInputStream(new ByteArrayInputStream(tarHeader));
>         // COMPRESS-191 - verify the header checksum
>         if (tais.getNextTarEntry().isCheckSumOK()) {
>             return TAR;
>         }
>     } catch (final Exception e) { // NOPMD // NOSONAR
>         // can generate IllegalArgumentException as well
>         // as IOException
>         // autodetection, simply not a TAR
>         // ignored
>     } finally {
>         IOUtils.closeQuietly(tais);
>     }
> }
>
> What I find is that most TIFFs, when they get to tais.getNextTarEntry(), fail with an exception (i.e. fall into the "simply not a tar" case). However, this TIFF actually does NOT fail here. This somewhat makes sense, as the internal structure of a fax-compressed TIFF has a tar-like structure.
>
> Note, the CompositeDetector class eventually does recognize it as a proper TIFF as it loops through its detectors in its detect method. It is detected as TIFF in the MimeTypes class, which is one of the implementations of the Detector interface.
>
> public MediaType detect(InputStream input, Metadata metadata)
>         throws IOException {
>     MediaType type = MediaType.OCTET_STREAM;
>     for (Detector detector : getDetectors()) {
>         //short circuit via OverrideDetector
>         //can't rely on ordering because subsequent detector may
>         //change Override's to a specialization of Override's
>         if (detector instanceof OverrideDetector &&
>                 metadata.get(TikaCoreProperties.CONTENT_TYPE_OVERRIDE) != null) {
>             return detector.detect(input, metadata);
>         }
>         MediaType detected = detector.detect(input, metadata);
>         if (registry.isSpecializationOf(detected, type)) {
>             type = detected;
>         }
>     }
>     return type;
> }
>
> However, since image/tiff isn't a specialization of application/x-tar, it does not replace the type with image/tiff.
> My fix was to add a "" to the > definition for image/tiff in the tika-mimetypes.xml file > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
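The "specialization" rule described in the report can be illustrated with a minimal sketch. It assumes a toy hierarchy in which both application/x-tar and image/tiff are direct children of application/octet-stream; the class name `SpecializationSketch`, the `PARENT` map and both methods are hypothetical stand-ins, not Tika's actual classes or API.

```java
import java.util.List;
import java.util.Map;

/** Toy model of "only replace the type with a specialization"; not Tika's real classes. */
final class SpecializationSketch {
    // Hypothetical parent links (child type -> parent type), assumed for illustration only.
    private static final Map<String, String> PARENT = Map.of(
            "application/x-tar", "application/octet-stream",
            "image/tiff", "application/octet-stream");

    /** True if candidate is a strict descendant of base in the PARENT map. */
    static boolean isSpecializationOf(String candidate, String base) {
        String current = PARENT.get(candidate);
        while (current != null) {
            if (current.equals(base)) {
                return true;
            }
            current = PARENT.get(current);
        }
        return false;
    }

    /** Walks detector results in order and keeps a result only if it specializes the current type. */
    static String combine(List<String> detectorResults) {
        String type = "application/octet-stream";
        for (String detected : detectorResults) {
            if (isSpecializationOf(detected, type)) {
                type = detected;
            }
        }
        return type;
    }

    public static void main(String[] args) {
        // The tar guess arrives first, so the later tiff guess cannot replace it.
        System.out.println(combine(List.of("application/x-tar", "image/tiff")));
    }
}
```

Under these assumptions, whichever sibling type is reported first wins, which matches the behaviour described above: a later image/tiff result is dropped because it is not a descendant of application/x-tar.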
[jira] [Commented] (TIKA-2594) Mail detected as application/xhtml+xml
[ https://issues.apache.org/jira/browse/TIKA-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390251#comment-16390251 ] Hudson commented on TIKA-2594: -- SUCCESS: Integrated in Jenkins build tika-branch-1x #5 (See [https://builds.apache.org/job/tika-branch-1x/5/]) TIKA-2594 -- improve eml detection for those starting with Subject: and (tallison: [https://github.com/apache/tika/commit/b9e9e5b150aca851465e99017da6328c202ba127]) * (add) tika-parsers/src/test/resources/test-documents/testEML_embedded_xhtml_and_img.eml * (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml * (edit) tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java > Mail detected as application/xhtml+xml > -- > > Key: TIKA-2594 > URL: https://issues.apache.org/jira/browse/TIKA-2594 > Project: Tika > Issue Type: Bug >Affects Versions: 2.0, 1.16, 1.17 >Reporter: Andreas Meier >Priority: Major > Fix For: 1.18, 2.0.0 > > Attachments: TestMail_inline_xhtml_plus_image.eml > > > The attached mail (message/rfc822) with inline xhtml is recognized as > application/xhtml+xml > Regards > Andreas -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2592) HTML with charset unicode handled as utf-16 instead utf-8
[ https://issues.apache.org/jira/browse/TIKA-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390252#comment-16390252 ] Hudson commented on TIKA-2592: -- SUCCESS: Integrated in Jenkins build tika-branch-1x #5 (See [https://builds.apache.org/job/tika-branch-1x/5/]) TIKA-2592 -- ignore charsets not supported by IANA in html meta-headers (tallison: [https://github.com/apache/tika/commit/164c9286fc0933051e86ce0a209250aa51bee3bf]) * (edit) tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlEncodingDetector.java * (edit) CHANGES.txt * (add) tika-parsers/src/main/resources/org/apache/tika/parser/html/StandardCharsets_unsupported_by_IANA.txt * (edit) tika-parsers/src/test/java/org/apache/tika/parser/html/HtmlParserTest.java * (add) tika-parsers/src/test/resources/test-documents/testHTML_charset_utf16le.html * (add) tika-parsers/src/test/resources/test-documents/testHTML_charset_utf8.html > HTML with charset unicode handled as utf-16 instead utf-8 > - > > Key: TIKA-2592 > URL: https://issues.apache.org/jira/browse/TIKA-2592 > Project: Tika > Issue Type: Improvement >Affects Versions: 2.0, 1.16, 1.17 >Reporter: Andreas Meier >Priority: Minor > Fix For: 1.18, 2.0.0 > > Attachments: IANA Charset names.txt, > StandardCharsets_unsupported_by_IANA.txt, TestCharsetUnicodeHTML.html, > TestHTMLCharsetArabicCP1256.html, TestHTMLCharsetCP1256.html, > fix-for-TIKA2592-contributed-by-Andreas-Meier.patch > > > HTML files are detected as utf-16 when meta content is set to "unicode". > {code:XML} > > {code} > > Shouldn't the default be utf-8? > The attached sample file is shown correctly in: > Chromium Version 55.0.2883.75 > Firefox 50.1.0 > IE 11 > I am aware that there is no charset "unicode" (available character encodings: > [http://www.iana.org/assignments/character-sets/character-sets.xhtml|http://www.iana.org/assignments/character-sets/character-sets.xhtml]) > Unfortunately there are many wrong encodings used out there. 
> All unknown encodings should be validated or at least be set to default utf-8. > Regards > Andreas -- This message was sent by Atlassian JIRA (v7.6.3#76005)
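The reporter's suggestion (validate the declared charset, otherwise default to UTF-8) could look roughly like the sketch below. This is not the committed fix, which according to the commit message ignores such labels in the HTML meta headers. The class name `MetaCharsetSketch`, the `NOT_IANA` set and the `resolve` method are assumptions made for illustration; only the `java.nio.charset` calls are standard JDK API.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.Set;

/** Sketch of "fall back to UTF-8 for labels that are not real charset names". */
final class MetaCharsetSketch {
    // Hypothetical deny-list: labels a JVM may accept as aliases although they are not IANA names.
    private static final Set<String> NOT_IANA = Set.of("unicode");

    /** Resolves a meta-header charset label, defaulting to UTF-8 when it cannot be trusted. */
    static Charset resolve(String label) {
        if (label == null) {
            return StandardCharsets.UTF_8;
        }
        String name = label.trim().toLowerCase(Locale.ROOT);
        if (name.isEmpty() || NOT_IANA.contains(name)) {
            return StandardCharsets.UTF_8;
        }
        try {
            return Charset.isSupported(name) ? Charset.forName(name) : StandardCharsets.UTF_8;
        } catch (IllegalArgumentException e) {
            // IllegalCharsetNameException (syntactically invalid label) is a subclass.
            return StandardCharsets.UTF_8;
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve("unicode"));
        System.out.println(resolve("ISO-8859-1"));
    }
}
```

Whether a plain UTF-8 default is preferable to continuing with other detection heuristics is a design choice; the sketch only shows the simplest variant of what the report asks for.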
[jira] [Commented] (TIKA-2600) Don't use md5 checksum due to changes to the release distribuition policy
[ https://issues.apache.org/jira/browse/TIKA-2600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390250#comment-16390250 ] Hudson commented on TIKA-2600: -- SUCCESS: Integrated in Jenkins build tika-branch-1x #5 (See [https://builds.apache.org/job/tika-branch-1x/5/]) TIKA-2600 -- remove md5 checksum, and switch sha-1 to sha-512 for (tallison: [https://github.com/apache/tika/commit/32c19dee5bd4952f9f041f5fba218130fa02bdb5]) * (edit) pom.xml > Don't use md5 checksum due to changes to the release distribuition policy > - > > Key: TIKA-2600 > URL: https://issues.apache.org/jira/browse/TIKA-2600 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Assignee: Tim Allison >Priority: Blocker > Fix For: 1.18, 2.0.0 > > > To plagiarize from PDFBOX-4142: > The release distribution policy was changed with regard to the checksums to > be used: > Old policy : > MUST provide a MD5-file > SHOULD provide a SHA-file [SHA-512 recommended] > New policy : > MUST provide a SHA- or MD5-file > SHOULD provide a SHA-file > SHOULD NOT provide a MD5-file > see http://www.apache.org/dev/release-distribution for further details -- This message was sent by Atlassian JIRA (v7.6.3#76005)
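For anyone wanting to check a SHA-512 checksum file by hand, a minimal sketch using the JDK's `MessageDigest` follows. The actual change for this issue was made in pom.xml; the class name `Sha512Sketch` and the method `sha512Hex` are hypothetical, and the sketch assumes the running JDK provides the "SHA-512" algorithm (common JDK distributions do).

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Computes a lowercase hex SHA-512 digest, as typically found in a .sha512 file. */
final class Sha512Sketch {
    static String sha512Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-512");
            byte[] digest = md.digest(data);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b)); // two hex digits per byte
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Assumption: SHA-512 is available; surface a clear error if it is not.
            throw new IllegalStateException("SHA-512 not available on this JVM", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sha512Hex("abc".getBytes(StandardCharsets.UTF_8)));
    }
}
```

For a real release artifact, the bytes would come from the downloaded file and the result would be compared with the published checksum.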
[jira] [Commented] (TIKA-2590) ExcelExtractor: cannot choose listening to the selected records only
[ https://issues.apache.org/jira/browse/TIKA-2590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390254#comment-16390254 ] Hudson commented on TIKA-2590: -- SUCCESS: Integrated in Jenkins build tika-branch-1x #5 (See [https://builds.apache.org/job/tika-branch-1x/5/]) TIKA-2590 -- revert listenForAllRecords = false thanks to Grigoriy (tallison: [https://github.com/apache/tika/commit/a9b4b3676f9476ae78246aa2f962006502243a24]) * (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ExcelExtractor.java > ExcelExtractor: cannot choose listening to the selected records only > > > Key: TIKA-2590 > URL: https://issues.apache.org/jira/browse/TIKA-2590 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.17 >Reporter: Grigoriy Alekseev >Priority: Critical > Fix For: 1.18, 2.0.0 > > > The listenForAllRecords argument is being always reset to 'true', so the > 'else' branch is never reached. It may cause incorrect text extraction when > records with certain unsupported types (e.g. SharedFormula) are present in a > file. 
> {code:java}
> public void processFile(DirectoryNode root, boolean listenForAllRecords)
>         throws IOException, SAXException, TikaException {
>     // Set up listener and register the records we want to process
>     HSSFRequest hssfRequest = new HSSFRequest();
>     listenForAllRecords = true;
>     if (listenForAllRecords) {
>         hssfRequest.addListenerForAllRecords(formatListener);
>     } else {
>         hssfRequest.addListener(formatListener, BOFRecord.sid);
>         hssfRequest.addListener(formatListener, EOFRecord.sid);
>         hssfRequest.addListener(formatListener, DateWindow1904Record.sid);
>         hssfRequest.addListener(formatListener, CountryRecord.sid);
>         hssfRequest.addListener(formatListener, BoundSheetRecord.sid);
>         hssfRequest.addListener(formatListener, SSTRecord.sid);
>         hssfRequest.addListener(formatListener, FormulaRecord.sid);
>         hssfRequest.addListener(formatListener, LabelRecord.sid);
>         hssfRequest.addListener(formatListener, LabelSSTRecord.sid);
>         hssfRequest.addListener(formatListener, NumberRecord.sid);
>         hssfRequest.addListener(formatListener, RKRecord.sid);
>         hssfRequest.addListener(formatListener, StringRecord.sid);
>         hssfRequest.addListener(formatListener, HyperlinkRecord.sid);
>         hssfRequest.addListener(formatListener, TextObjectRecord.sid);
>         hssfRequest.addListener(formatListener, SeriesTextRecord.sid);
>         hssfRequest.addListener(formatListener, FormatRecord.sid);
>         hssfRequest.addListener(formatListener, ExtendedFormatRecord.sid);
>         hssfRequest.addListener(formatListener, DrawingGroupRecord.sid);
>         if (extractor.officeParserConfig.getIncludeHeadersAndFooters()) {
>             hssfRequest.addListener(formatListener, HeaderRecord.sid);
>             hssfRequest.addListener(formatListener, FooterRecord.sid);
>         }
>     }
> {code}
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
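The effect of the reassignment can be reproduced in isolation with a small sketch. It uses plain strings in place of POI record types; the class `ListenerFlagSketch` and its methods are hypothetical and only mirror the control flow of the snippet quoted in the issue.

```java
import java.util.ArrayList;
import java.util.List;

/** Shows why overwriting the parameter makes the else branch unreachable. */
final class ListenerFlagSketch {
    /** Mirrors the reported bug: the caller's choice is overwritten before it is read. */
    static List<String> registerBuggy(boolean listenForAllRecords) {
        listenForAllRecords = true; // the caller's value is lost here
        return register(listenForAllRecords);
    }

    /** Honors the caller's choice by leaving the parameter untouched. */
    static List<String> registerFixed(boolean listenForAllRecords) {
        return register(listenForAllRecords);
    }

    // Stand-in for the listener registration; the record names are placeholders.
    private static List<String> register(boolean all) {
        List<String> registered = new ArrayList<>();
        if (all) {
            registered.add("ALL");
        } else {
            registered.add("BOF");
            registered.add("EOF");
        }
        return registered;
    }

    public static void main(String[] args) {
        System.out.println(registerBuggy(false));
        System.out.println(registerFixed(false));
    }
}
```

With the reassignment in place, passing `false` still registers a listener for all records; without it, only the selected record types are registered.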
[jira] [Commented] (TIKA-2591) Some tiffs (Big Endian with fax compression) are showing up as x-tarr
[ https://issues.apache.org/jira/browse/TIKA-2591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390247#comment-16390247 ] Hudson commented on TIKA-2591: -- SUCCESS: Integrated in Jenkins build Tika-trunk #1453 (See [https://builds.apache.org/job/Tika-trunk/1453/]) TIKA-2591 -- Add workaround to identify TIFFs that might confuse (tallison: [https://github.com/apache/tika/commit/462ee4744fd426cfdb12539435627b25e789c912]) * (edit) tika-parsers/src/main/java/org/apache/tika/parser/pkg/ZipContainerDetector.java * (edit) CHANGES.txt * (add) tika-parsers/src/test/java/org/apache/tika/parser/pkg/ZipContainerDetectorTest.java > Some tiffs (Big Endian with fax compression) are showing up as x-tarr > - > > Key: TIKA-2591 > URL: https://issues.apache.org/jira/browse/TIKA-2591 > Project: Tika > Issue Type: Bug > Components: core >Affects Versions: 1.16 > Environment: Tika, running in a java application and a unit-test > (windows and mac environments) >Reporter: daniel schmidt >Priority: Major > Labels: newbie > Fix For: 1.18, 2.0.0 > > Original Estimate: 24h > Remaining Estimate: 24h > > I have found that a certain tiff that we manage is now reporting > application/x-tar in Tika where it previously reported as a tiff > (image/tiff). > Observe this code in ArchiveStreamFactory, detect method. 
> // COMPRESS-117 - improve auto-recognition
> if (signatureLength >= TAR_HEADER_SIZE) {
>     TarArchiveInputStream tais = null;
>     try {
>         tais = new TarArchiveInputStream(new ByteArrayInputStream(tarHeader));
>         // COMPRESS-191 - verify the header checksum
>         if (tais.getNextTarEntry().isCheckSumOK()) {
>             return TAR;
>         }
>     } catch (final Exception e) { // NOPMD // NOSONAR
>         // can generate IllegalArgumentException as well
>         // as IOException
>         // autodetection, simply not a TAR
>         // ignored
>     } finally {
>         IOUtils.closeQuietly(tais);
>     }
> }
>
> What I find is that most TIFFs, when they get to tais.getNextTarEntry(), fail with an exception (i.e. fall into the "simply not a tar" case). However, this TIFF actually does NOT fail here. This somewhat makes sense, as the internal structure of a fax-compressed TIFF has a tar-like structure.
>
> Note, the CompositeDetector class eventually does recognize it as a proper TIFF as it loops through its detectors in its detect method. It is detected as TIFF in the MimeTypes class, which is one of the implementations of the Detector interface.
>
> public MediaType detect(InputStream input, Metadata metadata)
>         throws IOException {
>     MediaType type = MediaType.OCTET_STREAM;
>     for (Detector detector : getDetectors()) {
>         //short circuit via OverrideDetector
>         //can't rely on ordering because subsequent detector may
>         //change Override's to a specialization of Override's
>         if (detector instanceof OverrideDetector &&
>                 metadata.get(TikaCoreProperties.CONTENT_TYPE_OVERRIDE) != null) {
>             return detector.detect(input, metadata);
>         }
>         MediaType detected = detector.detect(input, metadata);
>         if (registry.isSpecializationOf(detected, type)) {
>             type = detected;
>         }
>     }
>     return type;
> }
>
> However, since image/tiff isn't a specialization of application/x-tar, it does not replace the type with image/tiff.
> My fix was to add a "" to the > definition for image/tiff in the tika-mimetypes.xml file > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2590) ExcelExtractor: cannot choose listening to the selected records only
[ https://issues.apache.org/jira/browse/TIKA-2590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390246#comment-16390246 ] Hudson commented on TIKA-2590: -- SUCCESS: Integrated in Jenkins build Tika-trunk #1453 (See [https://builds.apache.org/job/Tika-trunk/1453/]) TIKA-2590: restore the client's ability to choose what Excel file (g.alekseev: [https://github.com/apache/tika/commit/c56c7c41a6c51e4cd4dac78b693bd883f1329264]) * (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ExcelExtractor.java TIKA-2590 update Changes.txt (tallison: [https://github.com/apache/tika/commit/947334cbf40bc6efef1cb488749213724bedb171]) * (edit) CHANGES.txt > ExcelExtractor: cannot choose listening to the selected records only > > > Key: TIKA-2590 > URL: https://issues.apache.org/jira/browse/TIKA-2590 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.17 >Reporter: Grigoriy Alekseev >Priority: Critical > Fix For: 1.18, 2.0.0 > > > The listenForAllRecords argument is being always reset to 'true', so the > 'else' branch is never reached. It may cause incorrect text extraction when > records with certain unsupported types (e.g. SharedFormula) are present in a > file. 
> {code:java}
> public void processFile(DirectoryNode root, boolean listenForAllRecords)
>         throws IOException, SAXException, TikaException {
>     // Set up listener and register the records we want to process
>     HSSFRequest hssfRequest = new HSSFRequest();
>     listenForAllRecords = true;
>     if (listenForAllRecords) {
>         hssfRequest.addListenerForAllRecords(formatListener);
>     } else {
>         hssfRequest.addListener(formatListener, BOFRecord.sid);
>         hssfRequest.addListener(formatListener, EOFRecord.sid);
>         hssfRequest.addListener(formatListener, DateWindow1904Record.sid);
>         hssfRequest.addListener(formatListener, CountryRecord.sid);
>         hssfRequest.addListener(formatListener, BoundSheetRecord.sid);
>         hssfRequest.addListener(formatListener, SSTRecord.sid);
>         hssfRequest.addListener(formatListener, FormulaRecord.sid);
>         hssfRequest.addListener(formatListener, LabelRecord.sid);
>         hssfRequest.addListener(formatListener, LabelSSTRecord.sid);
>         hssfRequest.addListener(formatListener, NumberRecord.sid);
>         hssfRequest.addListener(formatListener, RKRecord.sid);
>         hssfRequest.addListener(formatListener, StringRecord.sid);
>         hssfRequest.addListener(formatListener, HyperlinkRecord.sid);
>         hssfRequest.addListener(formatListener, TextObjectRecord.sid);
>         hssfRequest.addListener(formatListener, SeriesTextRecord.sid);
>         hssfRequest.addListener(formatListener, FormatRecord.sid);
>         hssfRequest.addListener(formatListener, ExtendedFormatRecord.sid);
>         hssfRequest.addListener(formatListener, DrawingGroupRecord.sid);
>         if (extractor.officeParserConfig.getIncludeHeadersAndFooters()) {
>             hssfRequest.addListener(formatListener, HeaderRecord.sid);
>             hssfRequest.addListener(formatListener, FooterRecord.sid);
>         }
>     }
> {code}
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-1518) Docker with Tika Server
[ https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390241#comment-16390241 ] Dave Meikle commented on TIKA-1518: --- {quote}I do have Docker installed, [0] but it is Windows, and I've noticed some, um, areas for improvement in Docker on Windows. {quote} I've found on Windows I have had to enable the "Expose daemon on tcp://localhost:2375 without TLS" in Docker for Windows to talk to it with many of the clients out there. Does this solve it for you? > Docker with Tika Server > --- > > Key: TIKA-1518 > URL: https://issues.apache.org/jira/browse/TIKA-1518 > Project: Tika > Issue Type: New Feature >Reporter: Paul Ramirez >Assignee: Dave Meikle >Priority: Major > Fix For: 2.0, 1.17 > > > This version should be able to demonstrate as many of Apache Tika's > capabilities as possible. For instance with GDAL, Tesseract, and FFmpeg to > show parsers which require installation of other dependencies. In addition, > this should help move TIKA-1301 forward and should leverage the suggestion > made by [~lewismc] of a script which can pull down the latest version of > Apache Tika. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
RE: Tika 1.18?
All, I think I've made the updates that I wanted to make sure got into 1.18. It looks like PDFBox is going to start their release cycle shortly. Should we wait for PDFBox 2.0.9? That may add a week or two to our release, although, frankly, it might not. We can start running the regression tests March 9(ish) and see if anything dire appears... Cheers, Tim
[jira] [Comment Edited] (TIKA-1518) Docker with Tika Server
[ https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390216#comment-16390216 ] Tim Allison edited comment on TIKA-1518 at 3/7/18 8:58 PM: --- bq. this is me getting too excited ?! I do have Docker installed, [0] but it is Windows, and I've noticed some, um, areas for improvement in Docker on Windows. Thank you! [0] {noformat} C:\stuff>docker -v Docker version 17.12.0-ce, build c97c6d6 {noformat} was (Author: talli...@mitre.org): bq. this is me getting too excited ?! I do have Docker installed, [0] but it is Windows, and I've noticed some, um, areas for improvement in Docker on Windows. Thank you! [0] {noformat} C:\stuff>docker -v Docker version 17.12.0-ce, build c97c6d6 {nformat} > Docker with Tika Server > --- > > Key: TIKA-1518 > URL: https://issues.apache.org/jira/browse/TIKA-1518 > Project: Tika > Issue Type: New Feature >Reporter: Paul Ramirez >Assignee: Dave Meikle >Priority: Major > Fix For: 2.0, 1.17 > > > This version should be able to demonstrate as many of Apache Tika's > capabilities as possible. For instance with GDAL, Tesseract, and FFmpeg to > show parsers which require installation of other dependencies. In addition, > this should help move TIKA-1301 forward and should leverage the suggestion > made by [~lewismc] of a script which can pull down the latest version of > Apache Tika. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (TIKA-1518) Docker with Tika Server
[ https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390216#comment-16390216 ] Tim Allison edited comment on TIKA-1518 at 3/7/18 8:58 PM: --- bq. this is me getting too excited ?! I do have Docker installed, [0] but it is Windows, and I've noticed some, um, areas for improvement in Docker on Windows. Thank you! [0] {noformat} C:\stuff>docker -v Docker version 17.12.0-ce, build c97c6d6 {nformat} was (Author: talli...@mitre.org): bq. this is me getting too excited ?! I do have Docker installed, but it is Windows, and I've noticed some, um, areas for improvement in Docker on Windows. Thank you! > Docker with Tika Server > --- > > Key: TIKA-1518 > URL: https://issues.apache.org/jira/browse/TIKA-1518 > Project: Tika > Issue Type: New Feature >Reporter: Paul Ramirez >Assignee: Dave Meikle >Priority: Major > Fix For: 2.0, 1.17 > > > This version should be able to demonstrate as many of Apache Tika's > capabilities as possible. For instance with GDAL, Tesseract, and FFmpeg to > show parsers which require installation of other dependencies. In addition, > this should help move TIKA-1301 forward and should leverage the suggestion > made by [~lewismc] of a script which can pull down the latest version of > Apache Tika. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-1518) Docker with Tika Server
[ https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390216#comment-16390216 ] Tim Allison commented on TIKA-1518: --- bq. this is me getting too excited ?! I do have Docker installed, but it is Windows, and I've noticed some, um, areas for improvement in Docker on Windows. Thank you! > Docker with Tika Server > --- > > Key: TIKA-1518 > URL: https://issues.apache.org/jira/browse/TIKA-1518 > Project: Tika > Issue Type: New Feature >Reporter: Paul Ramirez >Assignee: Dave Meikle >Priority: Major > Fix For: 2.0, 1.17 > > > This version should be able to demonstrate as many of Apache Tika's > capabilities as possible. For instance with GDAL, Tesseract, and FFmpeg to > show parsers which require installation of other dependencies. In addition, > this should help move TIKA-1301 forward and should leverage the suggestion > made by [~lewismc] of a script which can pull down the latest version of > Apache Tika. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (TIKA-1518) Docker with Tika Server
[ https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390202#comment-16390202 ] Dave Meikle edited comment on TIKA-1518 at 3/7/18 8:51 PM: --- Sorry [~talli...@mitre.org] - this is me getting too excited. I'll need to remove it from being hooked on the "build" phase so those without Docker can build without this! Will do this just now. was (Author: davemeikle): Sorry [~talli...@mitre.org] - this is me getting too excited. I'll need to remove it from being hooked on the "build" phase so those without Docker can build without this! Will do this just now. > Docker with Tika Server > --- > > Key: TIKA-1518 > URL: https://issues.apache.org/jira/browse/TIKA-1518 > Project: Tika > Issue Type: New Feature >Reporter: Paul Ramirez >Assignee: Dave Meikle >Priority: Major > Fix For: 2.0, 1.17 > > > This version should be able to demonstrate as many of Apache Tika's > capabilities as possible. For instance with GDAL, Tesseract, and FFmpeg to > show parsers which require installation of other dependencies. In addition, > this should help move TIKA-1301 forward and should leverage the suggestion > made by [~lewismc] of a script which can pull down the latest version of > Apache Tika. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-1518) Docker with Tika Server
[ https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390202#comment-16390202 ] Dave Meikle commented on TIKA-1518: --- Sorry [~talli...@mitre.org] - this is me getting too excited. I'll need to remove it from being hooked on the "build" phase so those without Docker can build without this! Will do this just now. > Docker with Tika Server > --- > > Key: TIKA-1518 > URL: https://issues.apache.org/jira/browse/TIKA-1518 > Project: Tika > Issue Type: New Feature >Reporter: Paul Ramirez >Assignee: Dave Meikle >Priority: Major > Fix For: 2.0, 1.17 > > > This version should be able to demonstrate as many of Apache Tika's > capabilities as possible. For instance with GDAL, Tesseract, and FFmpeg to > show parsers which require installation of other dependencies. In addition, > this should help move TIKA-1301 forward and should leverage the suggestion > made by [~lewismc] of a script which can pull down the latest version of > Apache Tika. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2594) Mail detected as application/xhtml+xml
[ https://issues.apache.org/jira/browse/TIKA-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390197#comment-16390197 ] Tim Allison commented on TIKA-2594: --- [~lfcnassif], I added the mime defs you suggested above just now to both 2.0.0 and 1.18. Thank you! > Mail detected as application/xhtml+xml > -- > > Key: TIKA-2594 > URL: https://issues.apache.org/jira/browse/TIKA-2594 > Project: Tika > Issue Type: Bug >Affects Versions: 2.0, 1.16, 1.17 >Reporter: Andreas Meier >Priority: Major > Fix For: 1.18, 2.0.0 > > Attachments: TestMail_inline_xhtml_plus_image.eml > > > The attached mail (message/rfc822) with inline xhtml is recognized as > application/xhtml+xml > Regards > Andreas -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2591) Some tiffs (Big Endian with fax compression) are showing up as x-tarr
[ https://issues.apache.org/jira/browse/TIKA-2591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390182#comment-16390182 ] Hudson commented on TIKA-2591: -- FAILURE: Integrated in Jenkins build tika-2.x-windows #214 (See [https://builds.apache.org/job/tika-2.x-windows/214/]) TIKA-2591 -- Add workaround to identify TIFFs that might confuse (tallison: rev 462ee4744fd426cfdb12539435627b25e789c912) * (edit) CHANGES.txt * (add) tika-parsers/src/test/java/org/apache/tika/parser/pkg/ZipContainerDetectorTest.java * (edit) tika-parsers/src/main/java/org/apache/tika/parser/pkg/ZipContainerDetector.java > Some tiffs (Big Endian with fax compression) are showing up as x-tarr > - > > Key: TIKA-2591 > URL: https://issues.apache.org/jira/browse/TIKA-2591 > Project: Tika > Issue Type: Bug > Components: core >Affects Versions: 1.16 > Environment: Tika, running in a java application and a unit-test > (windows and mac environments) >Reporter: daniel schmidt >Priority: Major > Labels: newbie > Fix For: 1.18, 2.0.0 > > Original Estimate: 24h > Remaining Estimate: 24h > > I have found that a certain tiff that we manage is now reporting > application/x-tar in Tika where it previously reported as a tiff > (image/tiff). > Observe this code in ArchiveStreamFactory, detect method.
> // COMPRESS-117 - improve auto-recognition
> if (signatureLength >= TAR_HEADER_SIZE) {
>     TarArchiveInputStream tais = null;
>     try {
>         tais = new TarArchiveInputStream(new ByteArrayInputStream(tarHeader));
>         // COMPRESS-191 - verify the header checksum
>         if (tais.getNextTarEntry().isCheckSumOK()) {
>             return TAR;
>         }
>     } catch (final Exception e) { // NOPMD // NOSONAR
>         // can generate IllegalArgumentException as well
>         // as IOException
>         // autodetection, simply not a TAR
>         // ignored
>     } finally {
>         IOUtils.closeQuietly(tais);
>     }
> }
>
> What I find is that most TIFFs, when they get to tais.getNextTarEntry(), fail with an exception (i.e. fall into the "simply not a tar" case). However, this TIFF actually does NOT fail here. This somewhat makes sense, as the internal structure of a fax-compressed TIFF has a tar-like structure.
>
> Note, the CompositeDetector class eventually does recognize it as a proper TIFF as it loops through its detectors in its detect method. It is detected as TIFF in the MimeTypes class, which is one of the implementations of the Detector interface.
>
> public MediaType detect(InputStream input, Metadata metadata)
>         throws IOException {
>     MediaType type = MediaType.OCTET_STREAM;
>     for (Detector detector : getDetectors()) {
>         //short circuit via OverrideDetector
>         //can't rely on ordering because subsequent detector may
>         //change Override's to a specialization of Override's
>         if (detector instanceof OverrideDetector &&
>                 metadata.get(TikaCoreProperties.CONTENT_TYPE_OVERRIDE) != null) {
>             return detector.detect(input, metadata);
>         }
>         MediaType detected = detector.detect(input, metadata);
>         if (registry.isSpecializationOf(detected, type)) {
>             type = detected;
>         }
>     }
>     return type;
> }
>
> However, since image/tiff isn't a specialization of application/x-tar, it does not replace the type with image/tiff.
> My fix was to add a "" to the > definition for image/tiff in the tika-mimetypes.xml file > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2590) ExcelExtractor: cannot choose listening to the selected records only
[ https://issues.apache.org/jira/browse/TIKA-2590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390181#comment-16390181 ] Hudson commented on TIKA-2590: -- FAILURE: Integrated in Jenkins build tika-2.x-windows #214 (See [https://builds.apache.org/job/tika-2.x-windows/214/]) TIKA-2590: restore the client's ability to choose what Excel file (g.alekseev: rev c56c7c41a6c51e4cd4dac78b693bd883f1329264) * (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ExcelExtractor.java TIKA-2590 update Changes.txt (tallison: rev 947334cbf40bc6efef1cb488749213724bedb171) * (edit) CHANGES.txt > ExcelExtractor: cannot choose listening to the selected records only > > > Key: TIKA-2590 > URL: https://issues.apache.org/jira/browse/TIKA-2590 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.17 >Reporter: Grigoriy Alekseev >Priority: Critical > Fix For: 1.18, 2.0.0 > > > The listenForAllRecords argument is being always reset to 'true', so the > 'else' branch is never reached. It may cause incorrect text extraction when > records with certain unsupported types (e.g. SharedFormula) are present in a > file. 
> {code:java}
> public void processFile(DirectoryNode root, boolean listenForAllRecords)
>         throws IOException, SAXException, TikaException {
>     // Set up listener and register the records we want to process
>     HSSFRequest hssfRequest = new HSSFRequest();
>     listenForAllRecords = true;
>     if (listenForAllRecords) {
>         hssfRequest.addListenerForAllRecords(formatListener);
>     } else {
>         hssfRequest.addListener(formatListener, BOFRecord.sid);
>         hssfRequest.addListener(formatListener, EOFRecord.sid);
>         hssfRequest.addListener(formatListener, DateWindow1904Record.sid);
>         hssfRequest.addListener(formatListener, CountryRecord.sid);
>         hssfRequest.addListener(formatListener, BoundSheetRecord.sid);
>         hssfRequest.addListener(formatListener, SSTRecord.sid);
>         hssfRequest.addListener(formatListener, FormulaRecord.sid);
>         hssfRequest.addListener(formatListener, LabelRecord.sid);
>         hssfRequest.addListener(formatListener, LabelSSTRecord.sid);
>         hssfRequest.addListener(formatListener, NumberRecord.sid);
>         hssfRequest.addListener(formatListener, RKRecord.sid);
>         hssfRequest.addListener(formatListener, StringRecord.sid);
>         hssfRequest.addListener(formatListener, HyperlinkRecord.sid);
>         hssfRequest.addListener(formatListener, TextObjectRecord.sid);
>         hssfRequest.addListener(formatListener, SeriesTextRecord.sid);
>         hssfRequest.addListener(formatListener, FormatRecord.sid);
>         hssfRequest.addListener(formatListener, ExtendedFormatRecord.sid);
>         hssfRequest.addListener(formatListener, DrawingGroupRecord.sid);
>         if (extractor.officeParserConfig.getIncludeHeadersAndFooters()) {
>             hssfRequest.addListener(formatListener, HeaderRecord.sid);
>             hssfRequest.addListener(formatListener, FooterRecord.sid);
>         }
>     }
> {code}
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (TIKA-2527) Typos in tika-mimetypes.xml
[ https://issues.apache.org/jira/browse/TIKA-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-2527. --- Resolution: Fixed Fix Version/s: 2.0.0 1.18 Thank you, again, [~AndreasMeier]! > Typos in tika-mimetypes.xml > --- > > Key: TIKA-2527 > URL: https://issues.apache.org/jira/browse/TIKA-2527 > Project: Tika > Issue Type: Bug > Components: core >Affects Versions: 2.0, 1.16, 1.17, 1.18 > Environment: ALL >Reporter: Andreas Meier >Priority: Minor > Fix For: 1.18, 2.0.0 > > Attachments: enhancement-for-TIKA2527-contributed-by-AMeier.patch, > fix-for-TIKA2527-contributed-by-AMeier-Fixed-adpcmmi.patch, > fix-for-binhexmatch-TIKA2527-contributed-by-AMeier.patch > > > Are these mimetypes in tika-mimetypes.xml > audio/x-adbcm instead audio/x-adpcm > {code:xml} {code} > and > audio/x-dec-adbcm instead audio/x-dec-adpcm > {code:xml} {code} > intended? > Couldn't find these mimetypes. > Regards > Andreas -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2592) HTML with charset unicode handled as utf-16 instead utf-8
[ https://issues.apache.org/jira/browse/TIKA-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390150#comment-16390150 ] Hudson commented on TIKA-2592: -- SUCCESS: Integrated in Jenkins build Tika-trunk #1452 (See [https://builds.apache.org/job/Tika-trunk/1452/]) TIKA-2592 -- ignore charsets not supported by IANA in html meta-headers (tallison: [https://github.com/apache/tika/commit/7e2b1e7534268b40c8b4ef3ee20ed708bf2e383c]) * (add) tika-parsers/src/test/resources/test-documents/testHTML_charset_utf8.html * (add) tika-parsers/src/main/resources/org/apache/tika/parser/html/StandardCharsets_unsupported_by_IANA.txt * (edit) tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlEncodingDetector.java * (add) tika-parsers/src/test/resources/test-documents/testHTML_charset_utf16le.html * (edit) tika-parsers/src/test/java/org/apache/tika/parser/html/HtmlParserTest.java * (edit) CHANGES.txt > HTML with charset unicode handled as utf-16 instead utf-8 > - > > Key: TIKA-2592 > URL: https://issues.apache.org/jira/browse/TIKA-2592 > Project: Tika > Issue Type: Improvement >Affects Versions: 2.0, 1.16, 1.17 >Reporter: Andreas Meier >Priority: Minor > Fix For: 1.18, 2.0.0 > > Attachments: IANA Charset names.txt, > StandardCharsets_unsupported_by_IANA.txt, TestCharsetUnicodeHTML.html, > TestHTMLCharsetArabicCP1256.html, TestHTMLCharsetCP1256.html, > fix-for-TIKA2592-contributed-by-Andreas-Meier.patch > > > HTML files are detected as utf-16 when meta content is set to "unicode". > {code:XML} > > {code} > > Shouldn't the default be utf-8? > The attached sample file is shown correctly in: > Chromium Version 55.0.2883.75 > Firefox 50.1.0 > IE 11 > I am aware that there is no charset "unicode" (available character encodings: > [http://www.iana.org/assignments/character-sets/character-sets.xhtml|http://www.iana.org/assignments/character-sets/character-sets.xhtml]) > Unfortunately there are many wrong encodings used out there. 
> All unknown encodings should be validated or at least be set to default utf-8. > Regards > Andreas -- This message was sent by Atlassian JIRA (v7.6.3#76005)
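The fallback requested in this report can be sketched roughly as follows. This is only an illustration of the reporter's suggestion (default to UTF-8), not the code of Tika's HtmlEncodingDetector. The class name, the method name and the one-entry deny-list (standing in for the attached StandardCharsets_unsupported_by_IANA.txt) are assumptions made for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Tika.
final class MetaCharsetResolver {
    // Assumed deny-list: labels a JVM may accept as aliases although
    // they are not registered IANA charset names.
    private static final Set<String> NON_IANA_LABELS = Set.of("unicode");

    // Maps a label found in an HTML meta header to a Charset,
    // falling back to UTF-8 for missing, deny-listed or unknown labels.
    static Charset resolve(String label) {
        if (label == null) {
            return StandardCharsets.UTF_8;
        }
        String normalized = label.trim().toLowerCase(Locale.ROOT);
        if (normalized.isEmpty() || NON_IANA_LABELS.contains(normalized)) {
            return StandardCharsets.UTF_8;
        }
        try {
            return Charset.forName(normalized);
        } catch (IllegalArgumentException e) {
            // Covers IllegalCharsetNameException and UnsupportedCharsetException,
            // which both extend IllegalArgumentException.
            return StandardCharsets.UTF_8;
        }
    }
}
```

With this sketch, `MetaCharsetResolver.resolve("unicode")` yields UTF-8 instead of whatever the JVM would map the alias to. A real detector might prefer to ignore the label and let other detectors decide rather than force UTF-8.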
[jira] [Resolved] (TIKA-2590) ExcelExtractor: cannot choose listening to the selected records only
[ https://issues.apache.org/jira/browse/TIKA-2590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-2590. --- Resolution: Fixed Fix Version/s: 1.18 > ExcelExtractor: cannot choose listening to the selected records only > > > Key: TIKA-2590 > URL: https://issues.apache.org/jira/browse/TIKA-2590 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.17 >Reporter: Grigoriy Alekseev >Priority: Critical > Fix For: 1.18, 2.0.0 > > > The listenForAllRecords argument is being always reset to 'true', so the > 'else' branch is never reached. It may cause incorrect text extraction when > records with certain unsupported types (e.g. SharedFormula) are present in a > file. > {code:java} > public void processFile(DirectoryNode root, boolean > listenForAllRecords) > throws IOException, SAXException, TikaException { > // Set up listener and register the records we want to process > HSSFRequest hssfRequest = new HSSFRequest(); > listenForAllRecords = true; > if (listenForAllRecords) { > hssfRequest.addListenerForAllRecords(formatListener); > } else { > hssfRequest.addListener(formatListener, BOFRecord.sid); > hssfRequest.addListener(formatListener, EOFRecord.sid); > hssfRequest.addListener(formatListener, > DateWindow1904Record.sid); > hssfRequest.addListener(formatListener, CountryRecord.sid); > hssfRequest.addListener(formatListener, BoundSheetRecord.sid); > hssfRequest.addListener(formatListener, SSTRecord.sid); > hssfRequest.addListener(formatListener, FormulaRecord.sid); > hssfRequest.addListener(formatListener, LabelRecord.sid); > hssfRequest.addListener(formatListener, LabelSSTRecord.sid); > hssfRequest.addListener(formatListener, NumberRecord.sid); > hssfRequest.addListener(formatListener, RKRecord.sid); > hssfRequest.addListener(formatListener, StringRecord.sid); > hssfRequest.addListener(formatListener, HyperlinkRecord.sid); > hssfRequest.addListener(formatListener, TextObjectRecord.sid); > 
hssfRequest.addListener(formatListener, SeriesTextRecord.sid); > hssfRequest.addListener(formatListener, FormatRecord.sid); > hssfRequest.addListener(formatListener, > ExtendedFormatRecord.sid); > hssfRequest.addListener(formatListener, > DrawingGroupRecord.sid); > if > (extractor.officeParserConfig.getIncludeHeadersAndFooters()) { > hssfRequest.addListener(formatListener, HeaderRecord.sid); > hssfRequest.addListener(formatListener, FooterRecord.sid); > } > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
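The effect described in this report can be reproduced in isolation. The sketch below uses a hypothetical class with no POI dependency; it only shows why overwriting the parameter makes the 'else' branch unreachable, and it assumes the fix is simply to drop the reassignment.

```java
// Hypothetical demo class, unrelated to the real ExcelExtractor.
final class ListenerSelectionDemo {
    // Mirrors the reported bug: the argument is overwritten,
    // so the caller's choice is lost and the 'else' path never runs.
    static String buggy(boolean listenForAllRecords) {
        listenForAllRecords = true;
        if (listenForAllRecords) {
            return "all-records";
        } else {
            return "selected-records";
        }
    }

    // Same logic without the reassignment: the caller's choice is honored.
    static String fixed(boolean listenForAllRecords) {
        if (listenForAllRecords) {
            return "all-records";
        } else {
            return "selected-records";
        }
    }
}
```

Calling `buggy(false)` still takes the "all-records" path, while `fixed(false)` reaches the selected-records branch.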
[jira] [Resolved] (TIKA-2591) Some tiffs (Big Endian with fax compression) are showing up as x-tarr
[ https://issues.apache.org/jira/browse/TIKA-2591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison resolved TIKA-2591.
---
Resolution: Fixed
Fix Version/s: 2.0.0

Thank you [~schmiddc] and [~gagravarr]!

> Some tiffs (Big Endian with fax compression) are showing up as x-tarr
> ---
>
> Key: TIKA-2591
> URL: https://issues.apache.org/jira/browse/TIKA-2591
> Project: Tika
> Issue Type: Bug
> Components: core
> Affects Versions: 1.16
> Environment: Tika, running in a java application and a unit-test (windows and mac environments)
> Reporter: daniel schmidt
> Priority: Major
> Labels: newbie
> Fix For: 1.18, 2.0.0
>
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> I have found that a certain TIFF that we manage is now reported as
> application/x-tar by Tika, where it was previously reported as a TIFF
> (image/tiff).
> Observe this code in the detect method of ArchiveStreamFactory:
>
> // COMPRESS-117 - improve auto-recognition
> if (signatureLength >= TAR_HEADER_SIZE) {
>     TarArchiveInputStream tais = null;
>     try {
>         tais = new TarArchiveInputStream(new ByteArrayInputStream(tarHeader));
>         // COMPRESS-191 - verify the header checksum
>         if (tais.getNextTarEntry().isCheckSumOK()) {
>             return TAR;
>         }
>     } catch (final Exception e) { // NOPMD // NOSONAR
>         // can generate IllegalArgumentException as well
>         // as IOException
>         // autodetection, simply not a TAR
>         // ignored
>     } finally {
>         IOUtils.closeQuietly(tais);
>     }
>
> What I find is that most TIFFs fail with an exception when they get to
> tais.getNextTarEntry() (i.e. they fall into the "simply not a TAR" case).
> However, this TIFF does NOT fail here. This somewhat makes sense, as a
> fax-compressed TIFF has a tar-like internal structure.
> Note that the CompositeDetector class does eventually recognize it as a
> proper TIFF as it loops through its detectors in its detect method. It is
> detected as TIFF in the MimeTypes class, which is one of the
> implementations of the Detector interface.
>
> public MediaType detect(InputStream input, Metadata metadata)
>         throws IOException {
>     MediaType type = MediaType.OCTET_STREAM;
>     for (Detector detector : getDetectors()) {
>         //short circuit via OverrideDetector
>         //can't rely on ordering because subsequent detector may
>         //change Override's to a specialization of Override's
>         if (detector instanceof OverrideDetector &&
>                 metadata.get(TikaCoreProperties.CONTENT_TYPE_OVERRIDE) != null) {
>             return detector.detect(input, metadata);
>         }
>         MediaType detected = detector.detect(input, metadata);
>         if (registry.isSpecializationOf(detected, type)) {
>             type = detected;
>         }
>     }
>     return type;
>
> However, since image/tiff isn't a specialization of application/x-tar, the
> loop does not replace the type with image/tiff.
> My fix was to add a "" to the definition for image/tiff in the
> tika-mimetypes.xml file.
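The selection rule in the detect loop quoted above can be illustrated with a toy model. The sketch below is not Tika's MediaTypeRegistry: the class, its string-based type names and the `declare` method are hypothetical, and the "nested" hierarchy in the usage note is only an assumed configuration used to show how the loop reacts to a declared parent type.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy registry; assumes the declared hierarchy has no cycles.
final class SpecializationDemo {
    static final String OCTET_STREAM = "application/octet-stream";

    // child type -> declared parent type
    private final Map<String, String> parents = new HashMap<>();

    SpecializationDemo declare(String child, String parent) {
        parents.put(child, parent);
        return this;
    }

    // True if 'type' is a strict descendant of 'ancestor'.
    boolean isSpecializationOf(String type, String ancestor) {
        String current = parents.get(type);
        while (current != null) {
            if (current.equals(ancestor)) {
                return true;
            }
            current = parents.get(current);
        }
        return false;
    }

    // Same shape as the quoted loop: a later result only wins
    // if it specializes the type chosen so far.
    String pick(List<String> detectedInOrder) {
        String type = OCTET_STREAM;
        for (String detected : detectedInOrder) {
            if (isSpecializationOf(detected, type)) {
                type = detected;
            }
        }
        return type;
    }
}
```

If both types are declared as direct children of application/octet-stream and the tar result arrives first, `pick` keeps application/x-tar, because image/tiff does not specialize it. If image/tiff were declared under application/x-tar in this toy registry, the same input would end with image/tiff.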
[jira] [Resolved] (TIKA-2592) HTML with charset unicode handled as utf-16 instead utf-8
[ https://issues.apache.org/jira/browse/TIKA-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-2592. --- Resolution: Fixed Fix Version/s: 2.0.0 1.18 Thank you [~AndreasMeier] and [~kkrugler]! > HTML with charset unicode handled as utf-16 instead utf-8 > - > > Key: TIKA-2592 > URL: https://issues.apache.org/jira/browse/TIKA-2592 > Project: Tika > Issue Type: Improvement >Affects Versions: 2.0, 1.16, 1.17 >Reporter: Andreas Meier >Priority: Minor > Fix For: 1.18, 2.0.0 > > Attachments: IANA Charset names.txt, > StandardCharsets_unsupported_by_IANA.txt, TestCharsetUnicodeHTML.html, > TestHTMLCharsetArabicCP1256.html, TestHTMLCharsetCP1256.html, > fix-for-TIKA2592-contributed-by-Andreas-Meier.patch > > > HTML files are detected as utf-16 when meta content is set to "unicode". > {code:XML} > > {code} > > Shouldn't the default be utf-8? > The attached sample file is shown correctly in: > Chromium Version 55.0.2883.75 > Firefox 50.1.0 > IE 11 > I am aware that there is no charset "unicode" (available character encodings: > [http://www.iana.org/assignments/character-sets/character-sets.xhtml|http://www.iana.org/assignments/character-sets/character-sets.xhtml]) > Unfortunately there are many wrong encodings used out there. > All unknown encodings should be validated or at least be set to default utf-8. > Regards > Andreas -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (TIKA-2594) Mail detected as application/xhtml+xml
[ https://issues.apache.org/jira/browse/TIKA-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-2594. --- Resolution: Fixed Fix Version/s: 2.0.0 1.18 Thank you, [~AndreasMeier]! > Mail detected as application/xhtml+xml > -- > > Key: TIKA-2594 > URL: https://issues.apache.org/jira/browse/TIKA-2594 > Project: Tika > Issue Type: Bug >Affects Versions: 2.0, 1.16, 1.17 >Reporter: Andreas Meier >Priority: Major > Fix For: 1.18, 2.0.0 > > Attachments: TestMail_inline_xhtml_plus_image.eml > > > The attached mail (message/rfc822) with inline xhtml is recognized as > application/xhtml+xml > Regards > Andreas -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2590) ExcelExtractor: cannot choose listening to the selected records only
[ https://issues.apache.org/jira/browse/TIKA-2590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390107#comment-16390107 ] ASF GitHub Bot commented on TIKA-2590: -- tballison closed pull request #225: TIKA-2590: restore the client's ability to choose what Excel file rec… URL: https://github.com/apache/tika/pull/225 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ExcelExtractor.java b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ExcelExtractor.java index 9146b8c7b..4ea8068de 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ExcelExtractor.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ExcelExtractor.java @@ -284,7 +284,6 @@ public void processFile(DirectoryNode root, boolean listenForAllRecords) // Set up listener and register the records we want to process HSSFRequest hssfRequest = new HSSFRequest(); -listenForAllRecords = true; if (listenForAllRecords) { hssfRequest.addListenerForAllRecords(formatListener); } else { This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > ExcelExtractor: cannot choose listening to the selected records only > > > Key: TIKA-2590 > URL: https://issues.apache.org/jira/browse/TIKA-2590 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.17 >Reporter: Grigoriy Alekseev >Priority: Critical > Fix For: 2.0.0 > > > The listenForAllRecords argument is being always reset to 'true', so the > 'else' branch is never reached. 
It may cause incorrect text extraction when > records with certain unsupported types (e.g. SharedFormula) are present in a > file. > {code:java} > public void processFile(DirectoryNode root, boolean > listenForAllRecords) > throws IOException, SAXException, TikaException { > // Set up listener and register the records we want to process > HSSFRequest hssfRequest = new HSSFRequest(); > listenForAllRecords = true; > if (listenForAllRecords) { > hssfRequest.addListenerForAllRecords(formatListener); > } else { > hssfRequest.addListener(formatListener, BOFRecord.sid); > hssfRequest.addListener(formatListener, EOFRecord.sid); > hssfRequest.addListener(formatListener, > DateWindow1904Record.sid); > hssfRequest.addListener(formatListener, CountryRecord.sid); > hssfRequest.addListener(formatListener, BoundSheetRecord.sid); > hssfRequest.addListener(formatListener, SSTRecord.sid); > hssfRequest.addListener(formatListener, FormulaRecord.sid); > hssfRequest.addListener(formatListener, LabelRecord.sid); > hssfRequest.addListener(formatListener, LabelSSTRecord.sid); > hssfRequest.addListener(formatListener, NumberRecord.sid); > hssfRequest.addListener(formatListener, RKRecord.sid); > hssfRequest.addListener(formatListener, StringRecord.sid); > hssfRequest.addListener(formatListener, HyperlinkRecord.sid); > hssfRequest.addListener(formatListener, TextObjectRecord.sid); > hssfRequest.addListener(formatListener, SeriesTextRecord.sid); > hssfRequest.addListener(formatListener, FormatRecord.sid); > hssfRequest.addListener(formatListener, > ExtendedFormatRecord.sid); > hssfRequest.addListener(formatListener, > DrawingGroupRecord.sid); > if > (extractor.officeParserConfig.getIncludeHeadersAndFooters()) { > hssfRequest.addListener(formatListener, HeaderRecord.sid); > hssfRequest.addListener(formatListener, FooterRecord.sid); > } > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2592) HTML with charset unicode handled as utf-16 instead utf-8
[ https://issues.apache.org/jira/browse/TIKA-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390087#comment-16390087 ] Hudson commented on TIKA-2592: -- FAILURE: Integrated in Jenkins build tika-2.x-windows #213 (See [https://builds.apache.org/job/tika-2.x-windows/213/]) TIKA-2592 -- ignore charsets not supported by IANA in html meta-headers (tallison: rev 7e2b1e7534268b40c8b4ef3ee20ed708bf2e383c) * (edit) tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlEncodingDetector.java * (add) tika-parsers/src/test/resources/test-documents/testHTML_charset_utf8.html * (edit) CHANGES.txt * (add) tika-parsers/src/main/resources/org/apache/tika/parser/html/StandardCharsets_unsupported_by_IANA.txt * (add) tika-parsers/src/test/resources/test-documents/testHTML_charset_utf16le.html * (edit) tika-parsers/src/test/java/org/apache/tika/parser/html/HtmlParserTest.java > HTML with charset unicode handled as utf-16 instead utf-8 > - > > Key: TIKA-2592 > URL: https://issues.apache.org/jira/browse/TIKA-2592 > Project: Tika > Issue Type: Improvement >Affects Versions: 2.0, 1.16, 1.17 >Reporter: Andreas Meier >Priority: Minor > Attachments: IANA Charset names.txt, > StandardCharsets_unsupported_by_IANA.txt, TestCharsetUnicodeHTML.html, > TestHTMLCharsetArabicCP1256.html, TestHTMLCharsetCP1256.html, > fix-for-TIKA2592-contributed-by-Andreas-Meier.patch > > > HTML files are detected as utf-16 when meta content is set to "unicode". > {code:XML} > > {code} > > Shouldn't the default be utf-8? > The attached sample file is shown correctly in: > Chromium Version 55.0.2883.75 > Firefox 50.1.0 > IE 11 > I am aware that there is no charset "unicode" (available character encodings: > [http://www.iana.org/assignments/character-sets/character-sets.xhtml|http://www.iana.org/assignments/character-sets/character-sets.xhtml]) > Unfortunately there are many wrong encodings used out there. > All unknown encodings should be validated or at least be set to default utf-8. 
> Regards > Andreas -- This message was sent by Atlassian JIRA (v7.6.3#76005)
tika-2.x-windows - Build # 213 - Still Failing
The Apache Jenkins build system has built tika-2.x-windows (build #213) Status: Still Failing Check console output at https://builds.apache.org/job/tika-2.x-windows/213/ to view the results.
[jira] [Commented] (TIKA-2594) Mail detected as application/xhtml+xml
[ https://issues.apache.org/jira/browse/TIKA-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390059#comment-16390059 ] Hudson commented on TIKA-2594: -- SUCCESS: Integrated in Jenkins build Tika-trunk #1451 (See [https://builds.apache.org/job/Tika-trunk/1451/]) TIKA-2594 -- improve eml detection for those starting with Subject: and (tallison: [https://github.com/apache/tika/commit/09031046e5bece75ed22d9ee9b184ec49a14f99a]) * (add) tika-parsers/src/test/resources/test-documents/testEML_embedded_xhtml_and_img.eml * (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml * (edit) tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java > Mail detected as application/xhtml+xml > -- > > Key: TIKA-2594 > URL: https://issues.apache.org/jira/browse/TIKA-2594 > Project: Tika > Issue Type: Bug >Affects Versions: 2.0, 1.16, 1.17 >Reporter: Andreas Meier >Priority: Major > Attachments: TestMail_inline_xhtml_plus_image.eml > > > The attached mail (message/rfc822) with inline xhtml is recognized as > application/xhtml+xml > Regards > Andreas -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2594) Mail detected as application/xhtml+xml
[ https://issues.apache.org/jira/browse/TIKA-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390028#comment-16390028 ] Luis Filipe Nassif commented on TIKA-2594: -- We have used that magic restricted to 0:1000 for a long time, with very few false positives, along with: {code} {code} > Mail detected as application/xhtml+xml > -- > > Key: TIKA-2594 > URL: https://issues.apache.org/jira/browse/TIKA-2594 > Project: Tika > Issue Type: Bug >Affects Versions: 2.0, 1.16, 1.17 >Reporter: Andreas Meier >Priority: Major > Attachments: TestMail_inline_xhtml_plus_image.eml > > > The attached mail (message/rfc822) with inline xhtml is recognized as > application/xhtml+xml > Regards > Andreas -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2594) Mail detected as application/xhtml+xml
[ https://issues.apache.org/jira/browse/TIKA-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389944#comment-16389944 ] Hudson commented on TIKA-2594: -- FAILURE: Integrated in Jenkins build tika-2.x-windows #212 (See [https://builds.apache.org/job/tika-2.x-windows/212/]) TIKA-2594 -- improve eml detection for those starting with Subject: and (tallison: rev 09031046e5bece75ed22d9ee9b184ec49a14f99a) * (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml * (edit) tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java * (add) tika-parsers/src/test/resources/test-documents/testEML_embedded_xhtml_and_img.eml > Mail detected as application/xhtml+xml > -- > > Key: TIKA-2594 > URL: https://issues.apache.org/jira/browse/TIKA-2594 > Project: Tika > Issue Type: Bug >Affects Versions: 2.0, 1.16, 1.17 >Reporter: Andreas Meier >Priority: Major > Attachments: TestMail_inline_xhtml_plus_image.eml > > > The attached mail (message/rfc822) with inline xhtml is recognized as > application/xhtml+xml > Regards > Andreas -- This message was sent by Atlassian JIRA (v7.6.3#76005)
tika-2.x-windows - Build # 212 - Still Failing
The Apache Jenkins build system has built tika-2.x-windows (build #212) Status: Still Failing Check console output at https://builds.apache.org/job/tika-2.x-windows/212/ to view the results.
[jira] [Commented] (TIKA-1466) Enable overriding of mimetype glob pattern definitions
[ https://issues.apache.org/jira/browse/TIKA-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389938#comment-16389938 ]

Luis Filipe Nassif commented on TIKA-1466:
---

I thought about logging every custom-mimetype override that is applied, so the user is warned about it. Maybe we could additionally add a specific attribute to the mimetype definition XML to state that it must override the default definition instead of aborting.

As for multiple conflicting custom mimes from different (external) projects, Tika currently aborts, so that is already a problem today. I think it needs additional discussion and should not be addressed in the next release. I will copy/paste this discussion into the JIRA issue.

I would still like to see the detection of MTS videos fixed, but it conflicts with another existing mime glob. Is there any workaround for this specific case? If so, I can open a different ticket.

> Enable overriding of mimetype glob pattern definitions
> ---
>
> Key: TIKA-1466
> URL: https://issues.apache.org/jira/browse/TIKA-1466
> Project: Tika
> Issue Type: Improvement
> Components: mime
> Affects Versions: 1.6
> Reporter: Luis Filipe Nassif
> Priority: Major
>
> I think it is important to enable overriding of the default
> tika-mimetypes.xml glob pattern definitions within a custom-mimetypes.xml.
> Currently, you cannot define an already used glob pattern in a custom
> mimetype, even if you redefine in custom-mimetypes.xml the first mimetype
> using the conflicting glob pattern. The same extension can be used by
> different applications in different domains or datasets.
[jira] [Created] (TIKA-2601) Invalid XHTML output for some WORD documents
Filip created TIKA-2601:
---
Summary: Invalid XHTML output for some WORD documents
Key: TIKA-2601
URL: https://issues.apache.org/jira/browse/TIKA-2601
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.17
Environment: Linked is a sample document with its corresponding output.
Reporter: Filip
Attachments: Test.doc, test.html

In some WORD (.doc, .docx) documents the XHTML elements are not closed properly. This usually happens when there are link elements () as well as italic or bold elements ().

The fix should be done in [https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java]
[jira] [Comment Edited] (TIKA-1020) Excel 2010 parser missing cell values are not reported resulting in missing columns values
[ https://issues.apache.org/jira/browse/TIKA-1020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389884#comment-16389884 ] Radim Rehurek edited comment on TIKA-1020 at 3/7/18 5:57 PM: - We just hit this bug too. I say "bug" because Excel spreadsheets are really structured tables, just like [~arodkin] explained 5 years ago. Interpreting them as unorganized cells makes little sense. [~tpalsulich] IMO empty rows could be reported too, but in our use-case, the critical thing is not to have jumbled records (caused by missing cells in a single row). was (Author: piskvorky): We just hit this bug too. I say "bug" because Excel spreadsheets are really structured tables, just like [~arodkin] explained 5 years ago. Interpreting them as unorganized cells makes little sense. [~tpalsulich] empty rows could be reported too, but in our use-case, the critical thing is not to have jumbled records (caused by missing cells in a single row). > Excel 2010 parser missing cell values are not reported resulting in missing > columns values > -- > > Key: TIKA-1020 > URL: https://issues.apache.org/jira/browse/TIKA-1020 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.2 > Environment: java 1.6 & 1.7 >Reporter: Neil Blue >Priority: Major > Labels: newbie, patch > > When parting an excel 2010 table, if a worksheet has a missing value, then it > is not reported in the sax handler. As a result a missing value can result in > unordered data. > For example given the table: > {code:title=Bar.java|borderStyle=solid} > A B B > 1 2 3 > 4 6 > 7 8 9 > {code} > the returned sax handler reports elements > {code:title=Bar.java|borderStyle=solid} > ABC > 123 > 46 > 789 > {code} > As a result the handler can detect that the third row as incomplete cell > values but it is ambiguous which columns have missing data. > As a possible fix for this excel 2010 xml data contains the cell reference > value, which could be returned to the sax handler as an attribute. 
> {code:title=Bar.java|borderStyle=solid} > *** XSSFExcelExtractorDecorator.java2012-11-08 10:51:55.881207100 + > --- XSSFExcelExtractorDecorator.java.1 2012-11-08 10:59:02.972223700 + > *** > *** 200,206 > > public void cell(String cellRef, String formattedValue) { > try { > ! xhtml.startElement("td"); > >// Main cell contents >xhtml.characters(formattedValue); > --- 200,208 > > public void cell(String cellRef, String formattedValue) { > try { > ! AttributesImpl attributes = new AttributesImpl(); > ! attributes.addAttribute(null, "cellRef", "cellRef", null, > cellRef) ; > ! xhtml.startElement("td",attributes); > >// Main cell contents >xhtml.characters(formattedValue); > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (TIKA-1020) Excel 2010 parser missing cell values are not reported resulting in missing columns values
[ https://issues.apache.org/jira/browse/TIKA-1020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389884#comment-16389884 ] Radim Rehurek edited comment on TIKA-1020 at 3/7/18 5:57 PM: - We just hit this bug too. I say "bug" because Excel spreadsheets are really structured tables, just like [~arodkin] explained 5 years ago. Interpreting them as unorganized cells makes little sense. [~tpalsulich] IMO empty rows could be reported too, but in our use-case, the critical thing is not to have jumbled records caused by empty cells in a single row. was (Author: piskvorky): We just hit this bug too. I say "bug" because Excel spreadsheets are really structured tables, just like [~arodkin] explained 5 years ago. Interpreting them as unorganized cells makes little sense. [~tpalsulich] IMO empty rows could be reported too, but in our use-case, the critical thing is not to have jumbled records (caused by missing cells in a single row). > Excel 2010 parser missing cell values are not reported resulting in missing > columns values > -- > > Key: TIKA-1020 > URL: https://issues.apache.org/jira/browse/TIKA-1020 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.2 > Environment: java 1.6 & 1.7 >Reporter: Neil Blue >Priority: Major > Labels: newbie, patch > > When parting an excel 2010 table, if a worksheet has a missing value, then it > is not reported in the sax handler. As a result a missing value can result in > unordered data. > For example given the table: > {code:title=Bar.java|borderStyle=solid} > A B B > 1 2 3 > 4 6 > 7 8 9 > {code} > the returned sax handler reports elements > {code:title=Bar.java|borderStyle=solid} > ABC > 123 > 46 > 789 > {code} > As a result the handler can detect that the third row as incomplete cell > values but it is ambiguous which columns have missing data. > As a possible fix for this excel 2010 xml data contains the cell reference > value, which could be returned to the sax handler as an attribute. 
> {code:title=Bar.java|borderStyle=solid} > *** XSSFExcelExtractorDecorator.java2012-11-08 10:51:55.881207100 + > --- XSSFExcelExtractorDecorator.java.1 2012-11-08 10:59:02.972223700 + > *** > *** 200,206 > > public void cell(String cellRef, String formattedValue) { > try { > ! xhtml.startElement("td"); > >// Main cell contents >xhtml.characters(formattedValue); > --- 200,208 > > public void cell(String cellRef, String formattedValue) { > try { > ! AttributesImpl attributes = new AttributesImpl(); > ! attributes.addAttribute(null, "cellRef", "cellRef", null, > cellRef) ; > ! xhtml.startElement("td",attributes); > >// Main cell contents >xhtml.characters(formattedValue); > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (TIKA-1020) Excel 2010 parser missing cell values are not reported resulting in missing columns values
[ https://issues.apache.org/jira/browse/TIKA-1020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389884#comment-16389884 ] Radim Rehurek edited comment on TIKA-1020 at 3/7/18 5:56 PM: - We just hit this bug too. I say "bug" because Excel spreadsheets are really structured tables, just like [~arodkin] explained 5 years ago. Interpreting them as unorganized cells makes little sense. [~tpalsulich] empty rows could be reported too, but in our use-case, the critical thing is not to have jumbled records (caused by missing cells in a single row). was (Author: piskvorky): We just hit this bug too. I say "bug" because Excel spreadsheets are really tables with rows, just like [~arodkin] explained 5 years ago. Interpreting them as unorganized cells makes little sense. [~tpalsulich] empty rows could be reported too, but in our use-case, the critical thing is not to have jumbled records (caused by missing cells in a single row). > Excel 2010 parser missing cell values are not reported resulting in missing > columns values > -- > > Key: TIKA-1020 > URL: https://issues.apache.org/jira/browse/TIKA-1020 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.2 > Environment: java 1.6 & 1.7 >Reporter: Neil Blue >Priority: Major > Labels: newbie, patch > > When parting an excel 2010 table, if a worksheet has a missing value, then it > is not reported in the sax handler. As a result a missing value can result in > unordered data. > For example given the table: > {code:title=Bar.java|borderStyle=solid} > A B B > 1 2 3 > 4 6 > 7 8 9 > {code} > the returned sax handler reports elements > {code:title=Bar.java|borderStyle=solid} > ABC > 123 > 46 > 789 > {code} > As a result the handler can detect that the third row as incomplete cell > values but it is ambiguous which columns have missing data. > As a possible fix for this excel 2010 xml data contains the cell reference > value, which could be returned to the sax handler as an attribute. 
> {code:title=Bar.java|borderStyle=solid} > *** XSSFExcelExtractorDecorator.java2012-11-08 10:51:55.881207100 + > --- XSSFExcelExtractorDecorator.java.1 2012-11-08 10:59:02.972223700 + > *** > *** 200,206 > > public void cell(String cellRef, String formattedValue) { > try { > ! xhtml.startElement("td"); > >// Main cell contents >xhtml.characters(formattedValue); > --- 200,208 > > public void cell(String cellRef, String formattedValue) { > try { > ! AttributesImpl attributes = new AttributesImpl(); > ! attributes.addAttribute(null, "cellRef", "cellRef", null, > cellRef) ; > ! xhtml.startElement("td",attributes); > >// Main cell contents >xhtml.characters(formattedValue); > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-1020) Excel 2010 parser missing cell values are not reported resulting in missing columns values
[ https://issues.apache.org/jira/browse/TIKA-1020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389884#comment-16389884 ] Radim Rehurek commented on TIKA-1020: - We just hit this bug too. I say "bug" because Excel spreadsheets are really tables with rows, just like [~arodkin] explained 5 years ago. Interpreting them as unorganized cells makes little sense. [~tpalsulich] empty rows could be reported too, but in our use-case, the critical thing is not to have jumbled records (caused by missing cells in a single row). > Excel 2010 parser missing cell values are not reported resulting in missing > columns values > -- > > Key: TIKA-1020 > URL: https://issues.apache.org/jira/browse/TIKA-1020 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.2 > Environment: java 1.6 & 1.7 >Reporter: Neil Blue >Priority: Major > Labels: newbie, patch > > When parting an excel 2010 table, if a worksheet has a missing value, then it > is not reported in the sax handler. As a result a missing value can result in > unordered data. > For example given the table: > {code:title=Bar.java|borderStyle=solid} > A B B > 1 2 3 > 4 6 > 7 8 9 > {code} > the returned sax handler reports elements > {code:title=Bar.java|borderStyle=solid} > ABC > 123 > 46 > 789 > {code} > As a result the handler can detect that the third row as incomplete cell > values but it is ambiguous which columns have missing data. > As a possible fix for this excel 2010 xml data contains the cell reference > value, which could be returned to the sax handler as an attribute. > {code:title=Bar.java|borderStyle=solid} > *** XSSFExcelExtractorDecorator.java2012-11-08 10:51:55.881207100 + > --- XSSFExcelExtractorDecorator.java.1 2012-11-08 10:59:02.972223700 + > *** > *** 200,206 > > public void cell(String cellRef, String formattedValue) { > try { > ! 
xhtml.startElement("td"); > >// Main cell contents >xhtml.characters(formattedValue); > --- 200,208 > > public void cell(String cellRef, String formattedValue) { > try { > ! AttributesImpl attributes = new AttributesImpl(); > ! attributes.addAttribute(null, "cellRef", "cellRef", null, > cellRef) ; > ! xhtml.startElement("td",attributes); > >// Main cell contents >xhtml.characters(formattedValue); > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
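If the cell reference were exposed as proposed in the patch above, a consumer could use it to put each value back into its column. The following is only a sketch of that idea, not part of the patch: the class and method names are hypothetical, and it assumes references are plain upper-case A1-style strings such as "C3" (column letters followed by a row number).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Tika: rebuilds a row from (cellRef, value) pairs.
public class CellRefAligner {

    /** Converts the column letters of an A1-style reference (e.g. "C3") to a zero-based index. */
    public static int columnIndex(String cellRef) {
        int col = 0;
        for (int i = 0; i < cellRef.length(); i++) {
            char c = cellRef.charAt(i);
            if (c < 'A' || c > 'Z') {
                break; // reached the row number
            }
            col = col * 26 + (c - 'A' + 1);
        }
        return col - 1;
    }

    /** Places each value at the column named by its reference, padding skipped columns with "". */
    public static List<String> alignRow(List<String> cellRefs, List<String> values) {
        List<String> row = new ArrayList<>();
        for (int i = 0; i < cellRefs.size(); i++) {
            int col = columnIndex(cellRefs.get(i));
            while (row.size() < col) {
                row.add(""); // a cell the parser never reported
            }
            row.add(values.get(i));
        }
        return row;
    }

    public static void main(String[] args) {
        // Third row of the example table: only A3 and C3 are reported.
        System.out.println(alignRow(Arrays.asList("A3", "C3"), Arrays.asList("4", "6")));
        // prints [4, , 6]
    }
}
```

With the padding in place, the third row keeps its empty middle column, so "6" stays under the third header instead of sliding under the second.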
Re: Tika 1.18?
I thought about logging any custom-mimetype override that is applied, so the user is warned about it. We could additionally add a specific attribute to the mimetype definition XML to declare that an entry must override the default one instead of aborting. As for multiple conflicting custom mimes from different (external) projects, Tika currently aborts, so that is already a problem today. I think it needs additional discussion and should not be addressed in the next release. I will copy/paste this discussion into the jira issue. I would still like to see the detection of MTS videos fixed, but it conflicts with another existing mime glob. Is there any workaround for this specific case? If yes, I can open a different ticket. On Mar 2, 2018 at 18:23, "Nick Burch" wrote: On Fri, 2 Mar 2018, Luís Filipe Nassif wrote: > If I make no progress on TIKA-1466 until 3/9, you can start the release > process without it. But do you devs agree with the proposed change: allow > overriding of glob patterns in custom-mimetypes.xml? > What happens if you have two different custom files which both claim the same glob? We have historically been a bit stricter about built-in types overriding, in part to avoid people doing silly things by mistake, and in part to push people a bit more towards contributing fixes/enhancements for built-in types. I think the latter is less of a thing today, as we've a lot more covered as standard, so it's just the former we need to worry about. How do we help people know when they have conflicting overrides (possibly from different projects), help them sensibly merge or turn off Tika provided magic+definitions, and to alert them to when their copied + customised version probably wants updating following a tika upgrade giving a newer definition? Do a better job of those than we currently do now, then I'm very happy to +1 it :) Nick
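To make the "log the override instead of aborting" idea from this thread concrete, here is a rough sketch. It is not Tika's actual registry code: the class, its methods, and the two media type names are all hypothetical placeholders, chosen only to show a later glob registration replacing an earlier one while leaving a warning behind.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Tika's API: a later registration wins, and the override is recorded.
public class GlobOverrideRegistry {
    private final Map<String, String> globToType = new LinkedHashMap<>();
    private final List<String> warnings = new ArrayList<>();

    /** Registers a glob for a media type; replacing a different earlier type records a warning. */
    public void register(String glob, String mimeType, String source) {
        String previous = globToType.put(glob, mimeType);
        if (previous != null && !previous.equals(mimeType)) {
            warnings.add("glob " + glob + ": " + previous + " overridden by "
                    + mimeType + " (from " + source + ")");
        }
    }

    public String typeFor(String glob) {
        return globToType.get(glob);
    }

    public List<String> warnings() {
        return warnings;
    }

    public static void main(String[] args) {
        GlobOverrideRegistry registry = new GlobOverrideRegistry();
        // Placeholder type names; the real conflicting types are not named in the thread.
        registry.register("*.mts", "application/x-example-builtin", "built-in definitions");
        registry.register("*.mts", "video/x-example-custom", "custom-mimetypes.xml");
        registry.warnings().forEach(System.out::println);
    }
}
```

This only covers the warning half of the question raised above. Telling two conflicting custom files apart, or noticing that a copied definition has gone stale after an upgrade, would need more than a single map.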
[jira] [Commented] (TIKA-2600) Don't use md5 checksum due to changes to the release distribuition policy
[ https://issues.apache.org/jira/browse/TIKA-2600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389743#comment-16389743 ] Hudson commented on TIKA-2600: -- SUCCESS: Integrated in Jenkins build Tika-trunk #1450 (See [https://builds.apache.org/job/Tika-trunk/1450/]) TIKA-2600 -- remove md5 checksum, and switch sha-1 to sha-512 for (tallison: [https://github.com/apache/tika/commit/19017c91b245ebd72fefe005cd67d3da68037cc5]) * (edit) pom.xml > Don't use md5 checksum due to changes to the release distribuition policy > - > > Key: TIKA-2600 > URL: https://issues.apache.org/jira/browse/TIKA-2600 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Assignee: Tim Allison >Priority: Blocker > Fix For: 1.18, 2.0.0 > > > To plagiarize from PDFBOX-4142: > The release distribution policy was changes with regard to the checksums to > be used: > Old policy : > MUST provide a MD5-file > SHOULD provide a SHA-file [SHA-512 recommended] > New policy : > MUST provide a SHA- or MD5-file > SHOULD provide a SHA-file > SHOULD NOT provide a MD5-file > see http://www.apache.org/dev/release-distribution for further details -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2600) Don't use md5 checksum due to changes to the release distribuition policy
[ https://issues.apache.org/jira/browse/TIKA-2600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389694#comment-16389694 ] Hudson commented on TIKA-2600: -- FAILURE: Integrated in Jenkins build tika-2.x-windows #211 (See [https://builds.apache.org/job/tika-2.x-windows/211/]) TIKA-2600 -- remove md5 checksum, and switch sha-1 to sha-512 for (tallison: rev 19017c91b245ebd72fefe005cd67d3da68037cc5) * (edit) pom.xml > Don't use md5 checksum due to changes to the release distribuition policy > - > > Key: TIKA-2600 > URL: https://issues.apache.org/jira/browse/TIKA-2600 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Assignee: Tim Allison >Priority: Blocker > Fix For: 1.18, 2.0.0 > > > To plagiarize from PDFBOX-4142: > The release distribution policy was changes with regard to the checksums to > be used: > Old policy : > MUST provide a MD5-file > SHOULD provide a SHA-file [SHA-512 recommended] > New policy : > MUST provide a SHA- or MD5-file > SHOULD provide a SHA-file > SHOULD NOT provide a MD5-file > see http://www.apache.org/dev/release-distribution for further details -- This message was sent by Atlassian JIRA (v7.6.3#76005)
tika-2.x-windows - Build # 211 - Still Failing
The Apache Jenkins build system has built tika-2.x-windows (build #211) Status: Still Failing Check console output at https://builds.apache.org/job/tika-2.x-windows/211/ to view the results.
[jira] [Commented] (TIKA-2579) Update to PDFBox 2.0.9 when available
[ https://issues.apache.org/jira/browse/TIKA-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389677#comment-16389677 ] Tim Allison commented on TIKA-2579: --- Release cycle for PDFBox 2.0.9 is just getting under way. https://lists.apache.org/thread.html/63f4f538de8ba684a18c9514a64ebfb8fa30053dfb885e459ccd6741@%3Cdev.pdfbox.apache.org%3E > Update to PDFBox 2.0.9 when available > - > > Key: TIKA-2579 > URL: https://issues.apache.org/jira/browse/TIKA-2579 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 1.17 >Reporter: David Pilato >Assignee: Tim Allison >Priority: Major > > Hey team > > We got this report in elasticsearch ingest attachment project: > [https://github.com/elastic/elasticsearch/issues/27198] > Basically when a font is not available PDFBox is throwing an exception like > {{2017/10/31 00:01:13.348 [WARN ] [elasticsearch[test][bulk][T#3]] > [FontManager] Font not found: TimesNewRomanPS-BoldMT 2017/10/31 00:01:13.413 > [ERROR] [elasticsearch[test][bulk][T#3]] [TrueTypeFont] An error occured when > reading table cmap java.io.IOException: CMap subtype 14 not yet implemented > at > org.apache.fontbox.ttf.CMAPEncodingEntry.processSubtype14(CMAPEncodingEntry.java:304) > at > org.apache.fontbox.ttf.CMAPEncodingEntry.initSubtable(CMAPEncodingEntry.java:114) > at org.apache.fontbox.ttf.CMAPTable.initData(CMAPTable.java:100) at > org.apache.fontbox.ttf.TrueTypeFont.initializeTable(TrueTypeFont.java:280) at > org.apache.fontbox.ttf.AbstractTTFParser.parseTables(AbstractTTFParser.java:128) > at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:80) at > org.apache.fontbox.ttf.AbstractTTFParser.parseTTF(AbstractTTFParser.java:109) > at org.apache.fontbox.ttf.TTFParser.parseTTF(TTFParser.java:25) at > org.apache.fontbox.ttf.AbstractTTFParser.parseTTF(AbstractTTFParser.java:84) > at org.apache.fontbox.ttf.TTFParser.parseTTF(TTFParser.java:25) at > 
org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.getTTFFont(PDTrueTypeFont.java:632) > at > org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.getFontWidth(PDTrueTypeFont.java:673) > at > org.apache.pdfbox.pdmodel.font.PDSimpleFont.getFontWidth(PDSimpleFont.java:231) > at > org.apache.pdfbox.pdmodel.font.PDSimpleFont.getSpaceWidth(PDSimpleFont.java:533) > at > org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:355) > at > org.apache.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:62) > at > org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:557) > at > org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268) > at > org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235) > at > org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215) > at > org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:458) > at > org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:383) > at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:342) > at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:148) at > org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:148) at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) at > org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) at > org.apache.tika.Tika.parseToString(Tika.java:537)}} > This might have been solved by PDFParser with > https://issues.apache.org/jira/browse/PDFBOX-3997 which is available in > PDFBox 2.0.9 but Tika 1.17 is still using 2.0.8. See related issue > https://issues.apache.org/jira/browse/PDFBOX-3985. Unclear if that will > actually fix the problem reported but FWIW upgrading to 2.0.9 of PDFBox could > be useful. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2594) Mail detected as application/xhtml+xml
[ https://issues.apache.org/jira/browse/TIKA-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389661#comment-16389661 ] Tim Allison commented on TIKA-2594: --- add the following or is this too lenient? {noformat} {noformat} > Mail detected as application/xhtml+xml > -- > > Key: TIKA-2594 > URL: https://issues.apache.org/jira/browse/TIKA-2594 > Project: Tika > Issue Type: Bug >Affects Versions: 2.0, 1.16, 1.17 >Reporter: Andreas Meier >Priority: Major > Attachments: TestMail_inline_xhtml_plus_image.eml > > > The attached mail (message/rfc822) with inline xhtml is recognized as > application/xhtml+xml > Regards > Andreas -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (TIKA-2600) Don't use md5 checksum due to changes to the release distribuition policy
[ https://issues.apache.org/jira/browse/TIKA-2600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-2600. --- Resolution: Fixed Assignee: Tim Allison Fix Version/s: 2.0.0 1.18 > Don't use md5 checksum due to changes to the release distribuition policy > - > > Key: TIKA-2600 > URL: https://issues.apache.org/jira/browse/TIKA-2600 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Assignee: Tim Allison >Priority: Blocker > Fix For: 1.18, 2.0.0 > > > To plagiarize from PDFBOX-4142: > The release distribution policy was changes with regard to the checksums to > be used: > Old policy : > MUST provide a MD5-file > SHOULD provide a SHA-file [SHA-512 recommended] > New policy : > MUST provide a SHA- or MD5-file > SHOULD provide a SHA-file > SHOULD NOT provide a MD5-file > see http://www.apache.org/dev/release-distribution for further details -- This message was sent by Atlassian JIRA (v7.6.3#76005)
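For anyone who wants to verify a downloaded artifact against a .sha512 file, the digest can be computed with the JDK alone. This is a minimal sketch assuming the artifact is already in memory as a byte array; the class name and the sample input are made up for illustration, and it is not the mechanism the Tika build itself uses to produce checksums.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative helper: computes a lowercase hex SHA-512 digest with java.security.MessageDigest.
public class Sha512Checksum {

    public static String sha512Hex(byte[] data) {
        MessageDigest digest;
        try {
            digest = MessageDigest.getInstance("SHA-512");
        } catch (NoSuchAlgorithmException e) {
            // SHA-512 is expected on standard JDKs; treat its absence as a broken runtime.
            throw new IllegalStateException("SHA-512 not available", e);
        }
        byte[] hash = digest.digest(data);
        StringBuilder hex = new StringBuilder(hash.length * 2);
        for (byte b : hash) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        // Stand-in bytes; in practice these would be read from the release artifact.
        byte[] artifact = "example artifact bytes".getBytes(StandardCharsets.UTF_8);
        System.out.println(sha512Hex(artifact));
    }
}
```

The resulting string is 128 hex characters long and can be compared with the contents of the published checksum file.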
[jira] [Commented] (TIKA-2598) Fix dependency convergence
[ https://issues.apache.org/jira/browse/TIKA-2598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389631#comment-16389631 ] Hudson commented on TIKA-2598: -- FAILURE: Integrated in Jenkins build tika-2.x-windows #210 (See [https://builds.apache.org/job/tika-2.x-windows/210/]) TIKA-2598 -- unbreak the build (sorry, again!), fix missing javacpp (tallison: rev 474122bef3d906f81b91729a970a6ad7b5639a5c) * (edit) tika-dl/pom.xml > Fix dependency convergence > -- > > Key: TIKA-2598 > URL: https://issues.apache.org/jira/browse/TIKA-2598 > Project: Tika > Issue Type: Improvement > Components: packaging >Affects Versions: 1.17 >Reporter: Guillaume Smet >Assignee: Tim Allison >Priority: Blocker > Fix For: 2.0, 1.18 > > > Hi, > We tried to upgrade Tika to 1.17 in Hibernate Search and we had some > dependency convergence issues: > {code} > Dependency convergence error for > com.healthmarketscience.jackcess:jackcess:2.1.8 paths to dependency are: > +-org.hibernate:hibernate-search-engine:5.10.0-SNAPSHOT > +-org.apache.tika:tika-parsers:1.17 > +-com.healthmarketscience.jackcess:jackcess:2.1.8 > and > +-org.hibernate:hibernate-search-engine:5.10.0-SNAPSHOT > +-org.apache.tika:tika-parsers:1.17 > +-com.healthmarketscience.jackcess:jackcess-encrypt:2.1.2 > +-com.healthmarketscience.jackcess:jackcess:2.1.0 > {code} > We could fix them downstream in Hibernate Search but I thought it would be > better if Tika could ensure the convergence of its dependencies using the > Maven enforcer plugin so that all the downstream projects can benefit from it. > Thanks. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
tika-2.x-windows - Build # 210 - Still Failing
The Apache Jenkins build system has built tika-2.x-windows (build #210) Status: Still Failing Check console output at https://builds.apache.org/job/tika-2.x-windows/210/ to view the results.
[jira] [Commented] (TIKA-2598) Fix dependency convergence
[ https://issues.apache.org/jira/browse/TIKA-2598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389599#comment-16389599 ] Hudson commented on TIKA-2598: -- SUCCESS: Integrated in Jenkins build tika-branch-1x #4 (See [https://builds.apache.org/job/tika-branch-1x/4/]) TIKA-2598 -- unbreak the build (sorry, again!), fix missing javacpp (tallison: [https://github.com/apache/tika/commit/8163b598a73733554a8a87bde10a562291e4ec79]) * (edit) tika-dl/pom.xml > Fix dependency convergence > -- > > Key: TIKA-2598 > URL: https://issues.apache.org/jira/browse/TIKA-2598 > Project: Tika > Issue Type: Improvement > Components: packaging >Affects Versions: 1.17 >Reporter: Guillaume Smet >Assignee: Tim Allison >Priority: Blocker > Fix For: 2.0, 1.18 > > > Hi, > We tried to upgrade Tika to 1.17 in Hibernate Search and we had some > dependency convergence issues: > {code} > Dependency convergence error for > com.healthmarketscience.jackcess:jackcess:2.1.8 paths to dependency are: > +-org.hibernate:hibernate-search-engine:5.10.0-SNAPSHOT > +-org.apache.tika:tika-parsers:1.17 > +-com.healthmarketscience.jackcess:jackcess:2.1.8 > and > +-org.hibernate:hibernate-search-engine:5.10.0-SNAPSHOT > +-org.apache.tika:tika-parsers:1.17 > +-com.healthmarketscience.jackcess:jackcess-encrypt:2.1.2 > +-com.healthmarketscience.jackcess:jackcess:2.1.0 > {code} > We could fix them downstream in Hibernate Search but I thought it would be > better if Tika could ensure the convergence of its dependencies using the > Maven enforcer plugin so that all the downstream projects can benefit from it. > Thanks. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2598) Fix dependency convergence
[ https://issues.apache.org/jira/browse/TIKA-2598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389595#comment-16389595 ] Hudson commented on TIKA-2598: -- SUCCESS: Integrated in Jenkins build Tika-trunk #1448 (See [https://builds.apache.org/job/Tika-trunk/1448/]) TIKA-2598 -- unbreak the build (sorry, again!), fix missing javacpp (tallison: [https://github.com/apache/tika/commit/474122bef3d906f81b91729a970a6ad7b5639a5c]) * (edit) tika-dl/pom.xml > Fix dependency convergence > -- > > Key: TIKA-2598 > URL: https://issues.apache.org/jira/browse/TIKA-2598 > Project: Tika > Issue Type: Improvement > Components: packaging >Affects Versions: 1.17 >Reporter: Guillaume Smet >Assignee: Tim Allison >Priority: Blocker > Fix For: 2.0, 1.18 > > > Hi, > We tried to upgrade Tika to 1.17 in Hibernate Search and we had some > dependency convergence issues: > {code} > Dependency convergence error for > com.healthmarketscience.jackcess:jackcess:2.1.8 paths to dependency are: > +-org.hibernate:hibernate-search-engine:5.10.0-SNAPSHOT > +-org.apache.tika:tika-parsers:1.17 > +-com.healthmarketscience.jackcess:jackcess:2.1.8 > and > +-org.hibernate:hibernate-search-engine:5.10.0-SNAPSHOT > +-org.apache.tika:tika-parsers:1.17 > +-com.healthmarketscience.jackcess:jackcess-encrypt:2.1.2 > +-com.healthmarketscience.jackcess:jackcess:2.1.0 > {code} > We could fix them downstream in Hibernate Search but I thought it would be > better if Tika could ensure the convergence of its dependencies using the > Maven enforcer plugin so that all the downstream projects can benefit from it. > Thanks. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (TIKA-2600) Don't use md5 checksum due to changes to the release distribuition policy
[ https://issues.apache.org/jira/browse/TIKA-2600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389556#comment-16389556 ] Tim Allison edited comment on TIKA-2600 at 3/7/18 1:47 PM: --- I'm pretty sure we discussed this at some point, but I can't quickly find what we decided. Apologies if this is a duplicate issue... Should we stop including md5 and swap SHA1 (with file ext: .sha) for SHA-512 (with file ext: .sha512)? was (Author: talli...@mitre.org): I'm pretty sure we discussed this at some point, but I can't quickly find what we decided. Apologies if this is a duplicate issue... Should we stop including md5 and swap SHA1 (with file ext: .sha) for SHA512 (with file ext: .sha512)? > Don't use md5 checksum due to changes to the release distribuition policy > - > > Key: TIKA-2600 > URL: https://issues.apache.org/jira/browse/TIKA-2600 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Blocker > > To plagiarize from PDFBOX-4142: > The release distribution policy was changes with regard to the checksums to > be used: > Old policy : > MUST provide a MD5-file > SHOULD provide a SHA-file [SHA-512 recommended] > New policy : > MUST provide a SHA- or MD5-file > SHOULD provide a SHA-file > SHOULD NOT provide a MD5-file > see http://www.apache.org/dev/release-distribution for further details -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2600) Don't use md5 checksum due to changes to the release distribuition policy
[ https://issues.apache.org/jira/browse/TIKA-2600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389556#comment-16389556 ] Tim Allison commented on TIKA-2600: --- I'm pretty sure we discussed this at some point, but I can't quickly find what we decided. Apologies if this is a duplicate issue... Should we stop including md5 and swap SHA1 (with file ext: .sha) for SHA512 (with file ext: .sha512)? > Don't use md5 checksum due to changes to the release distribuition policy > - > > Key: TIKA-2600 > URL: https://issues.apache.org/jira/browse/TIKA-2600 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Blocker > > To plagiarize from PDFBOX-4142: > The release distribution policy was changes with regard to the checksums to > be used: > Old policy : > MUST provide a MD5-file > SHOULD provide a SHA-file [SHA-512 recommended] > New policy : > MUST provide a SHA- or MD5-file > SHOULD provide a SHA-file > SHOULD NOT provide a MD5-file > see http://www.apache.org/dev/release-distribution for further details -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (TIKA-2600) Don't use md5 checksum due to changes to the release distribuition policy
Tim Allison created TIKA-2600: - Summary: Don't use md5 checksum due to changes to the release distribuition policy Key: TIKA-2600 URL: https://issues.apache.org/jira/browse/TIKA-2600 Project: Tika Issue Type: Task Reporter: Tim Allison To plagiarize from PDFBOX-4142: The release distribution policy was changed with regard to the checksums to be used: Old policy : MUST provide a MD5-file SHOULD provide a SHA-file [SHA-512 recommended] New policy : MUST provide a SHA- or MD5-file SHOULD provide a SHA-file SHOULD NOT provide a MD5-file see http://www.apache.org/dev/release-distribution for further details -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-1518) Docker with Tika Server
[ https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389532#comment-16389532 ] Tim Allison commented on TIKA-1518: --- Hi [~davemeikle], with the new dockerfile-maven-plugin, I'm getting the following. I'm behind a proxy, and I'm on windows, but you'd think localhost would work?! Any recommendations? Thank you!
{noformat}
[INFO] --- dockerfile-maven-plugin:1.3.7:build (default) @ tika-server ---
[INFO] Building Docker context C:\Users\tallison\Idea Projects\tika-asf2-git-2.x\tika-server
[INFO]
[INFO] Image will be built as apache/tika-server:2.0.0-SNAPSHOT
[INFO]
[WARNING] An attempt failed, will retry 1 more times
org.apache.maven.plugin.MojoExecutionException: Could not build image
at com.spotify.plugin.dockerfile.BuildMojo.buildImage(BuildMojo.java:185)
at com.spotify.plugin.dockerfile.BuildMojo.execute(BuildMojo.java:105)
at com.spotify.plugin.dockerfile.AbstractDockerMojo.tryExecute(AbstractDockerMojo.java:246)
at com.spotify.plugin.dockerfile.AbstractDockerMojo.execute(AbstractDockerMojo.java:235)
at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:134)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:208)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:154)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:146)
at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:117)
at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:81)
at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51)
at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:309)
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:194)
at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:107)
at org.apache.maven.cli.MavenCli.execute(MavenCli.java:993)
at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:345)
at org.apache.maven.cli.MavenCli.main(MavenCli.java:191)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
Caused by: com.spotify.docker.client.exceptions.DockerException: java.util.concurrent.ExecutionException: com.spotify.docker.client.shaded.javax.ws.rs.ProcessingException: org.apache.http.conn.HttpHostConnectException: Connect to localhost:2375 [localhost/127.0.0.1, localhost/0:0:0:0:0:0:0:1] failed: Connection refused: connect
at com.spotify.docker.client.DefaultDockerClient.propagate(DefaultDockerClient.java:2512)
at com.spotify.docker.client.DefaultDockerClient.request(DefaultDockerClient.java:2443)
at com.spotify.docker.client.DefaultDockerClient.version(DefaultDockerClient.java:501)
at com.spotify.docker.client.DefaultDockerClient.authRegistryHeader(DefaultDockerClient.java:2555)
at com.spotify.docker.client.DefaultDockerClient.build(DefaultDockerClient.java:1396)
at com.spotify.docker.client.DefaultDockerClient.build(DefaultDockerClient.java:1365)
at com.spotify.plugin.dockerfile.BuildMojo.buildImage(BuildMojo.java:178)
... 25 more
Caused by: java.util.concurrent.ExecutionException: com.spotify.docker.client.shaded.javax.ws.rs.ProcessingException: org.apache.http.conn.HttpHostConnectException: Connect to localhost:2375 [localhost/127.0.0.1, localhost/0:0:0:0:0:0:0:1] failed: Connection refused: connect
at jersey.repackaged.com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299)
at jersey.repackaged.com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286)
at jersey.repackaged.com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
at com.spotify.docker.client.DefaultDockerClient.request(DefaultDockerClient.java:2441)
... 30 more
Caused by: com.spotify.docker.client.shaded.javax.ws.rs.ProcessingException:
[jira] [Commented] (TIKA-2598) Fix dependency convergence
[ https://issues.apache.org/jira/browse/TIKA-2598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389357#comment-16389357 ] Guillaume Smet commented on TIKA-2598: -- Hi [~talli...@mitre.org], Sorry for the delay. So to fix the issue, you can use exclusions as you did. The drawback of this approach is that, if a new dependency adds a component with yet another version, you need to add new exclusions. The other option is to use a {{}} section in your parent pom. All the dependencies defined in this section will have the fixed version you define, and it will enforce that to the transitive dependencies. It's usually the recommended approach, but seeing your patch, it looks like using exclusions is not that bad in your case. Thanks for the quick action on this! > Fix dependency convergence > -- > > Key: TIKA-2598 > URL: https://issues.apache.org/jira/browse/TIKA-2598 > Project: Tika > Issue Type: Improvement > Components: packaging >Affects Versions: 1.17 >Reporter: Guillaume Smet >Assignee: Tim Allison >Priority: Blocker > Fix For: 2.0, 1.18 > > > Hi, > We tried to upgrade Tika to 1.17 in Hibernate Search and we had some > dependency convergence issues: > {code} > Dependency convergence error for > com.healthmarketscience.jackcess:jackcess:2.1.8 paths to dependency are: > +-org.hibernate:hibernate-search-engine:5.10.0-SNAPSHOT > +-org.apache.tika:tika-parsers:1.17 > +-com.healthmarketscience.jackcess:jackcess:2.1.8 > and > +-org.hibernate:hibernate-search-engine:5.10.0-SNAPSHOT > +-org.apache.tika:tika-parsers:1.17 > +-com.healthmarketscience.jackcess:jackcess-encrypt:2.1.2 > +-com.healthmarketscience.jackcess:jackcess:2.1.0 > {code} > We could fix them downstream in Hibernate Search but I thought it would be > better if Tika could ensure the convergence of its dependencies using the > Maven enforcer plugin so that all the downstream projects can benefit from it. > Thanks. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
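As a rough illustration of what a convergence failure like the one quoted above means, the sketch below groups resolved coordinates by groupId:artifactId and reports any artifact that shows up with more than one version. It is a toy model under simplifying assumptions (coordinates are plain "group:artifact:version" strings) and is not how the Maven enforcer plugin is implemented; the class and method names are invented for this example.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model of a convergence check; not the Maven enforcer plugin's implementation.
public class ConvergenceCheck {

    /** Returns groupId:artifactId -> versions for every artifact seen with more than one version. */
    public static Map<String, Set<String>> conflicts(List<String> coordinates) {
        Map<String, Set<String>> versions = new LinkedHashMap<>();
        for (String gav : coordinates) {
            int lastColon = gav.lastIndexOf(':');
            String ga = gav.substring(0, lastColon);
            String version = gav.substring(lastColon + 1);
            versions.computeIfAbsent(ga, k -> new LinkedHashSet<>()).add(version);
        }
        // Keep only artifacts that were resolved with at least two different versions.
        versions.values().removeIf(v -> v.size() < 2);
        return versions;
    }

    public static void main(String[] args) {
        // Coordinates taken from the two dependency paths in the report.
        List<String> resolved = Arrays.asList(
                "org.hibernate:hibernate-search-engine:5.10.0-SNAPSHOT",
                "org.apache.tika:tika-parsers:1.17",
                "com.healthmarketscience.jackcess:jackcess:2.1.8",
                "com.healthmarketscience.jackcess:jackcess-encrypt:2.1.2",
                "com.healthmarketscience.jackcess:jackcess:2.1.0");
        System.out.println(conflicts(resolved));
        // prints {com.healthmarketscience.jackcess:jackcess=[2.1.8, 2.1.0]}
    }
}
```

Either fix discussed in the comment, exclusions or pinning the version in the parent pom, amounts to making that map come back empty.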