[GitHub] tika pull request: add applyProbabilitySelection mechanism
GitHub user LukeLiush closed the pull request at: https://github.com/apache/tika/pull/23 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] tika pull request: add applyProbabilities mechanism
GitHub user LukeLiush opened a pull request: https://github.com/apache/tika/pull/24 add applyProbabilities mechanism reformat code for change You can merge this pull request into a Git repository by running: $ git pull https://github.com/LukeLiush/tika mimeDetection Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/24.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #24 commit 2353f01072e94b893fd9373fb96f11d16474433e Author: LukeLiush hanson311...@gmail.com Date: 2015-01-19T02:52:48Z add applyProbabilities mechanism
Tika Server docker image
Hi, folks. This is in the context of https://issues.apache.org/jira/browse/TIKA-1518 (create a Docker image with Tika Server). There is no Apache Docker registry (see INFRA-9035 and INFRA-8441), and as far as I know there is no Docker Hub integration with Apache repos, so there is currently no way to create an official Docker build. Is an unofficial image with an automated build a reasonable answer to TIKA-1518, since we can't provide official images yet? -- Best regards, Konstantin Gribov
[jira] [Comment Edited] (TIKA-1511) Create a parser for SQLite3
[ https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14281086#comment-14281086 ] Luis Filipe Nassif edited comment on TIKA-1511 at 1/18/15 2:09 PM: --- If the inputStream (pseudoInputStream) received by EmbeddedDocExtractor cannot be read, I think using EDE is not useful. How will this approach work with the TikaCli --extract option? My original idea was to support a use case like TikaCli --extract... Now I think this extraction of tables to files can be done by handling the db as one big doc and using a ContentHandlerDecorator that splits the xhtml output at table boundaries. Each xhtml segment can be converted to a byte[] (if small) and then to a ByteArrayInputStream that can be handled by an EmbeddedDocDecorator, if one is set in the parseContext. If none is set, the ContentHandlerDecorator does not need to split tables and can fall back to the default behavior. A custom EDE can then extract tables to files if desired. So now I think we could go with the big doc approach. What do you think? was (Author: lfcnassif): If the inputStream (pseudoInputStream) received by EmbeddedDocExtractor cannot be read, I think using EDE is not useful. How will this approach work with the TikaCli --extract option? My original idea was to support a use case like TikaCli --extract... Now I think this extraction of tables to files can be done by handling the db as one big doc and using a ContentHandlerDecorator that splits the xhtml output at table boundaries. Each xhtml segment can be converted to a byte[] (if small) and then to a ByteArrayInputStream that can be passed to an EmbeddedDocDecorator, if one is set on the parseContext. If none is set, the ContentHandlerDecorator does not need to split tables and can fall back to the default behavior. A custom EDE can then extract tables to files if desired. So now I think we could go with the big doc approach. What do you think?
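The "split the xhtml output at table boundaries" idea in the comment above can be sketched roughly as follows. This is only an illustration and not the patch attached to the ticket: it uses the JDK's own SAX DefaultHandler instead of Tika's ContentHandlerDecorator so that it runs without Tika on the classpath, and the names TableSplittingHandler and split are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Tika: collects every outermost <table>
// element of an XHTML document as its own UTF-8 byte[] segment.
class TableSplittingHandler extends DefaultHandler {
    private final List<byte[]> tables = new ArrayList<>();
    private StringBuilder current; // non-null only while inside a <table>
    private int depth;             // nesting depth of <table> elements

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("table".equals(qName) && depth++ == 0) {
            current = new StringBuilder(); // a new outermost table starts here
        }
        if (current != null) {
            current.append('<').append(qName).append('>'); // attributes are dropped in this sketch
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current == null) {
            return; // text outside tables is ignored
        }
        for (int i = start; i < start + length; i++) {
            char c = ch[i];
            if (c == '&') current.append("&amp;");
            else if (c == '<') current.append("&lt;");
            else if (c == '>') current.append("&gt;");
            else current.append(c);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current == null) {
            return;
        }
        current.append("</").append(qName).append('>');
        if ("table".equals(qName) && --depth == 0) {
            // Table boundary reached: emit the segment and reset.
            tables.add(current.toString().getBytes(StandardCharsets.UTF_8));
            current = null;
        }
    }

    // Parses the given XHTML string and returns one byte[] per outermost table.
    static List<byte[]> split(String xhtml) {
        TableSplittingHandler handler = new TableSplittingHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XHTML", e);
        }
        return handler.tables;
    }
}
```

Each returned segment can be wrapped in a new ByteArrayInputStream(segment) and handed to whatever embedded-document extractor is configured, which is the hand-off the comment describes.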
Create a parser for SQLite3 --- Key: TIKA-1511 URL: https://issues.apache.org/jira/browse/TIKA-1511 Project: Tika Issue Type: New Feature Components: parser Affects Versions: 1.6 Reporter: Luis Filipe Nassif Fix For: 1.8 Attachments: TIKA-1511v1.patch, TIKA-1511v2.patch, testSQLLite3b.db I think it would be very useful, as sqlite is used as data storage by a wide range of applications. Opening the ticket to track it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TIKA-1521) Handle password protected 7zip files
[ https://issues.apache.org/jira/browse/TIKA-1521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch resolved TIKA-1521. -- Resolution: Fixed Fix Version/s: 1.8 Support and test added in r1652869. Until COMPRESS-269 is resolved, the way to detect a password protected 7zip file isn't ideal. If we have a password given that's wrong, silently we'll just report no output, but again that's a Commons Compress thing (it'll report no entries). For now it largely works though! Handle password protected 7zip files Key: TIKA-1521 URL: https://issues.apache.org/jira/browse/TIKA-1521 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.7 Reporter: Nick Burch Fix For: 1.8 While working on TIKA-1028, I notice that while Commons Compress doesn't currently handle decrypting password protected zip files, it does handle password protected 7zip files We should therefore add logic into the package parser to spot password protected 7zip files, and fetch the password for them from a PasswordProvider if given
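As a rough illustration of the flow described above (spot a protected archive, fetch a password from a provider if one was given, and get no entries when the password is wrong), here is a minimal self-contained sketch. It does not use Tika or Commons Compress: PasswordLookup and FakeArchive are hypothetical stand-ins, no real decryption happens, and throwing when no provider is available is an assumption of this sketch rather than something stated in the ticket.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class SevenZipPasswordSketch {
    // Hypothetical stand-in for a caller-supplied password provider.
    interface PasswordLookup {
        String passwordFor(String archiveName); // may return null
    }

    // Hypothetical in-memory model of a 7zip archive; nothing is decrypted here.
    static final class FakeArchive {
        final String name;
        final String password; // null means the archive is not protected
        final List<String> entries;

        FakeArchive(String name, String password, String... entries) {
            this.name = name;
            this.password = password;
            this.entries = Arrays.asList(entries);
        }

        // Mimics the behaviour described above: a wrong password yields
        // no entries rather than an error.
        List<String> open(String candidate) {
            if (password == null || password.equals(candidate)) {
                return entries;
            }
            return Collections.emptyList();
        }
    }

    // Lists the entry names, asking the lookup for a password if one was given.
    static List<String> listEntries(FakeArchive archive, PasswordLookup lookup) {
        if (archive.password != null && lookup == null) {
            // Assumption of this sketch: fail loudly when nothing can supply a password.
            throw new IllegalStateException(archive.name + " is password protected and no password was given");
        }
        String candidate = (lookup == null) ? null : lookup.passwordFor(archive.name);
        return archive.open(candidate);
    }
}
```

The wrong-password branch is the awkward case noted in the resolution: the caller cannot tell an empty archive from a bad password without extra information.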
[jira] [Commented] (TIKA-1521) Handle password protected 7zip files
[ https://issues.apache.org/jira/browse/TIKA-1521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14282025#comment-14282025 ] Hudson commented on TIKA-1521: -- UNSTABLE: Integrated in tika-trunk-jdk1.7 #440 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/440/]) TIKA-1521 Support password protected 7zip files (nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1652869) * /tika/trunk/CHANGES.txt * /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/pkg/PackageParser.java * /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/pkg/Seven7ParserTest.java * /tika/trunk/tika-parsers/src/test/resources/test-documents/test7Z_protected_passTika.7z Handle password protected 7zip files Key: TIKA-1521 URL: https://issues.apache.org/jira/browse/TIKA-1521 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.7 Reporter: Nick Burch Fix For: 1.8 While working on TIKA-1028, I notice that while Commons Compress doesn't currently handle decrypting password protected zip files, it does handle password protected 7zip files We should therefore add logic into the package parser to spot password protected 7zip files, and fetch the password for them from a PasswordProvider if given
[jira] [Commented] (TIKA-1028) Tika-server quits parsing of rfc-822 document prematurely when it encounters encrypted zip file as attachment.
[ https://issues.apache.org/jira/browse/TIKA-1028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14282024#comment-14282024 ] Hudson commented on TIKA-1028: -- UNSTABLE: Integrated in tika-trunk-jdk1.7 #440 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/440/]) TIKA-1028 Refactor the RFC822 parser to setup recursion once per file, not once per attachment, and get it so that a non-encrypted zip attachment is correctly extracted. (Commons Compress currently lacks password protected zip support (nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1652866) * /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/mail/MailContentHandler.java * /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/mail/RFC822ParserTest.java * /tika/trunk/tika-parsers/src/test/resources/test-documents/testRFC822_normal_zip Tika-server quits parsing of rfc-822 document prematurely when it encounters encrypted zip file as attachment. -- Key: TIKA-1028 URL: https://issues.apache.org/jira/browse/TIKA-1028 Project: Tika Issue Type: Bug Components: mime, parser, server Affects Versions: 1.2, 1.3, 1.4, 1.5, 1.6, 1.7 Reporter: Juha Haaga Fix For: 1.8 Attachments: Document.zip, test.eml The Zip parser in tika-server does not allow passing in the password for decrypting the zip file and doesn't handle the unsupported feature gracefully. Problem happens when zip file is attached part of email document being parsed, and the parser gives up and throws an exception: WARNING: all: Unpacker failed org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pkg.PackageParser@10fcc945 Caused by: org.apache.commons.compress.archivers.zip.UnsupportedZipFeatureException: unsupported feature encryption used in entry Instead of returning the successfully parsed components, Tika-server returns nothing.
It would be better to return rest of the parsed document contents along with the untouched offending zip file in the archive that Tika-server returns as a result. Until the feature of zip file decrypting is added this would always return untouched zip file, and after it is implemented it should return the untouched zip file in the cases where wrong password was provided.
[jira] [Commented] (TIKA-1521) Handle password protected 7zip files
[ https://issues.apache.org/jira/browse/TIKA-1521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14282033#comment-14282033 ] Hudson commented on TIKA-1521: -- UNSTABLE: Integrated in tika-trunk-jdk1.6 #425 (See [https://builds.apache.org/job/tika-trunk-jdk1.6/425/]) TIKA-1521 Support password protected 7zip files (nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1652869) * /tika/trunk/CHANGES.txt * /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/pkg/PackageParser.java * /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/pkg/Seven7ParserTest.java * /tika/trunk/tika-parsers/src/test/resources/test-documents/test7Z_protected_passTika.7z Handle password protected 7zip files Key: TIKA-1521 URL: https://issues.apache.org/jira/browse/TIKA-1521 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.7 Reporter: Nick Burch Fix For: 1.8 While working on TIKA-1028, I notice that while Commons Compress doesn't currently handle decrypting password protected zip files, it does handle password protected 7zip files We should therefore add logic into the package parser to spot password protected 7zip files, and fetch the password for them from a PasswordProvider if given
[jira] [Resolved] (TIKA-1028) Tika-server quits parsing of rfc-822 document prematurely when it encounters encrypted zip file as attachment.
[ https://issues.apache.org/jira/browse/TIKA-1028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch resolved TIKA-1028. -- Resolution: Fixed Fix Version/s: 1.8 As of r1652866, I think we've got it working as well as we can for now. Because Commons Compress doesn't currently support decrypting password protected zips, we can't get the contents of the zip entries even with the password. However, we do now show the zip entry names, we don't abort, and we do manage to get the text of a .txt in a normal .zip in a rfc822 mail attachment Tika-server quits parsing of rfc-822 document prematurely when it encounters encrypted zip file as attachment. -- Key: TIKA-1028 URL: https://issues.apache.org/jira/browse/TIKA-1028 Project: Tika Issue Type: Bug Components: mime, parser, server Affects Versions: 1.2, 1.3, 1.4, 1.5, 1.6, 1.7 Reporter: Juha Haaga Fix For: 1.8 Attachments: Document.zip, test.eml The Zip parser in tika-server does not allow passing in the password for decrypting the zip file and doesn't handle the unsupported feature gracefully. Problem happens when zip file is attached part of email document being parsed, and the parser gives up and throws an exception: WARNING: all: Unpacker failed org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pkg.PackageParser@10fcc945 Caused by: org.apache.commons.compress.archivers.zip.UnsupportedZipFeatureException: unsupported feature encryption used in entry Instead of returning the successfully parsed components, Tika-server returns nothing. It would be better to return rest of the parsed document contents along with the untouched offending zip file in the archive that Tika-server returns as a result. Until the feature of zip file decrypting is added this would always return untouched zip file, and after it is implemented it should return the untouched zip file in the cases where wrong password was provided. 
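The behaviour the resolution describes (do not abort the whole mail when one attachment cannot be read, and still return what was parsed) can be sketched like this. It is a simplified illustration and not the code from r1652866: AttachmentParser, Result and parseAll are hypothetical names, and the real change lives in the RFC822 parser's recursion setup.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class AttachmentLoopSketch {
    // Hypothetical stand-in for whatever parses a single attachment.
    interface AttachmentParser {
        String extractText(String name, byte[] data) throws Exception;
    }

    // Text of the attachments that parsed, plus the names of those that did not.
    static final class Result {
        final Map<String, String> text = new LinkedHashMap<>();
        final List<String> skipped = new ArrayList<>();
    }

    // Parses every attachment; a failure on one is recorded and the loop continues.
    static Result parseAll(Map<String, byte[]> attachments, AttachmentParser parser) {
        Result result = new Result();
        for (Map.Entry<String, byte[]> entry : attachments.entrySet()) {
            try {
                result.text.put(entry.getKey(), parser.extractText(entry.getKey(), entry.getValue()));
            } catch (Exception e) {
                // For example an encrypted zip: keep its name, drop nothing else.
                result.skipped.add(entry.getKey());
            }
        }
        return result;
    }
}
```

Keeping the names of the skipped attachments mirrors the point that the zip entry names are still shown even when their contents cannot be decrypted.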
[jira] [Created] (TIKA-1521) Handle password protected 7zip files
Nick Burch created TIKA-1521: Summary: Handle password protected 7zip files Key: TIKA-1521 URL: https://issues.apache.org/jira/browse/TIKA-1521 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.7 Reporter: Nick Burch While working on TIKA-1028, I notice that while Commons Compress doesn't currently handle decrypting password protected zip files, it does handle password protected 7zip files We should therefore add logic into the package parser to spot password protected 7zip files, and fetch the password for them from a PasswordProvider if given