[GitHub] tika pull request: add applyProbabilitySelection mechanism

2015-01-18 Thread LukeLiush
GitHub user LukeLiush closed the pull request at:

https://github.com/apache/tika/pull/23


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] tika pull request: add applyProbabilities mechanism

2015-01-18 Thread LukeLiush
GitHub user LukeLiush opened a pull request:

https://github.com/apache/tika/pull/24

add applyProbabilities mechanism

reformat code for change

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/LukeLiush/tika mimeDetection

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/24.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #24


commit 2353f01072e94b893fd9373fb96f11d16474433e
Author: LukeLiush hanson311...@gmail.com
Date:   2015-01-19T02:52:48Z

add applyProbabilities mechanism






Tika Server docker image

2015-01-18 Thread Konstantin Gribov
Hi, folks.

This is in the context of https://issues.apache.org/jira/browse/TIKA-1518
(create a Docker image with Tika Server).

There's no Apache Docker registry (see INFRA-9035 and INFRA-8441), and as far
as I know there's no Docker Hub integration with Apache repos. So there's
currently no way to create an official Docker build.

Is an unofficial image with an automated build a reasonable answer to
TIKA-1518, since we can't provide official images yet?

-- 
Best regards,
Konstantin Gribov


[jira] [Comment Edited] (TIKA-1511) Create a parser for SQLite3

2015-01-18 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14281086#comment-14281086
 ] 

Luis Filipe Nassif edited comment on TIKA-1511 at 1/18/15 2:09 PM:
---

If the inputStream (pseudoInputStream) received by EmbeddedDocExtractor cannot
be read, I think using EDE is not useful. How will this approach work with the
TikaCli --extract option? My original idea was to support a use case like
TikaCli --extract...

Now I think this extraction of tables to files can be done by handling the db as
one big doc and using a ContentHandlerDecorator that will split the xhtml
output at table boundaries. Each xhtml segment can be converted to a byte[] (if
small) and then to a ByteArrayInputStream that can be handled by an
EmbeddedDocDecorator, if one is set on the parseContext. If none is set, the
ContentHandlerDecorator does not need to split tables and can fall back to the
default behavior. A custom EDE can then extract tables to files if desired.

So now I think we could go with the big doc approach. What do you think?
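
A minimal sketch of the table-splitting idea described above. It uses plain
SAX (org.xml.sax from the JDK) as a stand-in for Tika's
ContentHandlerDecorator, drops element attributes, and assumes tables are not
nested. The class name TableSplitSketch and its split method are hypothetical
and are not taken from the attached patches:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: collects every <table>...</table> segment of a SAX
// event stream as its own byte[]. Plain SAX stands in for Tika's
// ContentHandlerDecorator; attributes are dropped for brevity.
class TableSplitSketch extends DefaultHandler {
    private final List<byte[]> segments = new ArrayList<>();
    private StringBuilder current; // non-null only while inside a table

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if ("table".equals(qName)) {
            current = new StringBuilder(); // a new segment starts here
        }
        if (current != null) {
            current.append('<').append(qName).append('>');
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current == null) {
            return; // text outside any table is ignored by this sketch
        }
        for (int i = start; i < start + length; i++) {
            char c = ch[i];
            if (c == '&') current.append("&amp;");
            else if (c == '<') current.append("&lt;");
            else if (c == '>') current.append("&gt;");
            else current.append(c);
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if (current == null) {
            return;
        }
        current.append("</").append(qName).append('>');
        if ("table".equals(qName)) {
            // Table boundary reached: freeze the segment as bytes.
            segments.add(current.toString().getBytes(StandardCharsets.UTF_8));
            current = null;
        }
    }

    // Parses the given XHTML and returns one byte[] per table found.
    static List<byte[]> split(String xhtml) {
        TableSplitSketch handler = new TableSplitSketch();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)),
                    handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XHTML", e);
        }
        return handler.segments;
    }

    public static void main(String[] args) {
        String xhtml = "<body><table><tr><td>a</td></tr></table>"
                + "<p>between</p><table><tr><td>b</td></tr></table></body>";
        for (byte[] segment : split(xhtml)) {
            // Each segment could be wrapped in a ByteArrayInputStream and
            // handed to an embedded-document extractor.
            System.out.println(new String(segment, StandardCharsets.UTF_8));
        }
    }
}
```

Buffering a whole table in memory only suits small tables, which matches the
"if small" caveat above; a large table would need a streaming hand-off instead.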


was (Author: lfcnassif):
If the inputStream (pseudoInputStream) received by EmbeddedDocExtractor cannot
be read, I think using EDE is not useful. How will this approach work with the
TikaCli --extract option? My original idea was to support a use case like
TikaCli --extract...

Now I think this extraction of tables to files can be done by handling the db as
one big doc and using a ContentHandlerDecorator that will split the xhtml
output at table boundaries. Each xhtml segment can be converted to a byte[] (if
small) and then to a ByteArrayInputStream that can be passed to an
EmbeddedDocDecorator, if set on parseContext. If not set, the
ContentHandlerDecorator does not need to split tables and can fall back to the
default behavior. A custom EDE can then extract tables to files if desired.

So now I think we could go with the big doc approach. What do you think?

 Create a parser for SQLite3
 ---

 Key: TIKA-1511
 URL: https://issues.apache.org/jira/browse/TIKA-1511
 Project: Tika
  Issue Type: New Feature
  Components: parser
Affects Versions: 1.6
Reporter: Luis Filipe Nassif
 Fix For: 1.8

 Attachments: TIKA-1511v1.patch, TIKA-1511v2.patch, testSQLLite3b.db


 I think it would be very useful, as sqlite is used as data storage by a wide 
 range of applications. Opening the ticket to track it. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (TIKA-1521) Handle password protected 7zip files

2015-01-18 Thread Nick Burch (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Burch resolved TIKA-1521.
--
   Resolution: Fixed
Fix Version/s: 1.8

Support and test added in r1652869.

Until COMPRESS-269 is resolved, the way we detect a password protected 7zip
file isn't ideal. If the password we're given is wrong, we'll just silently
report no output, but again that's a Commons Compress thing (it'll report no
entries). For now it largely works though!
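
A rough sketch of the control flow described above, under loud assumptions:
the types here are hypothetical stand-ins, not Tika's PasswordProvider or any
Commons Compress class. The provider is modelled as a function from a file
name to a password (or null), and entriesSeen models the number of entries the
archive library reports after opening:

```java
import java.util.Optional;
import java.util.function.Function;

// Hypothetical stand-in types, NOT Tika's PasswordProvider or Commons
// Compress classes. The provider may be absent (null) altogether.
class SevenZipPasswordSketch {

    // Returns the password to try, or empty when no provider was supplied
    // or the provider has nothing for this file.
    static Optional<String> passwordFor(String fileName,
                                        Function<String, String> provider) {
        if (provider == null) {
            return Optional.empty();
        }
        return Optional.ofNullable(provider.apply(fileName));
    }

    // Summarises what a caller would observe for one archive.
    static String outcome(boolean encrypted, String password, int entriesSeen) {
        if (!encrypted) {
            return "parsed";
        }
        if (password == null) {
            return "encrypted, no password supplied";
        }
        // With zero entries reported, a wrong password cannot be told
        // apart from an archive that really is empty.
        return entriesSeen == 0
                ? "no output (wrong password or empty archive)"
                : "parsed";
    }

    public static void main(String[] args) {
        Function<String, String> provider =
                name -> name.endsWith(".7z") ? "secret" : null;
        String password = passwordFor("archive.7z", provider).orElse(null);
        System.out.println(outcome(true, password, 3));
        System.out.println(outcome(true, "wrong", 0));
    }
}
```

The last branch is the non-ideal part: when zero entries come back, nothing in
this flow can say whether the password was wrong.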

 Handle password protected 7zip files
 

 Key: TIKA-1521
 URL: https://issues.apache.org/jira/browse/TIKA-1521
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.7
Reporter: Nick Burch
 Fix For: 1.8


 While working on TIKA-1028, I noticed that while Commons Compress doesn't 
 currently handle decrypting password protected zip files, it does handle 
 password protected 7zip files.
 We should therefore add logic into the package parser to spot password 
 protected 7zip files, and fetch the password for them from a PasswordProvider 
 if given.





[jira] [Commented] (TIKA-1521) Handle password protected 7zip files

2015-01-18 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14282025#comment-14282025
 ] 

Hudson commented on TIKA-1521:
--

UNSTABLE: Integrated in tika-trunk-jdk1.7 #440 (See
[https://builds.apache.org/job/tika-trunk-jdk1.7/440/])
TIKA-1521 Support password protected 7zip files (nick:
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1652869)
* /tika/trunk/CHANGES.txt
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/pkg/PackageParser.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/pkg/Seven7ParserTest.java
* /tika/trunk/tika-parsers/src/test/resources/test-documents/test7Z_protected_passTika.7z







[jira] [Commented] (TIKA-1028) Tika-server quits parsing of rfc-822 document prematurely when it encounters encrypted zip file as attachment.

2015-01-18 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14282024#comment-14282024
 ] 

Hudson commented on TIKA-1028:
--

UNSTABLE: Integrated in tika-trunk-jdk1.7 #440 (See
[https://builds.apache.org/job/tika-trunk-jdk1.7/440/])
TIKA-1028 Refactor the RFC822 parser to set up recursion once per file, not once
per attachment, and get it so that a non-encrypted zip attachment is correctly
extracted. (Commons Compress currently lacks password protected zip support.)
(nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1652866)
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/mail/MailContentHandler.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/mail/RFC822ParserTest.java
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testRFC822_normal_zip


 Tika-server quits parsing of rfc-822 document prematurely when it encounters 
 encrypted zip file as attachment.
 --

 Key: TIKA-1028
 URL: https://issues.apache.org/jira/browse/TIKA-1028
 Project: Tika
  Issue Type: Bug
  Components: mime, parser, server
Affects Versions: 1.2, 1.3, 1.4, 1.5, 1.6, 1.7
Reporter: Juha Haaga
 Fix For: 1.8

 Attachments: Document.zip, test.eml


 The Zip parser in tika-server does not allow passing in the password for 
 decrypting the zip file and doesn't handle the unsupported feature 
 gracefully. The problem occurs when a zip file is attached as part of an email 
 document being parsed, and the parser gives up and throws an exception:
 WARNING: all: Unpacker failed
 org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pkg.PackageParser@10fcc945
 Caused by: org.apache.commons.compress.archivers.zip.UnsupportedZipFeatureException: unsupported feature encryption used in entry
 Instead of returning the successfully parsed components, Tika-server returns 
 nothing. 
 It would be better to return the rest of the parsed document contents along 
 with the untouched offending zip file in the archive that Tika-server returns 
 as a result. Until zip file decryption is added, this would always return the 
 untouched zip file; after it is implemented, it should return the untouched 
 zip file in the cases where the wrong password was provided.





[jira] [Commented] (TIKA-1521) Handle password protected 7zip files

2015-01-18 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14282033#comment-14282033
 ] 

Hudson commented on TIKA-1521:
--

UNSTABLE: Integrated in tika-trunk-jdk1.6 #425 (See
[https://builds.apache.org/job/tika-trunk-jdk1.6/425/])
TIKA-1521 Support password protected 7zip files (nick:
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1652869)
* /tika/trunk/CHANGES.txt
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/pkg/PackageParser.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/pkg/Seven7ParserTest.java
* /tika/trunk/tika-parsers/src/test/resources/test-documents/test7Z_protected_passTika.7z







[jira] [Resolved] (TIKA-1028) Tika-server quits parsing of rfc-822 document prematurely when it encounters encrypted zip file as attachment.

2015-01-18 Thread Nick Burch (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Burch resolved TIKA-1028.
--
   Resolution: Fixed
Fix Version/s: 1.8

As of r1652866, I think we've got it working as well as we can for now. Because 
Commons Compress doesn't currently support decrypting password protected zips, 
we can't get the contents of the zip entries even with the password. However, 
we do now show the zip entry names, we don't abort, and we do manage to get the 
text of a .txt in a normal .zip in an RFC822 mail attachment.
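
The "we don't abort" behavior can be illustrated with a small sketch. This is
not the code in MailContentHandler: the class AttachmentLoopSketch, the
parseAll method, and the parser function are all hypothetical, and a
RuntimeException stands in for whatever the real parser throws on an encrypted
zip:

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch, not Tika's MailContentHandler: each attachment is
// parsed on its own, and a failure is recorded instead of aborting the mail.
class AttachmentLoopSketch {

    // Maps each attachment name to its extracted text, or to a marker
    // when the (assumed) parser function throws for that attachment.
    static Map<String, String> parseAll(Map<String, byte[]> attachments,
                                        Function<byte[], String> parser) {
        Map<String, String> results = new LinkedHashMap<>();
        for (Map.Entry<String, byte[]> entry : attachments.entrySet()) {
            try {
                results.put(entry.getKey(), parser.apply(entry.getValue()));
            } catch (RuntimeException e) {
                // Keep the name so the caller still sees the attachment.
                results.put(entry.getKey(), "[unparsed: " + e.getMessage() + "]");
            }
        }
        return results;
    }

    public static void main(String[] args) {
        Map<String, byte[]> attachments = new LinkedHashMap<>();
        attachments.put("note.txt", "hello".getBytes(StandardCharsets.UTF_8));
        attachments.put("secret.zip", new byte[0]);

        // Toy parser: an empty payload stands in for an encrypted zip.
        Function<byte[], String> parser = bytes -> {
            if (bytes.length == 0) {
                throw new IllegalStateException("encrypted");
            }
            return new String(bytes, StandardCharsets.UTF_8);
        };
        parseAll(attachments, parser)
                .forEach((name, text) -> System.out.println(name + " -> " + text));
    }
}
```

Catching per attachment rather than around the whole loop is what lets the
readable parts of the mail survive one unreadable attachment.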






[jira] [Created] (TIKA-1521) Handle password protected 7zip files

2015-01-18 Thread Nick Burch (JIRA)
Nick Burch created TIKA-1521:


 Summary: Handle password protected 7zip files
 Key: TIKA-1521
 URL: https://issues.apache.org/jira/browse/TIKA-1521
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.7
Reporter: Nick Burch


While working on TIKA-1028, I noticed that while Commons Compress doesn't 
currently handle decrypting password protected zip files, it does handle 
password protected 7zip files.

We should therefore add logic into the package parser to spot password 
protected 7zip files, and fetch the password for them from a PasswordProvider 
if given.


