[jira] [Reopened] (TIKA-2623) get embedded resources in PDF/doc files

2018-04-09 Thread Ohad R (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ohad R reopened TIKA-2623:
--

keep this "open" until merge

https://github.com/apache/tika/pull/233

> get embedded resources in PDF/doc files
> ---
>
> Key: TIKA-2623
> URL: https://issues.apache.org/jira/browse/TIKA-2623
> Project: Tika
>  Issue Type: Improvement
>  Components: cli, core, parser
>Reporter: Ohad R
>Priority: Trivial
> Fix For: 1.18
>
>
> The motivation: support embedded files in PDF, Word's doc/docx, etc.
> According to 
> [https://stackoverflow.com/questions/20172465/get-embedded-resourses-in-doc-files-using-apache-tika],
> it is possible to recursively parse a document and save its sub-items (e.g. 
> images) in a folder thanks to FileEmbeddedDocumentExtractor. However, that 
> class is only accessible within TikaCLI.
> I think it should be visible to the applications that use Tika (not only to 
> the CLI).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


RE: TIKA-1509 (2.x breaking parser change) - ready for first review!

2018-04-09 Thread Nick Burch

On Tue, 10 Apr 2018, Allison, Timothy B. wrote:
> It looks like you merged to master, which, I think, is the base for 
> 2.0.0-SNAPSHOT.  I've been treating branch_1x as the master for 1.x.[1]


Ah, I'd thought that the 2.x branch (with the tika-parser-bundles / 
tika-parser-modules folders) was the one for 2.x, and master was still for 
1.x. I haven't applied any of my other fixes to branch_1x.



> Any objections to me cutting 1.18-SNAPSHOT from branch_1x?


As long as that has all the other fixes on it, none from me. I can merge my 
multi-parser work over to branch_1x next week to try in 1.19.


Nick


RE: TIKA-1509 (2.x breaking parser change) - ready for first review!

2018-04-09 Thread Allison, Timothy B.
Nick,
  It looks like you merged to master, which, I think, is the base for 
2.0.0-SNAPSHOT.  I've been treating branch_1x as the master for 1.x.[1]
  Any objections to me cutting 1.18-SNAPSHOT from branch_1x?

Best,

 Tim
  
[1] 
https://lists.apache.org/thread.html/12342a115623d157063eb9f40064ccf21561cdab5cfb327f3f368aca@%3Cdev.tika.apache.org%3E

-----Original Message-----
From: Nick Burch [mailto:apa...@gagravarr.org] 
Sent: Sunday, April 8, 2018 8:47 AM
To: dev@tika.apache.org
Subject: Re: TIKA-1509 (2.x breaking parser change) - ready for first review!

In the absence of complaints, I've gone ahead and merged this to Tika's master 
branch for 1.x.  If I've done it right, there won't be any breaking changes for 
1.18, as everything is either new or marked as deprecated pending finalisation.

I haven't merged to 2.x yet, as it'd be good to get some feedback on the 
proposed additional Parser parse method that takes a ContentHandlerFactory 
(to go alongside the long-standing ContentHandler one for simpler cases).

Nick
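
A minimal Java sketch of the proposed second parse method described above, under stated assumptions: SketchParser and HandlerFactory are hypothetical stand-ins (the latter for ContentHandlerFactory), not Tika's actual interfaces, and only the JDK's SAX classes are used. It shows one way a default overload could let most parsers grab a single handler from the factory and carry on as before.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import org.xml.sax.ContentHandler;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical stand-in for a content handler factory. */
interface HandlerFactory {
    ContentHandler newHandler();
}

/** Hypothetical parser interface with the existing and the proposed method. */
interface SketchParser {
    // Existing style: the caller supplies one handler.
    void parse(InputStream stream, ContentHandler handler) throws Exception;

    // Proposed style: the caller supplies a factory. By default a parser
    // takes a single handler from it and delegates to the method above.
    default void parse(InputStream stream, HandlerFactory factory) throws Exception {
        parse(stream, factory.newHandler());
    }
}

final class FactoryParseSketch {
    /** Runs a toy parser through the factory overload and returns the text it emitted. */
    static String demo() {
        StringBuilder out = new StringBuilder();
        // Factory whose handlers collect character data into 'out'.
        HandlerFactory factory = () -> new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
        };
        // Toy parser that ignores its input and emits a fixed string.
        SketchParser parser = (stream, handler) -> {
            char[] text = "hello".toCharArray();
            handler.startDocument();
            handler.characters(text, 0, text.length);
            handler.endDocument();
        };
        try {
            parser.parse(new ByteArrayInputStream(new byte[0]), factory);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }
}
```

A parser that needs several handlers (for example, one per candidate charset) would override the factory overload instead of relying on the default.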

On Sun, 18 Mar 2018, Chris Mattmann wrote:
> Completely agree, awesome job Nick.
>
> I will definitely try this week as well.
>
> Thank you!
>
> Sincerely,
> Chris
>
>
>
> On 3/18/18, 2:47 PM, "David Meikle" wrote:
>
>Nice one Nick!  Will take a look this week.
>
>Cheers,
>Dave
>
>On 14 March 2018 at 17:38, Nick Burch  wrote:
>
>> Hi All
>>
>> As promised, I've finally had a go to try and implement my ideas for
>> TIKA-1509 / https://wiki.apache.org/tika/CompositeParserDiscussion /
>> breaking 2.x parser change
>>
>> My work so far is in this github branch, and is ready for review!
>> https://github.com/apache/tika/tree/multiple-parsers
>>
>>
>> It seems to work fine for the Fallback case, and for the Supplemental
>> case. You can set a policy that controls how clashing metadata is handled,
>> currently "first one to set a key wins", "last one to set a key wins",
>> "ignore previous parsers", and "keep old and new unique values"
>>
>> I've also done a proof of concept for "pick best" case, to try running the
>> text parser with a specified set of different charsets, capture the text
>> from each, "pick the best" (hard coded 1st...) then run for real with that
>> one.
>>
>>
>> Key TODOs - Support InputStreamFactory, properly work out what mimetypes
>> to claim to support, Tika Config XML friendly helper for the metadata clash
>> policy, review ContentHandlerFactory signature and tweak if needed.
>>
>> Proposed breaking 2.x change - add second parse method that takes
>> ContentHandlerFactory instead of ContentHandler, with most parsers getting
>> that just grabbing a single one and using that as before
>>
>>
>> Before I do any more though... Thoughts? Comments? Ideas? Changes? Should
>> I stop? Carry on? Modify it? Other?
>>
>> Nick
>>
>
>
>
>
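
The four metadata clash policies listed in the quoted proposal might behave roughly as in the Java sketch below. This is an illustration only: the enum, class and method names are hypothetical and are not taken from the multiple-parsers branch, and reading "ignore previous parsers" as "drop everything earlier parsers set" is an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical policy names mirroring the four options in the proposal. */
enum ClashPolicy { FIRST_WINS, LAST_WINS, IGNORE_PREVIOUS, KEEP_UNIQUE }

final class MetadataMergeSketch {
    /**
     * Merges the metadata set by earlier parsers ('previous') with the
     * metadata set by the most recent parser ('latest').
     */
    static Map<String, List<String>> merge(Map<String, List<String>> previous,
                                           Map<String, List<String>> latest,
                                           ClashPolicy policy) {
        Map<String, List<String>> merged = new LinkedHashMap<>();
        // Assumption: IGNORE_PREVIOUS discards all earlier metadata.
        if (policy != ClashPolicy.IGNORE_PREVIOUS) {
            previous.forEach((k, v) -> merged.put(k, new ArrayList<>(v)));
        }
        for (Map.Entry<String, List<String>> e : latest.entrySet()) {
            String key = e.getKey();
            List<String> existing = merged.get(key);
            if (existing == null) {
                // No clash: the key is new, so every policy accepts it.
                merged.put(key, new ArrayList<>(e.getValue()));
                continue;
            }
            switch (policy) {
                case LAST_WINS:
                    merged.put(key, new ArrayList<>(e.getValue()));
                    break;
                case KEEP_UNIQUE:
                    for (String v : e.getValue()) {
                        if (!existing.contains(v)) {
                            existing.add(v);
                        }
                    }
                    break;
                default:
                    // FIRST_WINS keeps the earlier values untouched.
                    break;
            }
        }
        return merged;
    }
}
```

For a "title" key set to "A" by an earlier parser and "B" by a later one, this sketch yields [A] for FIRST_WINS, [B] for LAST_WINS and [A, B] for KEEP_UNIQUE.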


[jira] [Commented] (TIKA-2091) regression: Zip bomb detected! for HTML file

2018-04-09 Thread Harinder (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431001#comment-16431001
 ] 

Harinder commented on TIKA-2091:


Thanks Tim, appreciate you taking the time to respond to this 1+ year old issue.

> regression: Zip bomb detected! for HTML file
> 
>
> Key: TIKA-2091
> URL: https://issues.apache.org/jira/browse/TIKA-2091
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.13
> Environment: Debian jessie Linux, Oracle Java 8
>Reporter: Rodrigo Rosenfeld Rosas
>Priority: Major
>
> Hi, while discussing an issue on Solr's mailing list it was suggested to me 
> to open a ticket here. Please let me know if this is not the proper place for 
> such a ticket.
> After upgrading to the latest Solr, this document is no longer indexing properly 
> in Solr. They told me they upgraded Tika from 1.7 to 1.13 in Solr 6.2. Before 
> the upgrade this document was indexed as expected:
> https://www.sec.gov/Archives/edgar/data/1472033/000119380513001310/e611133_f6ef-eutelsat.htm
> I hope a fix can make it in time for 1.14 ;)
> Cheers.





[jira] [Commented] (TIKA-2091) regression: Zip bomb detected! for HTML file

2018-04-09 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430932#comment-16430932
 ] 

Tim Allison commented on TIKA-2091:
---

Sorry...I'm not aware of an easy way to do this via Solr binaries and 
configuration.  I was working in an IDE and could easily modify the htmlmapper.

> regression: Zip bomb detected! for HTML file
> 
>
> Key: TIKA-2091
> URL: https://issues.apache.org/jira/browse/TIKA-2091
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.13
> Environment: Debian jessie Linux, Oracle Java 8
>Reporter: Rodrigo Rosenfeld Rosas
>Priority: Major
>
> Hi, while discussing an issue on Solr's mailing list it was suggested to me 
> to open a ticket here. Please let me know if this is not the proper place for 
> such a ticket.
> After upgrading to the latest Solr, this document is no longer indexing properly 
> in Solr. They told me they upgraded Tika from 1.7 to 1.13 in Solr 6.2. Before 
> the upgrade this document was indexed as expected:
> https://www.sec.gov/Archives/edgar/data/1472033/000119380513001310/e611133_f6ef-eutelsat.htm
> I hope a fix can make it in time for 1.14 ;)
> Cheers.





[jira] [Updated] (TIKA-2627) Exception thrown when max string length is reached

2018-04-09 Thread Caleb Ott (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Caleb Ott updated TIKA-2627:

Priority: Major  (was: Minor)

> Exception thrown when max string length is reached
> --
>
> Key: TIKA-2627
> URL: https://issues.apache.org/jira/browse/TIKA-2627
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.17
> Environment: Windows 2012 R2
> Java 1.8.0_151
>Reporter: Caleb Ott
>Priority: Major
> Attachments: ExceptionStacktrace.txt
>
>
> I have set the max string length and expected tika to parse up to that limit 
> then return me the text. However, for certain files it appears that once that 
> limit is reached, instead of returning the text parsed so far, it is throwing 
> an exception.
> It looks like the WriteLimitReachedException is being wrapped in another 
> exception which is why it is not being caught.
> Attached is the stack trace I am getting.





[jira] [Created] (TIKA-2627) Exception thrown when max string length is reached

2018-04-09 Thread Caleb Ott (JIRA)
Caleb Ott created TIKA-2627:
---

 Summary: Exception thrown when max string length is reached
 Key: TIKA-2627
 URL: https://issues.apache.org/jira/browse/TIKA-2627
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.17
 Environment: Windows 2012 R2

Java 1.8.0_151
Reporter: Caleb Ott
 Attachments: ExceptionStacktrace.txt

I have set the max string length and expected tika to parse up to that limit 
then return me the text. However, for certain files it appears that once that 
limit is reached, instead of returning the text parsed so far, it is throwing 
an exception.

It looks like the WriteLimitReachedException is being wrapped in another 
exception which is why it is not being caught.

Attached is the stack trace I am getting.
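
A minimal Java sketch of the situation the report describes, under stated assumptions: LimitReached is a hypothetical stand-in for the write-limit exception, parseInto is a toy parser, and none of this is Tika's actual code. It illustrates why a limit exception wrapped inside another exception is missed by a check on the outer type alone, and how walking the cause chain lets the caller still return the text parsed so far.

```java
final class WriteLimitSketch {
    /** Hypothetical stand-in for a "write limit reached" signal. */
    static final class LimitReached extends RuntimeException {
        LimitReached() {
            super("write limit reached");
        }
    }

    /** Returns true if 't' or any exception in its cause chain is of the given type. */
    static boolean causedBy(Throwable t, Class<? extends Throwable> type) {
        int depth = 0;
        // The depth cap guards against cyclic cause chains.
        while (t != null && depth++ < 32) {
            if (type.isInstance(t)) {
                return true;
            }
            t = t.getCause();
        }
        return false;
    }

    /** Toy parser: copies characters until the limit, then fails with a wrapped exception. */
    static void parseInto(StringBuilder out, String text, int limit) {
        for (int i = 0; i < text.length(); i++) {
            if (out.length() >= limit) {
                // The limit signal is wrapped, as in the reported stack trace.
                throw new IllegalStateException("parse failed", new LimitReached());
            }
            out.append(text.charAt(i));
        }
    }

    /** Returns the text parsed so far when the limit is hit; rethrows other failures. */
    static String extract(String text, int limit) {
        StringBuilder out = new StringBuilder();
        try {
            parseInto(out, text, limit);
        } catch (RuntimeException e) {
            if (!causedBy(e, LimitReached.class)) {
                throw e;
            }
        }
        return out.toString();
    }
}
```

With this sketch, WriteLimitSketch.extract("abcdefgh", 3) returns "abc" rather than propagating the wrapped exception.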





[jira] [Resolved] (TIKA-2623) get embedded resources in PDF/doc files

2018-04-09 Thread Ohad R (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ohad R resolved TIKA-2623.
--
   Resolution: Fixed
Fix Version/s: 1.18

[https://github.com/apache/tika/pull/233]

 

> get embedded resources in PDF/doc files
> ---
>
> Key: TIKA-2623
> URL: https://issues.apache.org/jira/browse/TIKA-2623
> Project: Tika
>  Issue Type: Improvement
>  Components: cli, core, parser
>Reporter: Ohad R
>Priority: Trivial
> Fix For: 1.18
>
>
> The motivation: support embedded files in PDF, Word's doc/docx, etc.
> According to 
> [https://stackoverflow.com/questions/20172465/get-embedded-resourses-in-doc-files-using-apache-tika],
> it is possible to recursively parse a document and save its sub-items (e.g. 
> images) in a folder thanks to FileEmbeddedDocumentExtractor. However, that 
> class is only accessible within TikaCLI.
> I think it should be visible to the applications that use Tika (not only to 
> the CLI).


