[jira] [Updated] (TIKA-2463) Parsing large tiff produces StackOverflowError

2017-09-12 Thread Mike Cantrell (JIRA)

 [ https://issues.apache.org/jira/browse/TIKA-2463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mike Cantrell updated TIKA-2463:

Attachment: test.zip

test.zip containing the example test.tiff file

> Parsing large tiff produces StackOverflowError
> --
>
> Key: TIKA-2463
> URL: https://issues.apache.org/jira/browse/TIKA-2463
> Project: Tika
>  Issue Type: Bug
>Reporter: Mike Cantrell
> Attachments: test.zip
>
>
> java -jar tika-app-1.16.jar test.tiff 
> (attaching tiff to issue) 
> {code}
> Exception in thread "main" java.lang.StackOverflowError
>   at java.lang.StringCoding$StringDecoder.decode(StringCoding.java:153)
>   at java.lang.StringCoding.decode(StringCoding.java:193)
>   at java.lang.StringCoding.decode(StringCoding.java:254)
>   at java.lang.String.<init>(String.java:546)
>   at com.drew.lang.RandomAccessReader.getNullTerminatedString(RandomAccessReader.java:405)
>   at com.drew.imaging.tiff.TiffReader.processTag(TiffReader.java:267)
>   at com.drew.imaging.tiff.TiffReader.processIfd(TiffReader.java:224)
>   at com.drew.imaging.tiff.TiffReader.processIfd(TiffReader.java:244)
>   at com.drew.imaging.tiff.TiffReader.processIfd(TiffReader.java:244)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (TIKA-2463) Parsing large tiff produces StackOverflowError

2017-09-12 Thread Mike Cantrell (JIRA)
Mike Cantrell created TIKA-2463:
---

 Summary: Parsing large tiff produces StackOverflowError
 Key: TIKA-2463
 URL: https://issues.apache.org/jira/browse/TIKA-2463
 Project: Tika
  Issue Type: Bug
Reporter: Mike Cantrell


java -jar tika-app-1.16.jar test.tiff 

(attaching tiff to issue) 

{code}
Exception in thread "main" java.lang.StackOverflowError
at java.lang.StringCoding$StringDecoder.decode(StringCoding.java:153)
at java.lang.StringCoding.decode(StringCoding.java:193)
at java.lang.StringCoding.decode(StringCoding.java:254)
at java.lang.String.<init>(String.java:546)
at com.drew.lang.RandomAccessReader.getNullTerminatedString(RandomAccessReader.java:405)
at com.drew.imaging.tiff.TiffReader.processTag(TiffReader.java:267)
at com.drew.imaging.tiff.TiffReader.processIfd(TiffReader.java:224)
at com.drew.imaging.tiff.TiffReader.processIfd(TiffReader.java:244)
at com.drew.imaging.tiff.TiffReader.processIfd(TiffReader.java:244)
{code}
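
The repeated processIfd frames at TiffReader.java:244 show the method re-entering itself, one stack frame per directory it follows. Whether the attached file contains a very long chain of directories, deep nesting, or a cyclic offset is not established in this report. The sketch below is only an illustration of the general failure mode and one common remedy (an iterative walk with a visited set). It is not metadata-extractor's or Tika's code: the class IfdChainWalk, its methods, and the int[] model of "next directory" links are all hypothetical.

```java
import java.util.BitSet;

/**
 * Hypothetical model of a chain of TIFF-like directories (IFDs).
 * next[i] is the index of the directory that follows directory i, or -1 at the end.
 * None of these names come from metadata-extractor or Tika.
 */
public class IfdChainWalk {

    /**
     * Recursive walk: uses one stack frame per directory, so a very long
     * (or cyclic) chain can end in a StackOverflowError.
     */
    public static int countRecursive(int[] next, int current) {
        if (current < 0 || current >= next.length) {
            return 0;
        }
        return 1 + countRecursive(next, next[current]);
    }

    /**
     * Iterative walk: constant stack depth. The visited set makes the loop
     * stop when a directory is reached a second time (a cycle).
     */
    public static int countIterative(int[] next, int start) {
        BitSet visited = new BitSet(next.length);
        int current = start;
        while (current >= 0 && current < next.length && !visited.get(current)) {
            visited.set(current);
            current = next[current];
        }
        return visited.cardinality();
    }

    /** Builds a straight chain 0 -> 1 -> ... -> (length - 1) -> end. */
    public static int[] linearChain(int length) {
        int[] next = new int[length];
        for (int i = 0; i < length; i++) {
            next[i] = (i + 1 < length) ? i + 1 : -1;
        }
        return next;
    }

    public static void main(String[] args) {
        int[] chain = linearChain(1_000_000);
        System.out.println("iterative count: " + countIterative(chain, 0));
        try {
            System.out.println("recursive count: " + countRecursive(chain, 0));
        } catch (StackOverflowError e) {
            // Depth at which this happens depends on the JVM and its -Xss setting.
            System.out.println("recursive walk overflowed the stack");
        }
    }
}
```

How deep the recursive version gets before failing depends on the JVM and the thread stack size (the standard -Xss option), so raising the stack size may only postpone the error for a sufficiently long or cyclic chain.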





[jira] [Created] (TIKA-2462) Add a parser for sas7bdat

2017-09-12 Thread Tim Allison (JIRA)
Tim Allison created TIKA-2462:
-

 Summary: Add a parser for sas7bdat
 Key: TIKA-2462
 URL: https://issues.apache.org/jira/browse/TIKA-2462
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison


EPAM recently agreed to migrate to Apache 2.0 so that we can incorporate parso 
into Tika for sas7bdat files: https://github.com/epam/parso/issues/19 !!!





Re: Tika 2.0?

2017-09-12 Thread Chris Mattmann
B it is, proceed



On 9/12/17, 5:10 AM, "Allison, Timothy B."  wrote:

I'd strongly advocate for 2.  I _think_ the hard work was laying out the 
general structure and adding the ProxyParser workaround.  Copying and 
pasting/reworking into that structure will be: 

A) far less dangerous than 1 
And
B) we'll have a cleaner history.

On A), I know that we didn't add some major components including: 
configurability of parsers, completely cleaned up logging, numerous bug fixes 
and even entire modules (tika-dl).

On B), there were a few times where I "caught a parser up" in 2.0 not by 
individual commits based on the original history but based on a copy/paste from 
the contemporaneous master.  This obliterated the history of some commits on 
the 2.0 branch and would force us to look back at master.

-Original Message-
From: Bob Paulin [mailto:b...@bobpaulin.com] 
Sent: Monday, September 11, 2017 9:48 PM
To: dev@tika.apache.org
Subject: Re: Tika 2.0?

Just so it's clear are we going to:

1) Rename the 2.0 branch over to master

or

2) Re-apply the changes on master. 

I recall Chris' preference was 1, which would be quicker.  However, there are 
very likely missed patches.  2 will be more time consuming but it would be more 
likely to include all the most recent code.  I'm open to either.  Not sure how 
far out of date 2.0 branch is so I defer to Tim on the risk of going with #1.


- Bob


On 9/11/2017 5:15 PM, Chris Mattmann wrote:
> +1000
>
>
>
> On 9/11/17, 12:03 PM, "Allison, Timothy B."  wrote:
>
> Y, well, I didn't say _which_ September...
> 
> Given my limited availability to work on this in Sept and POI's decision 
> to move to Java 1.8, I propose releasing Tika 1.17 after the release of POI 
> 3.17 and PDFBox 2.0.8.  This would be the last version of Tika at the Java 
> 1.7 level, and then we bump the Java requirement to 1.8, switch master to the 
> 2.0 layout and create a 1.x maintenance branch (with Java 1.8) for quick 
> critical bug fixes/security vulnerabilities until we release 2.0.
> 
> What do you all think?
> 
>  
> -Original Message-
> From: Allison, Timothy B. [mailto:talli...@mitre.org] 
> Sent: Monday, August 28, 2017 9:33 AM
> To: dev@tika.apache.org
> Subject: Tika 2.0?
> 
> All,
> 
>   We're getting some increasing deltas btwn the 2.0 and trunk branches.  
> Many of these are my fault; I gave up making updates to 2.0 around April/May, 
> I think.
> 
>   What would people think of punting on some of the desired goals of 2.0 
> (e.g. chaining parsers, more structured but still simple metadata) and 
> releasing 2.0 soonish...say 2.0-BETA end of September?
> 
>   We've been able to make some major improvements to Tika without 
> breaking backwards compatibility.  We _might_ be able to do that with the 
> outstanding issues for 2.0 when someone has time.
> 
>   We could also do the upgrade to jdk 8 with Tika 2.0.
> 
>   If this sounds reasonable, I propose creating a 1.x branch from trunk 
> for 1.x maintenance and then reworking trunk to the 2.x structure that Bob 
> Paulin so elegantly worked out.  I figure we can either copy/paste from trunk 
> to the current 2.x (and _hope_ we get all the updates) or use Bob's 2.0 as a 
> model for restructuring trunk.  At this point, I'd prefer the second option.  
> The key here is to switch "trunk" to 2.0 so that we all have the mindset that 
> 2.0 is what we're focused on.
> 
>    The main benefit of this proposal is that we'd have a more modular 
> Tika soon.
> 
>What do you think?
> 
>  Best,
> 
>Tim
> 
>
>
>







RE: Tika 2.0?

2017-09-12 Thread Allison, Timothy B.
I'd strongly advocate for 2.  I _think_ the hard work was laying out the 
general structure and adding the ProxyParser workaround.  Copying and 
pasting/reworking into that structure will be: 

A) far less dangerous than 1 
And
B) we'll have a cleaner history.

On A), I know that we didn't add some major components including: 
configurability of parsers, completely cleaned up logging, numerous bug fixes 
and even entire modules (tika-dl).

On B), there were a few times where I "caught a parser up" in 2.0 not by 
individual commits based on the original history but based on a copy/paste from 
the contemporaneous master.  This obliterated the history of some commits on 
the 2.0 branch and would force us to look back at master.

-Original Message-
From: Bob Paulin [mailto:b...@bobpaulin.com] 
Sent: Monday, September 11, 2017 9:48 PM
To: dev@tika.apache.org
Subject: Re: Tika 2.0?

Just so it's clear are we going to:

1) Rename the 2.0 branch over to master

or

2) Re-apply the changes on master. 

I recall Chris' preference was 1, which would be quicker.  However, there are 
very likely missed patches.  2 will be more time consuming but it would be more 
likely to include all the most recent code.  I'm open to either.  Not sure how 
far out of date 2.0 branch is so I defer to Tim on the risk of going with #1.


- Bob


On 9/11/2017 5:15 PM, Chris Mattmann wrote:
> +1000
>
>
>
> On 9/11/17, 12:03 PM, "Allison, Timothy B."  wrote:
>
> Y, well, I didn't say _which_ September...
> 
> Given my limited availability to work on this in Sept and POI's decision 
> to move to Java 1.8, I propose releasing Tika 1.17 after the release of POI 
> 3.17 and PDFBox 2.0.8.  This would be the last version of Tika at the Java 
> 1.7 level, and then we bump the Java requirement to 1.8, switch master to the 
> 2.0 layout and create a 1.x maintenance branch (with Java 1.8) for quick 
> critical bug fixes/security vulnerabilities until we release 2.0.
> 
> What do you all think?
> 
>  
> -Original Message-
> From: Allison, Timothy B. [mailto:talli...@mitre.org] 
> Sent: Monday, August 28, 2017 9:33 AM
> To: dev@tika.apache.org
> Subject: Tika 2.0?
> 
> All,
> 
>   We're getting some increasing deltas btwn the 2.0 and trunk branches.  
> Many of these are my fault; I gave up making updates to 2.0 around April/May, 
> I think.
> 
>   What would people think of punting on some of the desired goals of 2.0 
> (e.g. chaining parsers, more structured but still simple metadata) and 
> releasing 2.0 soonish...say 2.0-BETA end of September?
> 
>   We've been able to make some major improvements to Tika without 
> breaking backwards compatibility.  We _might_ be able to do that with the 
> outstanding issues for 2.0 when someone has time.
> 
>   We could also do the upgrade to jdk 8 with Tika 2.0.
> 
>   If this sounds reasonable, I propose creating a 1.x branch from trunk 
> for 1.x maintenance and then reworking trunk to the 2.x structure that Bob 
> Paulin so elegantly worked out.  I figure we can either copy/paste from trunk 
> to the current 2.x (and _hope_ we get all the updates) or use Bob's 2.0 as a 
> model for restructuring trunk.  At this point, I'd prefer the second option.  
> The key here is to switch "trunk" to 2.0 so that we all have the mindset that 
> 2.0 is what we're focused on.
> 
>The main benefit of this proposal is that we'd have a more modular 
> Tika soon.
> 
>What do you think?
> 
>  Best,
> 
>Tim
> 
>
>
>