[jira] [Created] (TIKA-2565) Upgrade edu.ucar dependencies to 4.6.11

2018-02-02 Thread Lewis John McGibbney (JIRA)
Lewis John McGibbney created TIKA-2565:
--

 Summary: Upgrade edu.ucar dependencies to 4.6.11
 Key: TIKA-2565
 URL: https://issues.apache.org/jira/browse/TIKA-2565
 Project: Tika
  Issue Type: Wish
  Components: parser
Affects Versions: 1.17
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 2.0


An [existing PR|https://github.com/apache/tika/pull/212/files] proposes 
upgrading the netcdf4-java dependency; however, it does not address the issue.
This PR will add the correct Maven repository configuration and then make the 
upgrade(s).
https://www.unidata.ucar.edu/software/thredds/current/netcdf-java/reference/BuildDependencies.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TIKA-2562) tika server parse HTML removes DIVs around hyperlink & adds shape

2018-02-02 Thread NW Brad (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16351224#comment-16351224
 ] 

NW Brad commented on TIKA-2562:
---

I was doing some research on this today, and this may not be a function of Tika. 
 I think it is probably the SAXTransformerFactory (javax.xml.transform) that is 
making the change.  At least, I could not find any code in Tika that did it 
directly.  But anything I ran through the SAXTransformerFactory came back with 
void (empty) elements written as self-closing start tags, as shown 
below:

http://www.google.com;> *becomes* http://www.google.com*"/>*

and  *becomes* .

From an XML standpoint the converted syntax is correct, but the anchor tag 
code, while correct in XML, does not appear to work correctly as HTML in the 
current versions of either Chrome or Firefox.  So, converting HTML via Tika in 
this situation generates bad HTML for the examples I have.

I believe the SAXTransformerFactory is also deleting the <div> that is around 
the "empty" anchor tag, since a div around nothing may not be considered 
relevant.  At least that is what I speculate...
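
The self-closing behaviour described above can be reproduced outside Tika with the JDK's own JAXP classes. The following is a minimal sketch, not Tika's code: the class name EmptyAnchorDemo and its serialize helper are hypothetical, and it assumes the JAXP implementation bundled with the JDK. It feeds one empty anchor element into a TransformerHandler and serializes it once with the "xml" output method and once with the "html" output method.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

public class EmptyAnchorDemo {

    /** Serializes a single empty anchor element using the given JAXP output method. */
    public static String serialize(String method) throws Exception {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        // Output properties are set before the result is attached.
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, method);
        handler.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        StringWriter out = new StringWriter();
        handler.setResult(new StreamResult(out));

        AttributesImpl atts = new AttributesImpl();
        atts.addAttribute("", "href", "href", "CDATA", "http://www.google.com");

        // An anchor with no content: start tag immediately followed by end tag.
        handler.startDocument();
        handler.startElement("", "a", "a", atts);
        handler.endElement("", "a", "a");
        handler.endDocument();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("xml : " + serialize("xml"));
        System.out.println("html: " + serialize("html"));
    }
}
```

With the JDK's bundled serializer, the "xml" run is expected to collapse the element into an empty-element tag, while the "html" run keeps an explicit end tag. Whether this is what happens inside tika-server for this issue is not established here; the sketch only shows that the output method alone can account for the difference.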

> tika server parse HTML removes DIVs around hyperlink & adds shape
> -
>
> Key: TIKA-2562
> URL: https://issues.apache.org/jira/browse/TIKA-2562
> Project: Tika
>  Issue Type: Bug
>  Components: gui, parser, server
>Affects Versions: 1.17
>Reporter: NW Brad
>Priority: Major
> Attachments: tika_adds_shape_to_hyperlink.html
>
>
> Hyperlinks in a HTML document that are parsed via tika server:
> curl -X PUT --upload-file tika_adds_shape_to_hyperlink.html 
> [http://localhost:9998/tika] --header "Accept: text/html"
> sent:
> 
>   href="http://www.google.com;>[http://www.google.com|http://www.google.com/]
>  
> received back:
>  href="http://www.google.com;>[http://www.google.com|http://www.google.com/]
>  
> Divs are gone and a shape has been added
>  





[jira] [Created] (TIKA-2564) Tika client cannot extract files from embedded archive formats

2018-02-02 Thread Marc Prud'hommeaux (JIRA)
Marc Prud'hommeaux created TIKA-2564:


 Summary: Tika client cannot extract files from embedded archive 
formats
 Key: TIKA-2564
 URL: https://issues.apache.org/jira/browse/TIKA-2564
 Project: Tika
  Issue Type: Bug
 Environment: Mac OS 10.13.3 (17D47)

17:42 ext$ java -version
java version "9.0.1"
Java(TM) SE Runtime Environment (build 9.0.1+11)
Java HotSpot(TM) 64-Bit Server VM (build 9.0.1+11, mixed mode)

17:42 ext$ uname -a
Darwin bix.local 17.4.0 Darwin Kernel Version 17.4.0: Sun Dec 17 09:19:54 PST 
2017; root:xnu-4570.41.2~1/RELEASE_X86_64 x86_64

Reporter: Marc Prud'hommeaux


 

This may be related to TIKA-2395. When trying to extract the files from 
tika/tika-parsers/src/test/resources/test-documents/test-documents.tgz with:

% coursier launch org.apache.tika:tika-app:1.17 --main 
org.apache.tika.cli.TikaCLI -- --extract test-documents.tgz

I see the exception:

Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pkg.CompressorParser@62628e78
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:205)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:486)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:145)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:564)
    at coursier.cli.qR.a(Unknown Source)
    at coursier.cli.qQ.j(Unknown Source)
    at coursier.cli.qW.a(Unknown Source)
    at d.h.a.c(Unknown Source)
    at b.b.c_(Unknown Source)
    at d.b.d.E.g(Unknown Source)
    at d.b.e.aW.g(Unknown Source)
    at d.b.f.b.aa.a(Unknown Source)
    at coursier.cli.qQ.b(Unknown Source)
    at coursier.cli.Q.b(Unknown Source)
    at b.J.c_(Unknown Source)
    at d.F.h(Unknown Source)
    at b.F.a(Unknown Source)
    at coursier.cli.Coursier.main(Unknown Source)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:564)
    at coursier.Bootstrap.main(Bootstrap.java:428)
Caused by: java.io.IOException: mark/reset not supported
    at java.base/java.io.InputStream.reset(InputStream.java:474)
    at org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:444)
    at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:84)
    at org.apache.tika.cli.TikaCLI$FileEmbeddedDocumentExtractor.parseEmbedded(TikaCLI.java:1045)
    at org.apache.tika.parser.pkg.CompressorParser.parse(CompressorParser.java:222)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    ... 28 more

 

However, I can browse the document fine using:

% coursier launch org.apache.tika:tika-app:1.17 --main 
org.apache.tika.cli.TikaCLI -- test-documents.tgz

This issue affects: test-documents.rar, test-documents.tar.Z, 
test-documents.tbz2, and test-documents.tgz

But it does not affect test-documents.7z, test-documents.cab, 
test-documents.ddf, test-documents.dmg, test-documents.tar, or 
test-documents.zip

This makes me suspect that it has something to do with extracting files from 
packages that are embedded in other archive formats.
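
The "mark/reset not supported" message in the trace is the one raised by the base java.io.InputStream.reset() when a stream does not override mark/reset. The following is a minimal sketch of that failure mode, independent of Tika: the class MarkResetDemo and its helpers plainStream and peek are hypothetical names used only for illustration. It peeks at the first byte of a stream that lacks mark/reset support, then repeats the peek after wrapping the same kind of stream in a BufferedInputStream.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class MarkResetDemo {

    /** A stream that only implements read(), so it inherits InputStream's default mark/reset. */
    public static InputStream plainStream(byte[] data) {
        ByteArrayInputStream source = new ByteArrayInputStream(data);
        return new InputStream() {
            @Override
            public int read() {
                return source.read();
            }
        };
    }

    /** Reads one byte and rewinds, roughly the way a type detector inspects a header. */
    public static int peek(InputStream in) throws IOException {
        in.mark(1);
        try {
            return in.read();
        } finally {
            in.reset(); // throws IOException if the stream does not support mark/reset
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {0x1f, (byte) 0x8b};

        try {
            peek(plainStream(data));
        } catch (IOException e) {
            System.out.println("plain stream: " + e.getMessage());
        }

        // BufferedInputStream adds mark/reset support on top of any stream.
        InputStream buffered = new BufferedInputStream(plainStream(data));
        System.out.println("buffered peek: " + peek(buffered));
        System.out.println("next read after reset: " + buffered.read());
    }
}
```

Whether buffering the embedded stream before detection is the right fix inside Tika is a separate question that this sketch does not answer; it only illustrates why a detector that calls reset() fails on a stream without mark support.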

 





[jira] [Commented] (TIKA-2562) tika server parse HTML removes DIVs around hyperlink & adds shape

2018-02-02 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350822#comment-16350822
 ] 

Tim Allison commented on TIKA-2562:
---

I'll take a look.  This will require some digging.

> tika server parse HTML removes DIVs around hyperlink & adds shape
> -
>
> Key: TIKA-2562
> URL: https://issues.apache.org/jira/browse/TIKA-2562
> Project: Tika
>  Issue Type: Bug
>  Components: gui, parser, server
>Affects Versions: 1.17
>Reporter: NW Brad
>Priority: Major
> Attachments: tika_adds_shape_to_hyperlink.html
>
>
> Hyperlinks in a HTML document that are parsed via tika server:
> curl -X PUT --upload-file tika_adds_shape_to_hyperlink.html 
> [http://localhost:9998/tika] --header "Accept: text/html"
> sent:
> 
>   href="http://www.google.com;>[http://www.google.com|http://www.google.com/]
>  
> received back:
>  href="http://www.google.com;>[http://www.google.com|http://www.google.com/]
>  
> Divs are gone and a shape has been added
>  





[jira] [Comment Edited] (TIKA-2562) tika server parse HTML removes DIVs around hyperlink & adds shape

2018-02-02 Thread NW Brad (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350634#comment-16350634
 ] 

NW Brad edited comment on TIKA-2562 at 2/2/18 4:51 PM:
---

Thanks.  I checked it out and tagsoup is definitely adding the shape.  I tried 
parsing the file using the tagsoup command line, and tagsoup added the shape.  
However, it appears that the <div> removal is coming from Tika.

Tagsoup parse results:


 http://www.google.com;>[http://www.google.com|http://www.google.com/]
 

Tika parse results:

http://www.google.com;>[http://www.google.com|http://www.google.com/]

The div is gone...

I also noted another problem with parsing that is coming from Tika and not 
tagsoup when dealing with hidden anchors/hyperlinks:

original:

http://www.google.com;>

Tagsoup results:

http://www.google.com*;>*

Tika results:

http://www.google.com*"/>*

Tika seems to alter the anchor by removing the end-tag and replacing it with an 
empty-element tag.  This occurs on other tags as well, most common being 
 with .

This may not seem to be a big deal, but with anchors it is causing a problem 
in Chrome and Firefox: the anchor style bleeds into the content immediately 
following the anchor.

Is there a way in Tika to turn off this feature?  If not, do you know where in 
the code this occurs?

Thanks.

was (Author: nwbrad):
Thanks.  I checked it out and tagsoup is definitely adding the shape.  I tried 
parsing the file using tagsoup command line, and tagsoup is definitely the 
shape.  However, it appears that the  removal is coming from tika.

Tagsoup parse results:


 http://www.google.com;>[http://www.google.com|http://www.google.com/]
 

Tika parse results:

http://www.google.com;>[http://www.google.com|http://www.google.com/]

The div is gone...

I also noted another problem with parsing that is coming from Tika and not 
tagsoup when dealing with hidden anchors/hyperlinks:

original:

http://www.google.com;>

Tagsoup:results

http://www.google.com*;>*

Tika results:

http://www.google.com*"/>*

Tika seems to alter anchor by removing the end-tag and replacing it with an 
empty-element tag.  This occurs on other tags as well, most common being 
 with .

This may not seem to be a big deal, but with anchors it is causing a problem 
with Chrome and Firefox and the anchor style bleeds into content immediately 
following the anchor.

Is there a way in Tika to turn off this feature?  If not, do you know where in 
the code this occurs. 

Thanks.

> tika server parse HTML removes DIVs around hyperlink & adds shape
> -
>
> Key: TIKA-2562
> URL: https://issues.apache.org/jira/browse/TIKA-2562
> Project: Tika
>  Issue Type: Bug
>  Components: gui, parser, server
>Affects Versions: 1.17
>Reporter: NW Brad
>Priority: Major
> Attachments: tika_adds_shape_to_hyperlink.html
>
>
> Hyperlinks in a HTML document that are parsed via tika server:
> curl -X PUT --upload-file tika_adds_shape_to_hyperlink.html 
> [http://localhost:9998/tika] --header "Accept: text/html"
> sent:
> 
>   href="http://www.google.com;>[http://www.google.com|http://www.google.com/]
>  
> received back:
>  href="http://www.google.com;>[http://www.google.com|http://www.google.com/]
>  
> Divs are gone and a shape has been added
>  





[jira] [Comment Edited] (TIKA-2562) tika server parse HTML removes DIVs around hyperlink & adds shape

2018-02-02 Thread NW Brad (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350634#comment-16350634
 ] 

NW Brad edited comment on TIKA-2562 at 2/2/18 4:50 PM:
---

Thanks.  I checked it out and tagsoup is definitely adding the shape.  I tried 
parsing the file using tagsoup command line, and tagsoup is definitely the 
shape.  However, it appears that the  removal is coming from tika.

Tagsoup parse results:


 http://www.google.com;>[http://www.google.com|http://www.google.com/]
 

Tika parse results:

http://www.google.com;>[http://www.google.com|http://www.google.com/]

The div is gone...

I also noted another problem with parsing that is coming from Tika and not 
tagsoup when dealing with hidden anchors/hyperlinks:

original:

http://www.google.com;>

Tagsoup:results

http://www.google.com*;>*

Tika results:

http://www.google.com*"/>*

Tika seems to alter anchor by removing the end-tag and replacing it with an 
empty-element tag.  This occurs on other tags as well, most common being 
 with .

This may not seem to be a big deal, but with anchors it is causing a problem 
with Chrome and Firefox and the anchor style bleeds into content immediately 
following the anchor.

Is there a way in Tika to turn off this feature?  If not, do you know where in 
the code this occurs. 

Thanks.


was (Author: nwbrad):
Thanks.  I check it out, it and tagsoup is definitely adding the shape.  I 
tried parsing the file using tagsoup command line, and tagsoup is definitely 
the shape.  However, it appears that the  removal is coming from tika.

Tagsoup parse results:


 http://www.google.com;>http://www.google.com
 

Tika parse results:

http://www.google.com;>http://www.google.com

The div is gone...

I also noted another problem with parsing that is coming from Tika and not 
tagsoup when dealing with hidden anchors/hyperlinks:

original:

http://www.google.com;>

Tagsoup:results

http://www.google.com*;>*

Tika results:

http://www.google.com*"/>*

Tika seems to alter anchor by removing the end-tag and replacing it with an 
empty-element tag.  This occurs on other tags as well, most common being 
 with .

This may not seem to be a big deal, but with anchors it is causing a problem 
with Chrome and Firefox and the anchor style bleeds into content immediately 
following the anchor.

Is there a way in Tika to turn off this feature?  If not, do you know where in 
the code this occurs. 

Thanks.

> tika server parse HTML removes DIVs around hyperlink & adds shape
> -
>
> Key: TIKA-2562
> URL: https://issues.apache.org/jira/browse/TIKA-2562
> Project: Tika
>  Issue Type: Bug
>  Components: gui, parser, server
>Affects Versions: 1.17
>Reporter: NW Brad
>Priority: Major
> Attachments: tika_adds_shape_to_hyperlink.html
>
>
> Hyperlinks in a HTML document that are parsed via tika server:
> curl -X PUT --upload-file tika_adds_shape_to_hyperlink.html 
> [http://localhost:9998/tika] --header "Accept: text/html"
> sent:
> 
>   href="http://www.google.com;>[http://www.google.com|http://www.google.com/]
>  
> received back:
>  href="http://www.google.com;>[http://www.google.com|http://www.google.com/]
>  
> Divs are gone and a shape has been added
>  





[jira] [Commented] (TIKA-2562) tika server parse HTML removes DIVs around hyperlink & adds shape

2018-02-02 Thread NW Brad (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350634#comment-16350634
 ] 

NW Brad commented on TIKA-2562:
---

Thanks.  I check it out, it and tagsoup is definitely adding the shape.  I 
tried parsing the file using tagsoup command line, and tagsoup is definitely 
the shape.  However, it appears that the  removal is coming from tika.

Tagsoup parse results:


 http://www.google.com;>http://www.google.com
 

Tika parse results:

http://www.google.com;>http://www.google.com

The div is gone...

I also noted another problem with parsing that is coming from Tika and not 
tagsoup when dealing with hidden anchors/hyperlinks:

original:

http://www.google.com;>

Tagsoup:results

http://www.google.com*;>*

Tika results:

http://www.google.com*"/>*

Tika seems to alter anchor by removing the end-tag and replacing it with an 
empty-element tag.  This occurs on other tags as well, most common being 
 with .

This may not seem to be a big deal, but with anchors it is causing a problem 
with Chrome and Firefox and the anchor style bleeds into content immediately 
following the anchor.

Is there a way in Tika to turn off this feature?  If not, do you know where in 
the code this occurs. 

Thanks.

> tika server parse HTML removes DIVs around hyperlink & adds shape
> -
>
> Key: TIKA-2562
> URL: https://issues.apache.org/jira/browse/TIKA-2562
> Project: Tika
>  Issue Type: Bug
>  Components: gui, parser, server
>Affects Versions: 1.17
>Reporter: NW Brad
>Priority: Major
> Attachments: tika_adds_shape_to_hyperlink.html
>
>
> Hyperlinks in a HTML document that are parsed via tika server:
> curl -X PUT --upload-file tika_adds_shape_to_hyperlink.html 
> [http://localhost:9998/tika] --header "Accept: text/html"
> sent:
> 
>   href="http://www.google.com;>[http://www.google.com|http://www.google.com/]
>  
> received back:
>  href="http://www.google.com;>[http://www.google.com|http://www.google.com/]
>  
> Divs are gone and a shape has been added
>  





[jira] [Updated] (TIKA-2563) Extract embedded files in HTML

2018-02-02 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2563:
--
Description: Files (esp images) and other objects can be embedded in 
html/css/javascript with the {{data: uri scheme}}.  We should extract those 
like any other embedded file.  (was: Files (esp images) can be base64 encoded 
in HTML files.  We should extract those like any other embedded file.)

> Extract embedded files in HTML
> --
>
> Key: TIKA-2563
> URL: https://issues.apache.org/jira/browse/TIKA-2563
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Trivial
> Attachments: consumentenbond.html, testHTML_embedded_img.html
>
>
> Files (esp images) and other objects can be embedded in html/css/javascript 
> with the {{data: uri scheme}}.  We should extract those like any other 
> embedded file.
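
For readers unfamiliar with the data: URI scheme mentioned in the description, the general shape defined by RFC 2397 is data:[<mediatype>][;base64],<data>. The following is a minimal sketch of pulling the media type and the decoded bytes out of such a URI. It is not Tika's implementation: the class DataUriDemo and its methods are hypothetical, and it deliberately handles only the base64 form.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class DataUriDemo {

    /** Returns the part between "data:" and the first comma, e.g. "image/png;base64". */
    private static String header(String uri) {
        if (!uri.startsWith("data:")) {
            throw new IllegalArgumentException("not a data: URI");
        }
        int comma = uri.indexOf(',');
        if (comma < 0) {
            throw new IllegalArgumentException("data: URI has no comma separator");
        }
        return uri.substring("data:".length(), comma);
    }

    /** Returns the declared media type, or text/plain when none is given. */
    public static String mediaType(String uri) {
        String type = header(uri).split(";", 2)[0];
        return type.isEmpty() ? "text/plain" : type;
    }

    /** Decodes the payload of a base64 data: URI into raw bytes. */
    public static byte[] payload(String uri) {
        if (!header(uri).endsWith(";base64")) {
            // Percent-encoded payloads are out of scope for this sketch.
            throw new UnsupportedOperationException("only ;base64 payloads are handled");
        }
        String data = uri.substring(uri.indexOf(',') + 1);
        return Base64.getDecoder().decode(data);
    }

    public static void main(String[] args) {
        String uri = "data:text/plain;base64,SGVsbG8=";
        System.out.println(mediaType(uri));
        System.out.println(new String(payload(uri), StandardCharsets.UTF_8));
    }
}
```

The decoded bytes are what an extractor would hand on as an embedded file; locating such URIs inside HTML, CSS, or JavaScript is a separate step not shown here.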





[jira] [Updated] (TIKA-2563) Extract embedded files in HTML

2018-02-02 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2563:
--
Description: Files (esp images) and other objects can be embedded in 
html/css/javascript with the [data: uri 
scheme|https://en.wikipedia.org/wiki/Data_URI_scheme].  We should extract those 
like any other embedded file.  (was: Files (esp images) and other objects can 
be embedded in html/css/javascript with the {{data: uri scheme}}.  We should 
extract those like any other embedded file.)

> Extract embedded files in HTML
> --
>
> Key: TIKA-2563
> URL: https://issues.apache.org/jira/browse/TIKA-2563
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Trivial
> Attachments: consumentenbond.html, testHTML_embedded_img.html
>
>
> Files (esp images) and other objects can be embedded in html/css/javascript 
> with the [data: uri scheme|https://en.wikipedia.org/wiki/Data_URI_scheme].  
> We should extract those like any other embedded file.





[jira] [Commented] (TIKA-2563) Extract embedded files in HTML

2018-02-02 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350614#comment-16350614
 ] 

Markus Jelsma commented on TIKA-2563:
-

Ah, thanks :)

> Extract embedded files in HTML
> --
>
> Key: TIKA-2563
> URL: https://issues.apache.org/jira/browse/TIKA-2563
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Trivial
> Attachments: consumentenbond.html, testHTML_embedded_img.html
>
>
> Files (esp images) can be base64 encoded in HTML files.  We should extract 
> those like any other embedded file.





[jira] [Commented] (TIKA-2563) Extract embedded files in HTML

2018-02-02 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350606#comment-16350606
 ] 

Tim Allison commented on TIKA-2563:
---

Right.  Sorry.  I meant the {{testHTML_embedded_img.html}}, NOT the file you 
shared.  Thank you, again!

> Extract embedded files in HTML
> --
>
> Key: TIKA-2563
> URL: https://issues.apache.org/jira/browse/TIKA-2563
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Trivial
> Attachments: consumentenbond.html, testHTML_embedded_img.html
>
>
> Files (esp images) can be base64 encoded in HTML files.  We should extract 
> those like any other embedded file.





[jira] [Commented] (TIKA-2563) Extract embedded files in HTML

2018-02-02 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350604#comment-16350604
 ] 

Markus Jelsma commented on TIKA-2563:
-

I am not sure if ASL 2.0 friendly would apply. I took it some time ago from a 
live page of a Dutch non-profit association, for test purposes. 

> Extract embedded files in HTML
> --
>
> Key: TIKA-2563
> URL: https://issues.apache.org/jira/browse/TIKA-2563
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Trivial
> Attachments: consumentenbond.html, testHTML_embedded_img.html
>
>
> Files (esp images) can be base64 encoded in HTML files.  We should extract 
> those like any other embedded file.





[jira] [Commented] (TIKA-2563) Extract embedded files in HTML

2018-02-02 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350593#comment-16350593
 ] 

Tim Allison commented on TIKA-2563:
---

ASL 2.0 friendly example file based on the example file kindly supplied by 
[~markus17]

> Extract embedded files in HTML
> --
>
> Key: TIKA-2563
> URL: https://issues.apache.org/jira/browse/TIKA-2563
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Trivial
> Attachments: consumentenbond.html, testHTML_embedded_img.html
>
>
> Files (esp images) can be base64 encoded in HTML files.  We should extract 
> those like any other embedded file.





[jira] [Updated] (TIKA-2563) Extract embedded files in HTML

2018-02-02 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2563:
--
Attachment: testHTML_embedded_img.html

> Extract embedded files in HTML
> --
>
> Key: TIKA-2563
> URL: https://issues.apache.org/jira/browse/TIKA-2563
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Trivial
> Attachments: consumentenbond.html, testHTML_embedded_img.html
>
>
> Files (esp images) can be base64 encoded in HTML files.  We should extract 
> those like any other embedded file.





[jira] [Commented] (TIKA-1599) Switch from TagSoup to JSoup

2018-02-02 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350556#comment-16350556
 ] 

Tim Allison commented on TIKA-1599:
---

>Tim, if attached file is what you are looking for, i've got about 80 specimens 
>that came up when grepping for base64.

W00t!  Thank you!  That one should do...and, duh, grep for base64.  Thank you!

>On topic, our parser on top of Tika relies on a custom ContentHandler 
>implementation. We (my company) would not be too happy if we would have to 
>rewrite the whole thing. Same goes for Apache Nutch.

Oh...that's good to know...so I guess we're back to the option of supporting 
both Tagsoup and JSoup with users specifying via tika-config.xml which parser 
to use?

> Switch from TagSoup to JSoup
> 
>
> Key: TIKA-1599
> URL: https://issues.apache.org/jira/browse/TIKA-1599
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.7, 1.8
>Reporter: Ken Krugler
>Assignee: Ken Krugler
>Priority: Minor
> Attachments: TIKA-1599-crazy-files.tar.gz, consumentenbond.html, 
> tagsoup_vs_jsoup_reports.zip
>
>
> There are several Tika issues related to how TagSoup cleans up HTML 
> ([TIKA-381], [TIKA-985], maybe [TIKA-715]), but TagSoup doesn't seem to be 
> under active development.
> On the other hand I know of several projects that are now using 
> [JSoup|https://github.com/jhy/jsoup], which is an active project (albeit only 
> one main contributor) under the MIT license.
> I haven't looked into how hard it would be to switch this dependency.





[jira] [Commented] (TIKA-2563) Extract embedded files in HTML

2018-02-02 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350555#comment-16350555
 ] 

Tim Allison commented on TIKA-2563:
---

Attached example file supplied by [~markus17] on TIKA-1599.  Thank you!

> Extract embedded files in HTML
> --
>
> Key: TIKA-2563
> URL: https://issues.apache.org/jira/browse/TIKA-2563
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Trivial
> Attachments: consumentenbond.html
>
>
> Files (esp images) can be base64 encoded in HTML files.  We should extract 
> those like any other embedded file.





[jira] [Updated] (TIKA-2563) Extract embedded files in HTML

2018-02-02 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2563:
--
Attachment: consumentenbond.html

> Extract embedded files in HTML
> --
>
> Key: TIKA-2563
> URL: https://issues.apache.org/jira/browse/TIKA-2563
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Trivial
> Attachments: consumentenbond.html
>
>
> Files (esp images) can be base64 encoded in HTML files.  We should extract 
> those like any other embedded file.





[jira] [Commented] (TIKA-2490) Turn off stderr warnings in Tika-app

2018-02-02 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350550#comment-16350550
 ] 

Tim Allison commented on TIKA-2490:
---

Y, sorry.  We could change this behavior back to ignore missing 
dependencies...but I think [~pascal.essiembre] and [~mcaruanagalizia] had good 
arguments for why we should include warnings.

> Turn off stderr warnings in Tika-app
> 
>
> Key: TIKA-2490
> URL: https://issues.apache.org/jira/browse/TIKA-2490
> Project: Tika
>  Issue Type: Bug
>  Components: app
>Affects Versions: 1.16
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Trivial
> Fix For: 1.17
>
> Attachments: NUTCH-2439-1.17.patch
>
>
> Let's get rid of the stderr messages in tika-app and confirm that users can 
> turn off warnings via tika-config.xml





[jira] [Created] (TIKA-2563) Extract embedded files in HTML

2018-02-02 Thread Tim Allison (JIRA)
Tim Allison created TIKA-2563:
-

 Summary: Extract embedded files in HTML
 Key: TIKA-2563
 URL: https://issues.apache.org/jira/browse/TIKA-2563
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison


Files (esp images) can be base64 encoded in HTML files.  We should extract 
those like any other embedded file.





[jira] [Commented] (TIKA-1599) Switch from TagSoup to JSoup

2018-02-02 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350545#comment-16350545
 ] 

Markus Jelsma commented on TIKA-1599:
-

On topic, our parser on top of Tika relies on a custom ContentHandler 
implementation. We (my company) would not be too happy if we would have to 
rewrite the whole thing. Same goes for Apache Nutch.

> Switch from TagSoup to JSoup
> 
>
> Key: TIKA-1599
> URL: https://issues.apache.org/jira/browse/TIKA-1599
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.7, 1.8
>Reporter: Ken Krugler
>Assignee: Ken Krugler
>Priority: Minor
> Attachments: TIKA-1599-crazy-files.tar.gz, consumentenbond.html, 
> tagsoup_vs_jsoup_reports.zip
>
>
> There are several Tika issues related to how TagSoup cleans up HTML 
> ([TIKA-381], [TIKA-985], maybe [TIKA-715]), but TagSoup doesn't seem to be 
> under active development.
> On the other hand I know of several projects that are now using 
> [JSoup|https://github.com/jhy/jsoup], which is an active project (albeit only 
> one main contributor) under the MIT license.
> I haven't looked into how hard it would be to switch this dependency.





[jira] [Commented] (TIKA-2490) Turn off stderr warnings in Tika-app

2018-02-02 Thread Andrei Rebegea (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350543#comment-16350543
 ] 

Andrei Rebegea commented on TIKA-2490:
--

OK. Thanks for the answer. 
So, "Is this still supposed to happen?" 
Answer: Yes, unless you specify a tika-config.xml 
with this property set:  



> Turn off stderr warnings in Tika-app
> 
>
> Key: TIKA-2490
> URL: https://issues.apache.org/jira/browse/TIKA-2490
> Project: Tika
>  Issue Type: Bug
>  Components: app
>Affects Versions: 1.16
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Trivial
> Fix For: 1.17
>
> Attachments: NUTCH-2439-1.17.patch
>
>
> Let's get rid of the stderr messages in tika-app and confirm that users can 
> turn off warnings via tika-config.xml





[jira] [Commented] (TIKA-1599) Switch from TagSoup to JSoup

2018-02-02 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350541#comment-16350541
 ] 

Markus Jelsma commented on TIKA-1599:
-

Tim, if the attached file is what you are looking for, I've got about 80 specimens 
that came up when grepping for base64.

> Switch from TagSoup to JSoup
> 
>
> Key: TIKA-1599
> URL: https://issues.apache.org/jira/browse/TIKA-1599
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.7, 1.8
>Reporter: Ken Krugler
>Assignee: Ken Krugler
>Priority: Minor
> Attachments: TIKA-1599-crazy-files.tar.gz, consumentenbond.html, 
> tagsoup_vs_jsoup_reports.zip
>
>
> There are several Tika issues related to how TagSoup cleans up HTML 
> ([TIKA-381], [TIKA-985], maybe [TIKA-715]), but TagSoup doesn't seem to be 
> under active development.
> On the other hand I know of several projects that are now using 
> [JSoup|https://github.com/jhy/jsoup], which is an active project (albeit only 
> one main contributor) under the MIT license.
> I haven't looked into how hard it would be to switch this dependency.





[jira] [Updated] (TIKA-1599) Switch from TagSoup to JSoup

2018-02-02 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated TIKA-1599:

Attachment: consumentenbond.html

> Switch from TagSoup to JSoup
> 
>
> Key: TIKA-1599
> URL: https://issues.apache.org/jira/browse/TIKA-1599
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.7, 1.8
>Reporter: Ken Krugler
>Assignee: Ken Krugler
>Priority: Minor
> Attachments: TIKA-1599-crazy-files.tar.gz, consumentenbond.html, 
> tagsoup_vs_jsoup_reports.zip
>
>
> There are several Tika issues related to how TagSoup cleans up HTML 
> ([TIKA-381], [TIKA-985], maybe [TIKA-715]), but TagSoup doesn't seem to be 
> under active development.
> On the other hand I know of several projects that are now using 
> [JSoup|https://github.com/jhy/jsoup], which is an active project (albeit only 
> one main contributor) under the MIT license.
> I haven't looked into how hard it would be to switch this dependency.





[jira] [Commented] (TIKA-2490) Turn off stderr warnings in Tika-app

2018-02-02 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350538#comment-16350538
 ] 

Tim Allison commented on TIKA-2490:
---

Whoa, welcome to modernity. :)

>I am assuming that just by importing the libs and using them (without special 
>configuration), we should not get these warnings.

Unfortunately, no, those warnings are supposed to be evident unless you turn 
them off...see TIKA-2232.

 

From what I can tell from your links, you're using the default TikaConfig.  If 
you can specify an actual tika-config.xml file, that should help.
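
As a rough sketch of what "specifying an actual tika-config.xml" could look like, the snippet below writes a minimal config file and checks that it is well-formed. The `service-loader` element with `initializableProblemHandler="ignore"` is an assumption about the setting meant here, since the exact property is not spelled out in this message, and the class and method names are made up for illustration. Only JDK classes are used; handing the file to Tika is left as a comment.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class TikaConfigSketch {
    // Assumed config content: asks the service loader to ignore problems
    // reported while optional parsers initialize (hypothetical choice here).
    static final String CONFIG_XML =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        + "<properties>\n"
        + "  <service-loader initializableProblemHandler=\"ignore\"/>\n"
        + "</properties>\n";

    /** Writes the assumed config to the given path and returns that path. */
    static Path writeConfig(Path target) throws IOException {
        return Files.write(target, CONFIG_XML.getBytes(StandardCharsets.UTF_8));
    }

    /** Parses the file with the JDK XML parser and returns the handler attribute. */
    static String readHandlerSetting(Path config) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder().parse(config.toFile());
        Element loader = (Element) doc.getElementsByTagName("service-loader").item(0);
        return loader.getAttribute("initializableProblemHandler");
    }

    public static void main(String[] args) throws Exception {
        Path config = writeConfig(Files.createTempFile("tika-config", ".xml"));
        System.out.println(config + " -> " + readHandlerSetting(config));
        // With Tika on the classpath, this file would be loaded in place of the
        // default config, e.g. new AutoDetectParser(new TikaConfig(config)).
    }
}
```

In a setup where the TikaConfig is created by a dependency-injection container, the path to such a file would have to be passed to wherever the config object is constructed.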

> Turn off stderr warnings in Tika-app
> 
>
> Key: TIKA-2490
> URL: https://issues.apache.org/jira/browse/TIKA-2490
> Project: Tika
>  Issue Type: Bug
>  Components: app
>Affects Versions: 1.16
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Trivial
> Fix For: 1.17
>
> Attachments: NUTCH-2439-1.17.patch
>
>
> Let's get rid of the stderr messages in tika-app and confirm that users can 
> turn off warnings via tika-config.xml





[jira] [Comment Edited] (TIKA-1599) Switch from TagSoup to JSoup

2018-02-02 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350529#comment-16350529
 ] 

Tim Allison edited comment on TIKA-1599 at 2/2/18 3:39 PM:
---

>DOM could lead to higher memory usage

Y, that's my concern, esp because our CommonCrawl docs are truncated at 1MB so 
we aren't going to see major problems in that corpus.

 

I added [~markus17] 's attached files to our regression corpus, and I've kicked 
off a fresh full run of Tika 1.17 against the corpus.  I've updated my jsoup 
code on my personal fork.  Once the 1.17 run finishes, I'll kick off the jsoup 
fork against the html files.

 

Unrelated topic: does anyone have a shareable example of an html file with a 
base64 (or other) embedded file inside it?  I don't think we're 
currently handling these, and it would be nice to do that.
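
For anyone unsure what such a file looks like: the usual form is a `data:` URI in an attribute such as `src`. The following is only an illustrative sketch that pulls base64 payloads out of markup with a regular expression and decodes them with the JDK. It is not how Tika handles (or would handle) these, the class and method names are hypothetical, and a real implementation would work from the parsed HTML rather than a regex.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DataUriSketch {
    // Matches src="data:<mime>;base64,<payload>" (double-quoted attributes only).
    private static final Pattern DATA_URI = Pattern.compile(
        "src=\"data:([^;\"]+);base64,([A-Za-z0-9+/=]+)\"");

    /** Returns the decoded bytes of every base64 data URI found in the markup. */
    static List<byte[]> extractEmbedded(String html) {
        List<byte[]> found = new ArrayList<>();
        Matcher m = DATA_URI.matcher(html);
        while (m.find()) {
            // group(1) is the declared mime type, group(2) the base64 payload
            found.add(Base64.getDecoder().decode(m.group(2)));
        }
        return found;
    }

    public static void main(String[] args) {
        String html =
            "<html><body><img src=\"data:text/plain;base64,SGVsbG8=\"/></body></html>";
        for (byte[] payload : extractEmbedded(html)) {
            System.out.println(new String(payload, StandardCharsets.UTF_8));
        }
    }
}
```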


was (Author: talli...@mitre.org):
>DOM could lead to higher memory usage

Y, that's my concern, esp because our CommonCrawl docs are truncated at 1MB so 
we aren't going to see major problems in that corpus.

 

I've kicked off a fresh full run of Tika 1.17 against the corpus, and I've 
updated my jsoup code on my personal fork.  Once the 1.17 run finishes, I'll 
kick off the jsoup fork against the html files.

 

Unrelated topic: does anyone have a shareable example of an html file with a 
base64 (or other) embedded file inside it?  I don't think we're 
currently handling these, and it would be nice to do that.

> Switch from TagSoup to JSoup
> 
>
> Key: TIKA-1599
> URL: https://issues.apache.org/jira/browse/TIKA-1599
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.7, 1.8
>Reporter: Ken Krugler
>Assignee: Ken Krugler
>Priority: Minor
> Attachments: TIKA-1599-crazy-files.tar.gz, 
> tagsoup_vs_jsoup_reports.zip
>
>
> There are several Tika issues related to how TagSoup cleans up HTML 
> ([TIKA-381], [TIKA-985], maybe [TIKA-715]), but TagSoup doesn't seem to be 
> under active development.
> On the other hand I know of several projects that are now using 
> [JSoup|https://github.com/jhy/jsoup], which is an active project (albeit only 
> one main contributor) under the MIT license.
> I haven't looked into how hard it would be to switch this dependency.





[jira] [Commented] (TIKA-1599) Switch from TagSoup to JSoup

2018-02-02 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350529#comment-16350529
 ] 

Tim Allison commented on TIKA-1599:
---

>DOM could lead to higher memory usage

Y, that's my concern, esp because our CommonCrawl docs are truncated at 1MB so 
we aren't going to see major problems in that corpus.

 

I've kicked off a fresh full run of Tika 1.17 against the corpus, and I've 
updated my jsoup code on my personal fork.  Once the 1.17 run finishes, I'll 
kick off the jsoup fork against the html files.

 

Unrelated topic: does anyone have a shareable example of an html file with a 
base64 (or other) embedded file inside it?  I don't think we're 
currently handling these, and it would be nice to do that.

> Switch from TagSoup to JSoup
> 
>
> Key: TIKA-1599
> URL: https://issues.apache.org/jira/browse/TIKA-1599
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.7, 1.8
>Reporter: Ken Krugler
>Assignee: Ken Krugler
>Priority: Minor
> Attachments: TIKA-1599-crazy-files.tar.gz, 
> tagsoup_vs_jsoup_reports.zip
>
>
> There are several Tika issues related to how TagSoup cleans up HTML 
> ([TIKA-381], [TIKA-985], maybe [TIKA-715]), but TagSoup doesn't seem to be 
> under active development.
> On the other hand I know of several projects that are now using 
> [JSoup|https://github.com/jhy/jsoup], which is an active project (albeit only 
> one main contributor) under the MIT license.
> I haven't looked into how hard it would be to switch this dependency.





[jira] [Commented] (TIKA-2490) Turn off stderr warnings in Tika-app

2018-02-02 Thread Andrei Rebegea (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350528#comment-16350528
 ] 

Andrei Rebegea commented on TIKA-2490:
--

The short answer is that I don't know the full details, sorry. I have not 
worked on the implementation part - only on the upgrade part. 

We just updated our Tika from version 1.6 (yes, 6, not 16) to version 1.17, so we 
are using Tika just as we did on version 1.6.
I am assuming that just by importing the libs and using them (without special 
configuration), we should not get these warnings.

We have a TikaConfig object that seems to be shared around: 
[here|https://github.com/Alfresco/alfresco-repository/blob/master/src/main/resources/alfresco/content-services-context.xml#L180]
Then we use it, for example 
[here|https://github.com/Alfresco/alfresco-repository/blob/master/src/main/resources/alfresco/content-services-context.xml#L292]
and we instantiate the new AutoDetectParser, for example 
[here|https://github.com/Alfresco/alfresco-repository/blob/master/src/main/java/org/alfresco/repo/content/metadata/TikaAutoMetadataExtracter.java#L78]
 

> Turn off stderr warnings in Tika-app
> 
>
> Key: TIKA-2490
> URL: https://issues.apache.org/jira/browse/TIKA-2490
> Project: Tika
>  Issue Type: Bug
>  Components: app
>Affects Versions: 1.16
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Trivial
> Fix For: 1.17
>
> Attachments: NUTCH-2439-1.17.patch
>
>
> Let's get rid of the stderr messages in tika-app and confirm that users can 
> turn off warnings via tika-config.xml





[jira] [Commented] (TIKA-1599) Switch from TagSoup to JSoup

2018-02-02 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350504#comment-16350504
 ] 

Luis Filipe Nassif commented on TIKA-1599:
--

Hi [~talli...@mitre.org],

Moving to DOM could lead to higher memory usage and might bring memory problems 
like those we experienced with the Office DOM parsers. But given all the 
problems with TagSoup, I think it is worth doing a new evaluation to see whether we 
can get more content and recover the lost metadata (from your previous test).
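
To make the memory concern concrete, here is a small sketch contrasting the two styles using only the JDK's XML parsers on well-formed input. It does not use TagSoup or JSoup, and the class and method names are made up. With SAX the handler sees one event at a time and keeps only a counter; with DOM the whole tree exists in memory before anything can be queried, so memory grows with document size.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class SaxVsDomSketch {
    /** SAX: elements are reported one by one; only the counter is retained. */
    static int countWithSax(byte[] xml) throws Exception {
        final int[] count = {0};
        SAXParserFactory.newInstance().newSAXParser().parse(
            new ByteArrayInputStream(xml),
            new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    count[0]++;
                }
            });
        return count[0];
    }

    /** DOM: the full tree is built first, then queried. */
    static int countWithDom(byte[] xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder().parse(new ByteArrayInputStream(xml));
        return doc.getElementsByTagName("*").getLength();
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = "<html><body><p>a</p><p>b</p></body></html>"
            .getBytes(StandardCharsets.UTF_8);
        System.out.println("SAX elements: " + countWithSax(xml));
        System.out.println("DOM elements: " + countWithDom(xml));
    }
}
```

Both methods give the same count; the difference is in what has to be held in memory while getting there.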

> Switch from TagSoup to JSoup
> 
>
> Key: TIKA-1599
> URL: https://issues.apache.org/jira/browse/TIKA-1599
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.7, 1.8
>Reporter: Ken Krugler
>Assignee: Ken Krugler
>Priority: Minor
> Attachments: TIKA-1599-crazy-files.tar.gz, 
> tagsoup_vs_jsoup_reports.zip
>
>
> There are several Tika issues related to how TagSoup cleans up HTML 
> ([TIKA-381], [TIKA-985], maybe [TIKA-715]), but TagSoup doesn't seem to be 
> under active development.
> On the other hand I know of several projects that are now using 
> [JSoup|https://github.com/jhy/jsoup], which is an active project (albeit only 
> one main contributor) under the MIT license.
> I haven't looked into how hard it would be to switch this dependency.





[jira] [Commented] (TIKA-2490) Turn off stderr warnings in Tika-app

2018-02-02 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350420#comment-16350420
 ] 

Tim Allison commented on TIKA-2490:
---

No, this isn't supposed to happen if you use the example {{tika-config.xml}} 
above.  How are you calling Tika?

> Turn off stderr warnings in Tika-app
> 
>
> Key: TIKA-2490
> URL: https://issues.apache.org/jira/browse/TIKA-2490
> Project: Tika
>  Issue Type: Bug
>  Components: app
>Affects Versions: 1.16
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Trivial
> Fix For: 1.17
>
> Attachments: NUTCH-2439-1.17.patch
>
>
> Let's get rid of the stderr messages in tika-app and confirm that users can 
> turn off warnings via tika-config.xml





[jira] [Updated] (TIKA-2490) Turn off stderr warnings in Tika-app

2018-02-02 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2490:
--
Fix Version/s: 1.17

> Turn off stderr warnings in Tika-app
> 
>
> Key: TIKA-2490
> URL: https://issues.apache.org/jira/browse/TIKA-2490
> Project: Tika
>  Issue Type: Bug
>  Components: app
>Affects Versions: 1.16
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Trivial
> Fix For: 1.17
>
> Attachments: NUTCH-2439-1.17.patch
>
>
> Let's get rid of the stderr messages in tika-app and confirm that users can 
> turn off warnings via tika-config.xml





[jira] [Commented] (TIKA-2490) Turn off stderr warnings in Tika-app

2018-02-02 Thread Andrei Rebegea (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350418#comment-16350418
 ] 

Andrei Rebegea commented on TIKA-2490:
--

Hello,
I am using Tika version 1.17 and am still getting these warnings at startup.
{code}
Feb 02, 2018 11:09:40 AM org.apache.tika.config.InitializableProblemHandler$3 
handleInitializableProblem
WARNING: JBIG2ImageReader not loaded. jbig2 files will be ignored
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.
TIFFImageWriter not loaded. tiff files will not be processed
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.
J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

Feb 02, 2018 11:09:40 AM org.apache.tika.config.InitializableProblemHandler$3 
handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
{code}

I don't see a fix version on this issue, so I don't know whether it made it into 
1.17. My question: *Is this still supposed to happen?*

> Turn off stderr warnings in Tika-app
> 
>
> Key: TIKA-2490
> URL: https://issues.apache.org/jira/browse/TIKA-2490
> Project: Tika
>  Issue Type: Bug
>  Components: app
>Affects Versions: 1.16
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Trivial
> Attachments: NUTCH-2439-1.17.patch
>
>
> Let's get rid of the stderr messages in tika-app and confirm that users can 
> turn off warnings via tika-config.xml





[jira] [Commented] (TIKA-2561) Tika Parser includes oudated/vulnerable version of JSoup

2018-02-02 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350406#comment-16350406
 ] 

Hudson commented on TIKA-2561:
--

SUCCESS: Integrated in Jenkins build Tika-trunk #1429 (See 
[https://builds.apache.org/job/Tika-trunk/1429/])
TIKA-2561 -- update jsoup version in grib parser to avoid xss vuln (tallison: 
[https://github.com/apache/tika/commit/c80241952fa2f515687c6479768d24d7e907653c])
* (edit) tika-parsers/pom.xml


> Tika Parser includes oudated/vulnerable version of JSoup
> 
>
> Key: TIKA-2561
> URL: https://issues.apache.org/jira/browse/TIKA-2561
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.17
>Reporter: Asela
>Priority: Major
> Fix For: 2.0, 1.18
>
>
> org.apache.tika:tika-parsers:1.17 pulls in dependency JSoup 1.7.2.
>  
> JSoup versions older than 1.8.3 have a vulnerability in parsing.
>  
> https://nvd.nist.gov/vuln/detail/CVE-2015-6748





[jira] [Comment Edited] (TIKA-1599) Switch from TagSoup to JSoup

2018-02-02 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350335#comment-16350335
 ] 

Tim Allison edited comment on TIKA-1599 at 2/2/18 1:42 PM:
---

What say we do a fresh eval on our current corpus and then do a clean cut over 
to JSoup for Tika 2.0 if the results are promising?

Big question: are we willing to move to DOM for HTML?  SAX is not yet available 
in JSoup (https://github.com/jhy/jsoup/issues/824).


was (Author: talli...@mitre.org):
What say we do a fresh eval on our current corpus and then do a clean cut over 
to JSoup for Tika 2.0 if the results are promising?

> Switch from TagSoup to JSoup
> 
>
> Key: TIKA-1599
> URL: https://issues.apache.org/jira/browse/TIKA-1599
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.7, 1.8
>Reporter: Ken Krugler
>Assignee: Ken Krugler
>Priority: Minor
> Attachments: TIKA-1599-crazy-files.tar.gz, 
> tagsoup_vs_jsoup_reports.zip
>
>
> There are several Tika issues related to how TagSoup cleans up HTML 
> ([TIKA-381], [TIKA-985], maybe [TIKA-715]), but TagSoup doesn't seem to be 
> under active development.
> On the other hand I know of several projects that are now using 
> [JSoup|https://github.com/jhy/jsoup], which is an active project (albeit only 
> one main contributor) under the MIT license.
> I haven't looked into how hard it would be to switch this dependency.





[jira] [Commented] (TIKA-1599) Switch from TagSoup to JSoup

2018-02-02 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350335#comment-16350335
 ] 

Tim Allison commented on TIKA-1599:
---

What say we do a fresh eval on our current corpus and then do a clean cut over 
to JSoup for Tika 2.0 if the results are promising?

> Switch from TagSoup to JSoup
> 
>
> Key: TIKA-1599
> URL: https://issues.apache.org/jira/browse/TIKA-1599
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.7, 1.8
>Reporter: Ken Krugler
>Assignee: Ken Krugler
>Priority: Minor
> Attachments: TIKA-1599-crazy-files.tar.gz, 
> tagsoup_vs_jsoup_reports.zip
>
>
> There are several Tika issues related to how TagSoup cleans up HTML 
> ([TIKA-381], [TIKA-985], maybe [TIKA-715]), but TagSoup doesn't seem to be 
> under active development.
> On the other hand I know of several projects that are now using 
> [JSoup|https://github.com/jhy/jsoup], which is an active project (albeit only 
> one main contributor) under the MIT license.
> I haven't looked into how hard it would be to switch this dependency.





[jira] [Commented] (TIKA-2562) tika server parse HTML removes DIVs around hyperlink & adds shape

2018-02-02 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350330#comment-16350330
 ] 

Tim Allison commented on TIKA-2562:
---

This is a "feature" of tagsoup; see, e.g., 
[https://groups.google.com/forum/#!topic/tagsoup-friends/EfB6i12xBLw]. I'm 
hesitant to fix this in Tika because we should probably migrate to jsoup, which 
is actively supported (TIKA-1599).

> tika server parse HTML removes DIVs around hyperlink & adds shape
> -
>
> Key: TIKA-2562
> URL: https://issues.apache.org/jira/browse/TIKA-2562
> Project: Tika
>  Issue Type: Bug
>  Components: gui, parser, server
>Affects Versions: 1.17
>Reporter: NW Brad
>Priority: Major
> Attachments: tika_adds_shape_to_hyperlink.html
>
>
> Hyperlinks in an HTML document that are parsed via tika server:
> curl -X PUT --upload-file tika_adds_shape_to_hyperlink.html 
> [http://localhost:9998/tika] --header "Accept: text/html"
> sent:
> 
>   href="http://www.google.com;>[http://www.google.com|http://www.google.com/]
>  
> received back:
>  href="http://www.google.com;>[http://www.google.com|http://www.google.com/]
>  
> Divs are gone and a shape has been added
>  





[jira] [Resolved] (TIKA-2561) Tika Parser includes oudated/vulnerable version of JSoup

2018-02-02 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2561.
---
   Resolution: Fixed
Fix Version/s: 1.18
   2.0

> Tika Parser includes oudated/vulnerable version of JSoup
> 
>
> Key: TIKA-2561
> URL: https://issues.apache.org/jira/browse/TIKA-2561
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.17
>Reporter: Asela
>Priority: Major
> Fix For: 2.0, 1.18
>
>
> org.apache.tika:tika-parsers:1.17 pulls in dependency JSoup 1.7.2.
>  
> JSoup versions older than 1.8.3 have a vulnerability in parsing.
>  
> https://nvd.nist.gov/vuln/detail/CVE-2015-6748





[jira] [Commented] (TIKA-2561) Tika Parser includes oudated/vulnerable version of JSoup

2018-02-02 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350302#comment-16350302
 ] 

Tim Allison commented on TIKA-2561:
---

This is helpful.  It boggles my imagination that this could be a problem for 
the grib parser in our context, but I've had failures of imagination before, 
and it is better to include deps that don't have known vulns in case another 
parser winds up pulling it in or in case my imagination fails :).  Upgrade 
made.  Thank you!

> Tika Parser includes oudated/vulnerable version of JSoup
> 
>
> Key: TIKA-2561
> URL: https://issues.apache.org/jira/browse/TIKA-2561
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.17
>Reporter: Asela
>Priority: Major
>
> org.apache.tika:tika-parsers:1.17 pulls in dependency JSoup 1.7.2.
>  
> JSoup versions older than 1.8.3 have a vulnerability in parsing.
>  
> https://nvd.nist.gov/vuln/detail/CVE-2015-6748


