[jira] [Commented] (TIKA-2744) rss+xml doesnt accept files with .xml extension

2018-10-17 Thread Martin (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16654681#comment-16654681
 ] 

Martin commented on TIKA-2744:
--

Hello Guys, 

 

I apologize for the late comment. I have added an attachment to this bug report.
(Actually, you can test it on any Jira issue page exported to XML.)

 

Br. Martin

> rss+xml doesnt accept files with .xml extension
> ---
>
> Key: TIKA-2744
> URL: https://issues.apache.org/jira/browse/TIKA-2744
> Project: Tika
>  Issue Type: Bug
>Reporter: Martin
>Priority: Major
> Attachments: rsstest.xml
>
>
> Hello, 
> If I try to validate an application/rss+xml file with an .xml extension, it 
> fails. 
> I would say that is a bug.
> I think the .RSS extension was only used up to version 1.0. From 2.0 on, RSS is 
> XML-based and it should (could) have an .xml extension:
> Source:
> https://www.w3schools.com/xml/xml_rss.asp 
> "Get Your RSS Feed Up On The Web
> Having an RSS document is not useful if other people cannot reach it.
> Now it's time to get your RSS file up on the web. Here are the steps:
> 1. Name your RSS file. Notice that the file must have an .xml extension."
> or specification on Harvard university:
> https://cyber.harvard.edu/rss/rss.html
> there is an example:
> "Its value is the name of the RSS channel that the item came from, derived 
> from its <title>. It has one required attribute, url, which links to the 
> XMLization of the source."
> Example of file:



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (TIKA-2744) rss+xml doesnt accept files with .xml extension

2018-10-17 Thread Martin (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin updated TIKA-2744:
-
Attachment: rsstest.xml

> rss+xml doesnt accept files with .xml extension
> ---
>
> Key: TIKA-2744
> URL: https://issues.apache.org/jira/browse/TIKA-2744
> Project: Tika
>  Issue Type: Bug
>Reporter: Martin
>Priority: Major
> Attachments: rsstest.xml
>
>
> Hello, 
> If I try to validate an application/rss+xml file with an .xml extension, it 
> fails. 
> I would say that is a bug.
> I think the .RSS extension was only used up to version 1.0. From 2.0 on, RSS is 
> XML-based and it should (could) have an .xml extension:
> Source:
> https://www.w3schools.com/xml/xml_rss.asp 
> "Get Your RSS Feed Up On The Web
> Having an RSS document is not useful if other people cannot reach it.
> Now it's time to get your RSS file up on the web. Here are the steps:
> 1. Name your RSS file. Notice that the file must have an .xml extension."
> or specification on Harvard university:
> https://cyber.harvard.edu/rss/rss.html
> there is an example:
> "Its value is the name of the RSS channel that the item came from, derived 
> from its <title>. It has one required attribute, url, which links to the 
> XMLization of the source."
> Example of file:





[jira] [Commented] (TIKA-2734) Tika addes extra characters at the end of text in extracting from excel file

2018-10-17 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16654394#comment-16654394
 ] 

Tim Allison commented on TIKA-2734:
---

It will not.

Let us know if you have any surprises.

> Tika addes extra characters at the end of text in extracting from excel file
> 
>
> Key: TIKA-2734
> URL: https://issues.apache.org/jira/browse/TIKA-2734
> Project: Tika
>  Issue Type: Bug
>  Components: handler
>Affects Versions: 1.18
>Reporter: feng ye
>Priority: Major
> Attachments: AIRPORTSOK.xls, extra_A_Page_P.png
>
>
> when extracting text from some relatively large excel files (9000 rows or 
> so), I found an extra string of " PAGE " is added to the end of the 
> resulting text, when Tika.parseToString is called. Is it a known issue? Is 
> there any configuration that I can do that will opt out from outputting these 
> extra characters?
> did not find a good answer over google. 
> the input excel spreadsheet is attached. 





[jira] [Commented] (TIKA-2734) Tika addes extra characters at the end of text in extracting from excel file

2018-10-17 Thread feng ye (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16654386#comment-16654386
 ] 

feng ye commented on TIKA-2734:
---

Thanks Tim for your detailed tips. 

I am using Tika to extract all kinds of documents, not only MS Office ones. So 
I was afraid that applying OfficeParserConfig with each call would affect 
non-Office file processing. Are you saying it would not? 

> Tika addes extra characters at the end of text in extracting from excel file
> 
>
> Key: TIKA-2734
> URL: https://issues.apache.org/jira/browse/TIKA-2734
> Project: Tika
>  Issue Type: Bug
>  Components: handler
>Affects Versions: 1.18
>Reporter: feng ye
>Priority: Major
> Attachments: AIRPORTSOK.xls, extra_A_Page_P.png
>
>
> when extracting text from some relatively large excel files (9000 rows or 
> so), I found an extra string of " PAGE " is added to the end of the 
> resulting text, when Tika.parseToString is called. Is it a known issue? Is 
> there any configuration that I can do that will opt out from outputting these 
> extra characters?
> did not find a good answer over google. 
> the input excel spreadsheet is attached. 





[jira] [Resolved] (TIKA-2577) Sonatype Nexus Auditor is reporting that the Bouncy castle version used by Tika 1.17 is vulnerable

2018-10-17 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2577.
---
   Resolution: Fixed
Fix Version/s: 1.19

> Sonatype Nexus Auditor is reporting that the Bouncy castle version used by 
> Tika 1.17 is vulnerable
> --
>
> Key: TIKA-2577
> URL: https://issues.apache.org/jira/browse/TIKA-2577
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.17
>Reporter: Abhijit Rajwade
>Priority: Major
> Fix For: 1.19
>
>
> Sonatype Nexus Auditor is reporting that the Bouncy castle version used by 
> Tika 1.17 (tika-app-1.17.jar) is vulnerable.
> Here are the details of CVE-2016-1000341.
>  
> *Explanation*
> {{BouncyCastle}} is vulnerable to a Timing Attack. The 
> {{generateSignature()}} function in the {{DSASigner.java}} file allows the 
> per message key (the {{k}} value in the DSA algorithm) to be predictable 
> while generating DSA signatures. A remote attacker can exploit this 
> vulnerability to determine the {{k}} value by closely observing the timings 
> for the generation of signatures, allowing the attacker to deduce the 
> signer's private key.
> Detection
> The application is vulnerable by using this component.
>  
> *Recommendation*
> We recommend upgrading to a version of this component that is not vulnerable 
> to this specific issue.
> Categories
> Data
>  
> *Root Cause*
> tika-app-1.17.jar *<=* DSASigner.class : (, 1.56)
> tika-app-1.17.jar *<=* DSASigner.class : (,1.56)
> Advisories
> Third Party: 
> [https://rdist.root.org/2010/11/19/dsa-requirements-for-rando...|https://rdist.root.org/2010/11/19/dsa-requirements-for-random-k-value/]
> Project: [https://www.bouncycastle.org/releasenotes.html]
>  
> *Resolution*
> Refer to [https://www.bouncycastle.org/releasenotes.html]
> You can see that Bouncy Castle version 1.56 fixes CVE-2016-1000341.
> Recommend that Apache Tika upgrade Bouncy Castle to version 1.56 or later.
> --- Abhijit Rajwade
>  





[jira] [Commented] (TIKA-2577) Sonatype Nexus Auditor is reporting that the Bouncy castle version used by Tika 1.17 is vulnerable

2018-10-17 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16654010#comment-16654010
 ] 

Tim Allison commented on TIKA-2577:
---

Agreed. Tika 1.19.1 uses BouncyCastle 1.60.  I just added the 
{{versions-maven-plugin}} so that we can track updates more easily.

> Sonatype Nexus Auditor is reporting that the Bouncy castle version used by 
> Tika 1.17 is vulnerable
> --
>
> Key: TIKA-2577
> URL: https://issues.apache.org/jira/browse/TIKA-2577
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.17
>Reporter: Abhijit Rajwade
>Priority: Major
>
> Sonatype Nexus Auditor is reporting that the Bouncy castle version used by 
> Tika 1.17 (tika-app-1.17.jar) is vulnerable.
> Here are the details of CVE-2016-1000341.
>  
> *Explanation*
> {{BouncyCastle}} is vulnerable to a Timing Attack. The 
> {{generateSignature()}} function in the {{DSASigner.java}} file allows the 
> per message key (the {{k}} value in the DSA algorithm) to be predictable 
> while generating DSA signatures. A remote attacker can exploit this 
> vulnerability to determine the {{k}} value by closely observing the timings 
> for the generation of signatures, allowing the attacker to deduce the 
> signer's private key.
> Detection
> The application is vulnerable by using this component.
>  
> *Recommendation*
> We recommend upgrading to a version of this component that is not vulnerable 
> to this specific issue.
> Categories
> Data
>  
> *Root Cause*
> tika-app-1.17.jar *<=* DSASigner.class : (, 1.56)
> tika-app-1.17.jar *<=* DSASigner.class : (,1.56)
> Advisories
> Third Party: 
> [https://rdist.root.org/2010/11/19/dsa-requirements-for-rando...|https://rdist.root.org/2010/11/19/dsa-requirements-for-random-k-value/]
> Project: [https://www.bouncycastle.org/releasenotes.html]
>  
> *Resolution*
> Refer to [https://www.bouncycastle.org/releasenotes.html]
> You can see that Bouncy Castle version 1.56 fixes CVE-2016-1000341.
> Recommend that Apache Tika upgrade Bouncy Castle to version 1.56 or later.
> --- Abhijit Rajwade
>  





[jira] [Commented] (TIKA-2577) Sonatype Nexus Auditor is reporting that the Bouncy castle version used by Tika 1.17 is vulnerable

2018-10-17 Thread Andrew Pavlin (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16653994#comment-16653994
 ] 

Andrew Pavlin commented on TIKA-2577:
-

I have to agree with the comment. Next build should include the latest 
BouncyCastle release, so as to avoid CVE issues. After all, just because Tika 
isn't using the vulnerable parts of BouncyCastle doesn't mean other parts of 
the application using Tika couldn't call the defective BouncyCastle code.
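
To illustrate why a predictable or leaked per-message {{k}} (the subject of the CVE text quoted below) exposes the signer's private key, here is a toy sketch based on the textbook DSA signing equation s = k^-1 (h + x*r) mod q. The class name {{DsaNonceLeak}}, its methods and the tiny parameters are illustrative assumptions; this is not BouncyCastle code, and the numbers are far too small for real use.

```java
import java.math.BigInteger;

// Hypothetical demo class; not part of Tika or BouncyCastle.
class DsaNonceLeak {

    // Textbook DSA signing: r = (g^k mod p) mod q, s = k^-1 * (h + x*r) mod q.
    static BigInteger[] sign(BigInteger p, BigInteger q, BigInteger g,
                             BigInteger x, BigInteger k, BigInteger h) {
        BigInteger r = g.modPow(k, p).mod(q);
        BigInteger s = k.modInverse(q).multiply(h.add(x.multiply(r))).mod(q);
        return new BigInteger[] {r, s};
    }

    // Rearranging the signing equation: x = (s*k - h) * r^-1 mod q.
    // Anyone who learns k for a single signature can compute the private key x.
    static BigInteger recoverPrivateKey(BigInteger r, BigInteger s,
                                        BigInteger k, BigInteger h, BigInteger q) {
        return s.multiply(k).subtract(h).multiply(r.modInverse(q)).mod(q);
    }

    public static void main(String[] args) {
        // Toy group (assumption for illustration): g = 4 has order q = 11 modulo p = 23.
        BigInteger p = BigInteger.valueOf(23);
        BigInteger q = BigInteger.valueOf(11);
        BigInteger g = BigInteger.valueOf(4);
        BigInteger x = BigInteger.valueOf(7); // private key
        BigInteger k = BigInteger.valueOf(3); // per-message secret
        BigInteger h = BigInteger.valueOf(5); // stand-in for the message hash
        BigInteger[] sig = sign(p, q, g, x, k, h);
        System.out.println("recovered x = " + recoverPrivateKey(sig[0], sig[1], k, h, q));
    }
}
```

The point of the sketch is only that the signature, the message hash and {{k}} together determine {{x}}, which is why a timing side channel on {{k}} is treated as a key-recovery issue.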

> Sonatype Nexus Auditor is reporting that the Bouncy castle version used by 
> Tika 1.17 is vulnerable
> --
>
> Key: TIKA-2577
> URL: https://issues.apache.org/jira/browse/TIKA-2577
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.17
>Reporter: Abhijit Rajwade
>Priority: Major
>
> Sonatype Nexus Auditor is reporting that the Bouncy castle version used by 
> Tika 1.17 (tika-app-1.17.jar) is vulnerable.
> Here are the details of CVE-2016-1000341.
>  
> *Explanation*
> {{BouncyCastle}} is vulnerable to a Timing Attack. The 
> {{generateSignature()}} function in the {{DSASigner.java}} file allows the 
> per message key (the {{k}} value in the DSA algorithm) to be predictable 
> while generating DSA signatures. A remote attacker can exploit this 
> vulnerability to determine the {{k}} value by closely observing the timings 
> for the generation of signatures, allowing the attacker to deduce the 
> signer's private key.
> Detection
> The application is vulnerable by using this component.
>  
> *Recommendation*
> We recommend upgrading to a version of this component that is not vulnerable 
> to this specific issue.
> Categories
> Data
>  
> *Root Cause*
> tika-app-1.17.jar *<=* DSASigner.class : (, 1.56)
> tika-app-1.17.jar *<=* DSASigner.class : (,1.56)
> Advisories
> Third Party: 
> [https://rdist.root.org/2010/11/19/dsa-requirements-for-rando...|https://rdist.root.org/2010/11/19/dsa-requirements-for-random-k-value/]
> Project: [https://www.bouncycastle.org/releasenotes.html]
>  
> *Resolution*
> Refer to [https://www.bouncycastle.org/releasenotes.html]
> You can see that Bouncy Castle version 1.56 fixes CVE-2016-1000341.
> Recommend that Apache Tika upgrade Bouncy Castle to version 1.56 or later.
> --- Abhijit Rajwade
>  





tika-2.x-windows - Build # 336 - Still Failing

2018-10-17 Thread Apache Jenkins Server
The Apache Jenkins build system has built tika-2.x-windows (build #336)

Status: Still Failing

Check console output at https://builds.apache.org/job/tika-2.x-windows/336/ to 
view the results.

[jira] [Commented] (TIKA-2756) Switch to commons-lang 3

2018-10-17 Thread Robert Munteanu (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16653819#comment-16653819
 ] 

Robert Munteanu commented on TIKA-2756:
---

Thanks for looking into this [~talli...@apache.org]!

> Switch to commons-lang 3
> 
>
> Key: TIKA-2756
> URL: https://issues.apache.org/jira/browse/TIKA-2756
> Project: Tika
>  Issue Type: Improvement
>Reporter: Robert Munteanu
>Priority: Major
>
> Tika 1.9.1 is using the legacy commons-lang 2.x series. This series is not 
> going to receive updates anymore and is completely superseded by commons-lang 
> 3.x.
> Projects that use Tika are blocked from dropping commons-lang 2.x due to this 
> dependency.
> The link that I found was from tika-parsers to jackcess and then to 
> commons-lang 2.6
> {noformat}
> [INFO] +- com.healthmarketscience.jackcess:jackcess:jar:2.1.12:compile
> [INFO] |  \- commons-lang:commons-lang:jar:2.6:compile
> {noformat}
> If I understand correctly, this is the only commons-lang 2.x dependency from 
> the Tika runtime and it would be great to remove it.





[jira] [Commented] (TIKA-2757) Add versions-maven-plugin

2018-10-17 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16653814#comment-16653814
 ] 

Hudson commented on TIKA-2757:
--

ABORTED: Integrated in Jenkins build Tika-trunk #1580 (See 
[https://builds.apache.org/job/Tika-trunk/1580/])
TIKA-2757 -- add versions plugin (tallison: 
[https://github.com/apache/tika/commit/5310f17901f8f6732900472eb20c436bad79779a])
* (edit) tika-parent/pom.xml


> Add versions-maven-plugin 
> --
>
> Key: TIKA-2757
> URL: https://issues.apache.org/jira/browse/TIKA-2757
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Trivial
>
> Embarrassed I didn't know about this plugin.  :D
> Very, very helpful...
> {noformat}
> mvn versions:display-plugin-updates
> mvn versions:display-dependency-updates
> {noformat}





[jira] [Commented] (TIKA-2744) rss+xml doesnt accept files with .xml extension

2018-10-17 Thread Nick Burch (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16653808#comment-16653808
 ] 

Nick Burch commented on TIKA-2744:
--

I've added a test RSS 2.0 file to Tika's test documents, and it's correctly 
detected for me whether called {{rsstest_20.rss}} or {{rsstest_20.rss.xml}}

Can you give us some more details on how you're calling Tika, what file(s) 
you're having the trouble with, and exactly what isn't working?
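
As a minimal sketch of why the file name should not matter when detection looks at content, the snippet below checks the root element of the XML instead of the extension. It is not Tika's implementation: the class and method names ({{RssSniffer}}, {{looksLikeRss}}) are hypothetical, only the JDK's StAX API is used, and it only covers feeds whose root element is {{rss}} (RSS 0.9x/2.0), not RDF-based RSS 1.0.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration; not part of Tika.
class RssSniffer {

    // Returns true if the first element of the document is <rss>, whatever the file is called.
    static boolean looksLikeRss(String xml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Do not process DTDs or resolve external entities while sniffing untrusted input.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        // Only the root element is inspected.
                        return "rss".equals(reader.getLocalName());
                    }
                }
                return false;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            // Not well-formed XML, so certainly not an RSS feed.
            return false;
        }
    }

    public static void main(String[] args) {
        String feed = "<?xml version=\"1.0\"?><rss version=\"2.0\"><channel/></rss>";
        System.out.println(looksLikeRss(feed));
    }
}
```

With an approach like this, {{rsstest.xml}} and {{rsstest.rss}} would be treated the same, since the decision is made from the bytes rather than the name.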

> rss+xml doesnt accept files with .xml extension
> ---
>
> Key: TIKA-2744
> URL: https://issues.apache.org/jira/browse/TIKA-2744
> Project: Tika
>  Issue Type: Bug
>Reporter: Martin
>Priority: Major
>
> Hello, 
> If I try to validate an application/rss+xml file with an .xml extension, it 
> fails. 
> I would say that is a bug.
> I think the .RSS extension was only used up to version 1.0. From 2.0 on, RSS is 
> XML-based and it should (could) have an .xml extension:
> Source:
> https://www.w3schools.com/xml/xml_rss.asp 
> "Get Your RSS Feed Up On The Web
> Having an RSS document is not useful if other people cannot reach it.
> Now it's time to get your RSS file up on the web. Here are the steps:
> 1. Name your RSS file. Notice that the file must have an .xml extension."
> or specification on Harvard university:
> https://cyber.harvard.edu/rss/rss.html
> there is an example:
> "Its value is the name of the RSS channel that the item came from, derived 
> from its <title>. It has one required attribute, url, which links to the 
> XMLization of the source."
> Example of file:





[jira] [Commented] (TIKA-2543) No content extraction for application/x-webarchive format

2018-10-17 Thread Nick Burch (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16653804#comment-16653804
 ] 

Nick Burch commented on TIKA-2543:
--

Great find Tim! Looks like an excellent resource on this.

Assuming access to a Mac so you have the {{plutil}} tool to be able to generate 
(and check!) a bunch of representative test files, and helped by the various 
Tika IO helpers we have, my hunch is it'd be about a day's work to add support 
for the binary plist format + test it properly + wire into Tika. Maybe allow 2 
days if new to Tika and/or new to decoding binary file formats.
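
To give a sense of the first step of such a parser, here is a rough sketch that checks the {{bplist}} magic and reads the 32-byte trailer at the end of the file. The layout used (6 unused/sort-version bytes, two 1-byte sizes, then three big-endian 64-bit values) follows the commonly documented {{bplist00}} format; the class and field names are hypothetical, and nothing here is existing Tika code.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of Tika.
class BPlistTrailer {
    final int offsetIntSize;      // bytes per entry in the offset table
    final int objectRefSize;      // bytes per object reference
    final long numObjects;        // number of objects in the file
    final long topObject;         // index of the root object
    final long offsetTableOffset; // file position of the offset table

    private BPlistTrailer(int offsetIntSize, int objectRefSize,
                          long numObjects, long topObject, long offsetTableOffset) {
        this.offsetIntSize = offsetIntSize;
        this.objectRefSize = objectRefSize;
        this.numObjects = numObjects;
        this.topObject = topObject;
        this.offsetTableOffset = offsetTableOffset;
    }

    static BPlistTrailer read(byte[] data) {
        // 8-byte header ("bplist" + 2 version characters) plus the 32-byte trailer.
        if (data.length < 8 + 32) {
            throw new IllegalArgumentException("too short for a binary plist");
        }
        String magic = new String(data, 0, 6, StandardCharsets.US_ASCII);
        if (!"bplist".equals(magic)) {
            throw new IllegalArgumentException("missing bplist magic");
        }
        // The trailer is the last 32 bytes; ByteBuffer reads big-endian by default.
        ByteBuffer trailer = ByteBuffer.wrap(data, data.length - 32, 32);
        trailer.position(trailer.position() + 6); // skip 5 unused bytes + 1 sort-version byte
        int offsetIntSize = trailer.get() & 0xFF;
        int objectRefSize = trailer.get() & 0xFF;
        long numObjects = trailer.getLong();
        long topObject = trailer.getLong();
        long offsetTableOffset = trailer.getLong();
        return new BPlistTrailer(offsetIntSize, objectRefSize,
                numObjects, topObject, offsetTableOffset);
    }
}
```

A real parser would go on to read the offset table and decode each object by its marker byte, and should validate every size and offset against the file length before trusting it.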

> No content extraction for application/x-webarchive format
> -
>
> Key: TIKA-2543
> URL: https://issues.apache.org/jira/browse/TIKA-2543
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.17
> Environment: MacOS 10.13.2 JDK8 
>Reporter: Rafael Ferreira
>Priority: Minor
> Attachments: Apache Tika – Configuring Tika.webarchive, tika.plist
>
>
> Steps to reproduce: 
> # Using safari save any web page as "webarchive" 
> # Use tika to extract the archive content like the example below
> Expected result: 
> I would expect tika to extract the html contents from the webarchive
> Actual results:
> Nothing is extracted albeit the right mime type is identified. 
> {code:java}
> try (BufferedWriter writer = Files.newBufferedWriter(extractedContentPath, Charsets.UTF_8)) {
>   TesseractOCRConfig tesseractOCRConfig = new TesseractOCRConfig();
>   // this looks for content anywhere in the page independently of orientation
>   tesseractOCRConfig.setPageSegMode("11");
>   ParseContext context = new ParseContext();
>   context.set(Parser.class, tika.getParser());
>   context.set(TesseractOCRConfig.class, tesseractOCRConfig);
>   try (InputStream fd = Files.newInputStream(path)) {
>     tika.getParser().parse(fd, new WriteOutContentHandler(writer), new Metadata(), context);
>   } catch (SAXException e) {
>     throw new EngineError(e);
>   }
> }
> {code}





[jira] [Updated] (TIKA-2543) No content extraction for application/x-webarchive format

2018-10-17 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2543:
--
Attachment: tika.plist

> No content extraction for application/x-webarchive format
> -
>
> Key: TIKA-2543
> URL: https://issues.apache.org/jira/browse/TIKA-2543
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.17
> Environment: MacOS 10.13.2 JDK8 
>Reporter: Rafael Ferreira
>Priority: Minor
> Attachments: Apache Tika – Configuring Tika.webarchive, tika.plist
>
>
> Steps to reproduce: 
> # Using safari save any web page as "webarchive" 
> # Use tika to extract the archive content like the example below
> Expected result: 
> I would expect tika to extract the html contents from the webarchive
> Actual results:
> Nothing is extracted albeit the right mime type is identified. 
> {code:java}
> try (BufferedWriter writer = Files.newBufferedWriter(extractedContentPath, Charsets.UTF_8)) {
>   TesseractOCRConfig tesseractOCRConfig = new TesseractOCRConfig();
>   // this looks for content anywhere in the page independently of orientation
>   tesseractOCRConfig.setPageSegMode("11");
>   ParseContext context = new ParseContext();
>   context.set(Parser.class, tika.getParser());
>   context.set(TesseractOCRConfig.class, tesseractOCRConfig);
>   try (InputStream fd = Files.newInputStream(path)) {
>     tika.getParser().parse(fd, new WriteOutContentHandler(writer), new Metadata(), context);
>   } catch (SAXException e) {
>     throw new EngineError(e);
>   }
> }
> {code}





[jira] [Commented] (TIKA-2543) No content extraction for application/x-webarchive format

2018-10-17 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16653764#comment-16653764
 ] 

Tim Allison commented on TIKA-2543:
---

https://medium.com/@karaiskc/understanding-apples-binary-property-list-format-281e6da00dbd

> No content extraction for application/x-webarchive format
> -
>
> Key: TIKA-2543
> URL: https://issues.apache.org/jira/browse/TIKA-2543
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.17
> Environment: MacOS 10.13.2 JDK8 
>Reporter: Rafael Ferreira
>Priority: Minor
> Attachments: Apache Tika – Configuring Tika.webarchive
>
>
> Steps to reproduce: 
> # Using safari save any web page as "webarchive" 
> # Use tika to extract the archive content like the example below
> Expected result: 
> I would expect tika to extract the html contents from the webarchive
> Actual results:
> Nothing is extracted albeit the right mime type is identified. 
> {code:java}
> try (BufferedWriter writer = Files.newBufferedWriter(extractedContentPath, Charsets.UTF_8)) {
>   TesseractOCRConfig tesseractOCRConfig = new TesseractOCRConfig();
>   // this looks for content anywhere in the page independently of orientation
>   tesseractOCRConfig.setPageSegMode("11");
>   ParseContext context = new ParseContext();
>   context.set(Parser.class, tika.getParser());
>   context.set(TesseractOCRConfig.class, tesseractOCRConfig);
>   try (InputStream fd = Files.newInputStream(path)) {
>     tika.getParser().parse(fd, new WriteOutContentHandler(writer), new Metadata(), context);
>   } catch (SAXException e) {
>     throw new EngineError(e);
>   }
> }
> {code}





[jira] [Commented] (TIKA-2543) No content extraction for application/x-webarchive format

2018-10-17 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16653725#comment-16653725
 ] 

Tim Allison commented on TIKA-2543:
---

Still on the lookout for a Java parser with an Apache-friendly license that parses 
binary plists.  commons-configuration handles flat text and XML, but not the 
more modern binary one.

> No content extraction for application/x-webarchive format
> -
>
> Key: TIKA-2543
> URL: https://issues.apache.org/jira/browse/TIKA-2543
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.17
> Environment: MacOS 10.13.2 JDK8 
>Reporter: Rafael Ferreira
>Priority: Minor
> Attachments: Apache Tika – Configuring Tika.webarchive
>
>
> Steps to reproduce: 
> # Using safari save any web page as "webarchive" 
> # Use tika to extract the archive content like the example below
> Expected result: 
> I would expect tika to extract the html contents from the webarchive
> Actual results:
> Nothing is extracted albeit the right mime type is identified. 
> {code:java}
> try (BufferedWriter writer = Files.newBufferedWriter(extractedContentPath, Charsets.UTF_8)) {
>   TesseractOCRConfig tesseractOCRConfig = new TesseractOCRConfig();
>   // this looks for content anywhere in the page independently of orientation
>   tesseractOCRConfig.setPageSegMode("11");
>   ParseContext context = new ParseContext();
>   context.set(Parser.class, tika.getParser());
>   context.set(TesseractOCRConfig.class, tesseractOCRConfig);
>   try (InputStream fd = Files.newInputStream(path)) {
>     tika.getParser().parse(fd, new WriteOutContentHandler(writer), new Metadata(), context);
>   } catch (SAXException e) {
>     throw new EngineError(e);
>   }
> }
> {code}





[jira] [Commented] (TIKA-2757) Add versions-maven-plugin

2018-10-17 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16653721#comment-16653721
 ] 

Hudson commented on TIKA-2757:
--

SUCCESS: Integrated in Jenkins build tika-branch-1x #117 (See 
[https://builds.apache.org/job/tika-branch-1x/117/])
TIKA-2757 -- add versions plugin (tallison: 
[https://github.com/apache/tika/commit/889c2c99d9a6690fc5bc8b8135ebba66fbdd0772])
* (edit) tika-parent/pom.xml


> Add versions-maven-plugin 
> --
>
> Key: TIKA-2757
> URL: https://issues.apache.org/jira/browse/TIKA-2757
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Trivial
>
> Embarrassed I didn't know about this plugin.  :D
> Very, very helpful...
> {noformat}
> mvn versions:display-plugin-updates
> mvn versions:display-dependency-updates
> {noformat}





[jira] [Commented] (TIKA-2756) Switch to commons-lang 3

2018-10-17 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16653720#comment-16653720
 ] 

Hudson commented on TIKA-2756:
--

SUCCESS: Integrated in Jenkins build tika-branch-1x #117 (See 
[https://builds.apache.org/job/tika-branch-1x/117/])
TIKA-2756 -- factor out code that relies on the old commons-lang... once 
(tallison: 
[https://github.com/apache/tika/commit/65d18af4a0cae4838e9e1a33b3c3e1eda55f5b28])
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRConfigTest.java
* (edit) tika-server/pom.xml
* (edit) tika-bundle/pom.xml
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/microsoft/JackcessOleUtil.java
* (edit) tika-parent/pom.xml
* (edit) 
tika-server/src/main/java/org/apache/tika/server/resource/UnpackerResource.java
* (edit) tika-parsers/pom.xml
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/microsoft/JackcessCompoundOleUtil.java
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/charsets/XUserDefinedCharset.java
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/wordperfect/WPInputStream.java
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/BouncyCastleDigestingParserTest.java
* (edit) tika-dl/pom.xml
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/font/FontParsersTest.java
* (edit) 
tika-server/src/main/java/org/apache/tika/server/resource/TikaResource.java
* (edit) 
tika-server/src/main/java/org/apache/tika/server/TikaServerWatchDog.java
* (edit) tika-app/src/main/java/org/apache/tika/cli/BatchCommandLineBuilder.java
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/recognition/ObjectRecognitionParserTest.java
* (edit) tika-eval/pom.xml


> Switch to commons-lang 3
> 
>
> Key: TIKA-2756
> URL: https://issues.apache.org/jira/browse/TIKA-2756
> Project: Tika
>  Issue Type: Improvement
>Reporter: Robert Munteanu
>Priority: Major
>
> Tika 1.9.1 is using the legacy commons-lang 2.x series. This series is not 
> going to receive updates anymore and is completely superseded by commons-lang 
> 3.x.
> Projects that use Tika are blocked from dropping commons-lang 2.x due to this 
> dependency.
> The link that I found was from tika-parsers to jackcess and then to 
> commons-lang 2.6
> {noformat}
> [INFO] +- com.healthmarketscience.jackcess:jackcess:jar:2.1.12:compile
> [INFO] |  \- commons-lang:commons-lang:jar:2.6:compile
> {noformat}
> If I understand correctly, this is the only commons-lang 2.x dependency from 
> the Tika runtime and it would be great to remove it.





[jira] [Commented] (TIKA-2543) No content extraction for application/x-webarchive format

2018-10-17 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16653718#comment-16653718
 ] 

Tim Allison commented on TIKA-2543:
---

TIKA-1358 might be relevant.  We don't currently parse modern Apple files. :(

> No content extraction for application/x-webarchive format
> -
>
> Key: TIKA-2543
> URL: https://issues.apache.org/jira/browse/TIKA-2543
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.17
> Environment: MacOS 10.13.2 JDK8 
>Reporter: Rafael Ferreira
>Priority: Minor
> Attachments: Apache Tika – Configuring Tika.webarchive
>
>
> Steps to reproduce: 
> # Using safari save any web page as "webarchive" 
> # Use tika to extract the archive content like the example below
> Expected result: 
> I would expect tika to extract the html contents from the webarchive
> Actual results:
> Nothing is extracted albeit the right mime type is identified. 
> {code:java}
> try (BufferedWriter writer = Files.newBufferedWriter(extractedContentPath, Charsets.UTF_8)) {
>   TesseractOCRConfig tesseractOCRConfig = new TesseractOCRConfig();
>   // this looks for content anywhere in the page independently of orientation
>   tesseractOCRConfig.setPageSegMode("11");
>   ParseContext context = new ParseContext();
>   context.set(Parser.class, tika.getParser());
>   context.set(TesseractOCRConfig.class, tesseractOCRConfig);
>   try (InputStream fd = Files.newInputStream(path)) {
>     tika.getParser().parse(fd, new WriteOutContentHandler(writer), new Metadata(), context);
>   } catch (SAXException e) {
>     throw new EngineError(e);
>   }
> }
> {code}





[jira] [Commented] (TIKA-2543) No content extraction for application/x-webarchive format

2018-10-17 Thread Rafael Ferreira (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16653715#comment-16653715
 ] 

Rafael Ferreira commented on TIKA-2543:
---

If someone can point me to the general area of the problem, I'm happy to try to 
get a PR out myself.

Could it be a MIME identification issue causing the correct parser not to be 
called? 

> No content extraction for application/x-webarchive format
> -
>
> Key: TIKA-2543
> URL: https://issues.apache.org/jira/browse/TIKA-2543
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.17
> Environment: MacOS 10.13.2 JDK8 
>Reporter: Rafael Ferreira
>Priority: Minor
> Attachments: Apache Tika – Configuring Tika.webarchive
>
>
> Steps to reproduce: 
> # Using safari save any web page as "webarchive" 
> # Use tika to extract the archive content like the example below
> Expected result: 
> I would expect tika to extract the html contents from the webarchive
> Actual results:
> Nothing is extracted albeit the right mime type is identified. 
> {code:java}
> try (BufferedWriter writer = Files.newBufferedWriter(extractedContentPath, Charsets.UTF_8)) {
>   TesseractOCRConfig tesseractOCRConfig = new TesseractOCRConfig();
>   // this looks for content anywhere in the page independently of orientation
>   tesseractOCRConfig.setPageSegMode("11");
>   ParseContext context = new ParseContext();
>   context.set(Parser.class, tika.getParser());
>   context.set(TesseractOCRConfig.class, tesseractOCRConfig);
>   try (InputStream fd = Files.newInputStream(path)) {
>     tika.getParser().parse(fd, new WriteOutContentHandler(writer), new Metadata(), context);
>   } catch (SAXException e) {
>     throw new EngineError(e);
>   }
> }
> {code}





[jira] [Commented] (TIKA-2543) No content extraction for application/x-webarchive format

2018-10-17 Thread Rafael Ferreira (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16653711#comment-16653711
 ] 

Rafael Ferreira commented on TIKA-2543:
---

This seems like a more widespread issue than I imagined: extracting content 
from any plist does not seem to work at the moment. Trying to parse a Pages 
file (Pages version 7.2) triggers the EmptyParser, and no text is extracted. 

> No content extraction for application/x-webarchive format
> -
>
> Key: TIKA-2543
> URL: https://issues.apache.org/jira/browse/TIKA-2543
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.17
> Environment: MacOS 10.13.2 JDK8 
>Reporter: Rafael Ferreira
>Priority: Minor
> Attachments: Apache Tika – Configuring Tika.webarchive
>
>
> Steps to reproduce: 
> # Using Safari, save any web page as "webarchive" 
> # Use Tika to extract the archive content as in the example below
> Expected result: 
> I would expect Tika to extract the HTML contents from the webarchive
> Actual results:
> Nothing is extracted, although the right MIME type is identified. 
> {code:java}
>  try (BufferedWriter writer = Files.newBufferedWriter(extractedContentPath, Charsets.UTF_8)) {
>    TesseractOCRConfig tesseractOCRConfig = new TesseractOCRConfig();
>    // this looks for content anywhere in the page independently of orientation
>    tesseractOCRConfig.setPageSegMode("11");
>    ParseContext context = new ParseContext();
>    context.set(Parser.class, tika.getParser());
>    context.set(TesseractOCRConfig.class, tesseractOCRConfig);
>    try (InputStream fd = Files.newInputStream(path)) {
>      tika.getParser().parse(fd, new WriteOutContentHandler(writer), new Metadata(), context);
>    } catch (SAXException e) {
>      throw new EngineError(e);
>    }
>  }
> {code}





tika-2.x-windows - Build # 335 - Still Failing

2018-10-17 Thread Apache Jenkins Server
The Apache Jenkins build system has built tika-2.x-windows (build #335)

Status: Still Failing

Check console output at https://builds.apache.org/job/tika-2.x-windows/335/ to 
view the results.

[jira] [Commented] (TIKA-2757) Add versions-maven-plugin

2018-10-17 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16653614#comment-16653614
 ] 

Hudson commented on TIKA-2757:
--

FAILURE: Integrated in Jenkins build tika-2.x-windows #335 (See 
[https://builds.apache.org/job/tika-2.x-windows/335/])
TIKA-2757 -- add versions plugin (tallison: rev 
5310f17901f8f6732900472eb20c436bad79779a)
* (edit) tika-parent/pom.xml


> Add versions-maven-plugin 
> --
>
> Key: TIKA-2757
> URL: https://issues.apache.org/jira/browse/TIKA-2757
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Trivial
>
> Embarrassed I didn't know about this plugin.  :D
> Very, very helpful...
> {noformat}
> mvn versions:display-plugin-updates
> mvn versions:display-dependency-updates
> {noformat}





[jira] [Commented] (TIKA-2756) Switch to commons-lang 3

2018-10-17 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16653613#comment-16653613
 ] 

Hudson commented on TIKA-2756:
--

SUCCESS: Integrated in Jenkins build Tika-trunk #1579 (See 
[https://builds.apache.org/job/Tika-trunk/1579/])
TIKA-2756 -- factor out code that relies on the old commons-lang... once 
(tallison: 
[https://github.com/apache/tika/commit/86e997510b44f12dc9f90a68aaf583d5d3912892])
* (edit) 
tika-server/src/main/java/org/apache/tika/server/resource/TikaResource.java
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/wordperfect/WPInputStream.java
* (edit) tika-server/pom.xml
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/font/FontParsersTest.java
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/recognition/ObjectRecognitionParserTest.java
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/microsoft/JackcessCompoundOleUtil.java
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/BouncyCastleDigestingParserTest.java
* (edit) 
tika-server/src/main/java/org/apache/tika/server/resource/UnpackerResource.java
* (edit) tika-dl/pom.xml
* (edit) 
tika-server/src/main/java/org/apache/tika/server/TikaServerWatchDog.java
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/charsets/XUserDefinedCharset.java
* (edit) tika-bundle/pom.xml
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRConfigTest.java
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
* (edit) tika-app/src/main/java/org/apache/tika/cli/BatchCommandLineBuilder.java
* (edit) tika-eval/pom.xml
* (edit) tika-parent/pom.xml
* (edit) tika-parsers/pom.xml
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/microsoft/JackcessOleUtil.java


> Switch to commons-lang 3
> 
>
> Key: TIKA-2756
> URL: https://issues.apache.org/jira/browse/TIKA-2756
> Project: Tika
>  Issue Type: Improvement
>Reporter: Robert Munteanu
>Priority: Major
>
> Tika 1.9.1 is using the legacy commons-lang 2.x series. This series is not 
> going to receive updates anymore and is completely superseded by commons-lang 
> 3.x .
> Projects that use Tika are blocked from dropping commons-lang 2.x due to this 
> dependency.
> The link that I found was from tika-parsers to jackcess and then to 
> commons-lang 2.6
> {noformat}
> [INFO] +- com.healthmarketscience.jackcess:jackcess:jar:2.1.12:compile
> [INFO] |  \- commons-lang:commons-lang:jar:2.6:compile
> {noformat}
> If I understand correctly, this is the only commons-lang 2.x dependency from 
> the Tika runtime and it would be great to remove it.





[jira] [Commented] (TIKA-2756) Switch to commons-lang 3

2018-10-17 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16653603#comment-16653603
 ] 

Tim Allison commented on TIKA-2756:
---

I refactored the parts of our code that rely on {{commons-lang}}.  Once 
Jackcess upgrades to {{lang3}}, we'll be good to go.  I also upgraded Jackcess 
while I was at it...

> Switch to commons-lang 3
> 
>
> Key: TIKA-2756
> URL: https://issues.apache.org/jira/browse/TIKA-2756
> Project: Tika
>  Issue Type: Improvement
>Reporter: Robert Munteanu
>Priority: Major
>
> Tika 1.9.1 is using the legacy commons-lang 2.x series. This series is not 
> going to receive updates anymore and is completely superseded by commons-lang 
> 3.x .
> Projects that use Tika are blocked from dropping commons-lang 2.x due to this 
> dependency.
> The link that I found was from tika-parsers to jackcess and then to 
> commons-lang 2.6
> {noformat}
> [INFO] +- com.healthmarketscience.jackcess:jackcess:jar:2.1.12:compile
> [INFO] |  \- commons-lang:commons-lang:jar:2.6:compile
> {noformat}
> If I understand correctly, this is the only commons-lang 2.x dependency from 
> the Tika runtime and it would be great to remove it.





[jira] [Commented] (TIKA-2757) Add versions-maven-plugin

2018-10-17 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16653558#comment-16653558
 ] 

Tim Allison commented on TIKA-2757:
---

When I run {{versions:display-plugin-updates}}, I get this error message:

{noformat}
[ERROR] Project requires an incorrect minimum version of Maven.
[ERROR] Update the pom.xml to contain maven-enforcer-plugin to
[ERROR] force the maven version which is needed to build this project.
[ERROR] See 
https://maven.apache.org/enforcer/enforcer-rules/requireMavenVersion.html
[ERROR] Using the minimum version of Maven: 3.0
[INFO] Project inherits minimum Maven version as: 3.0
[INFO] Plugins require minimum Maven version of: 3.0.5
{noformat}

Are we OK with specifying a minimum Maven version?  If so, what do we want to set?

> Add versions-maven-plugin 
> --
>
> Key: TIKA-2757
> URL: https://issues.apache.org/jira/browse/TIKA-2757
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Trivial
>
> Embarrassed I didn't know about this plugin.  :D
> Very, very helpful...
> {noformat}
> mvn versions:display-plugin-updates
> mvn versions:display-dependency-updates
> {noformat}





[jira] [Created] (TIKA-2757) Add versions-maven-plugin

2018-10-17 Thread Tim Allison (JIRA)
Tim Allison created TIKA-2757:
-

 Summary: Add versions-maven-plugin 
 Key: TIKA-2757
 URL: https://issues.apache.org/jira/browse/TIKA-2757
 Project: Tika
  Issue Type: Task
Reporter: Tim Allison
Assignee: Tim Allison


Embarrassed I didn't know about this plugin.  :D

Very, very helpful...

{noformat}
mvn versions:display-plugin-updates
mvn versions:display-dependency-updates
{noformat}





[jira] [Commented] (TIKA-2756) Switch to commons-lang 3

2018-10-17 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16653529#comment-16653529
 ] 

Hudson commented on TIKA-2756:
--

FAILURE: Integrated in Jenkins build tika-2.x-windows #334 (See 
[https://builds.apache.org/job/tika-2.x-windows/334/])
TIKA-2756 -- factor out code that relies on the old commons-lang... once 
(tallison: rev 86e997510b44f12dc9f90a68aaf583d5d3912892)
* (edit) tika-parsers/pom.xml
* (edit) tika-app/src/main/java/org/apache/tika/cli/BatchCommandLineBuilder.java
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/BouncyCastleDigestingParserTest.java
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/microsoft/JackcessOleUtil.java
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/font/FontParsersTest.java
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRConfigTest.java
* (edit) tika-bundle/pom.xml
* (edit) tika-dl/pom.xml
* (edit) 
tika-server/src/main/java/org/apache/tika/server/resource/TikaResource.java
* (edit) tika-server/pom.xml
* (edit) tika-eval/pom.xml
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/microsoft/JackcessCompoundOleUtil.java
* (edit) tika-parent/pom.xml
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/charsets/XUserDefinedCharset.java
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/wordperfect/WPInputStream.java
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/recognition/ObjectRecognitionParserTest.java
* (edit) 
tika-server/src/main/java/org/apache/tika/server/resource/UnpackerResource.java
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
* (edit) 
tika-server/src/main/java/org/apache/tika/server/TikaServerWatchDog.java


> Switch to commons-lang 3
> 
>
> Key: TIKA-2756
> URL: https://issues.apache.org/jira/browse/TIKA-2756
> Project: Tika
>  Issue Type: Improvement
>Reporter: Robert Munteanu
>Priority: Major
>
> Tika 1.9.1 is using the legacy commons-lang 2.x series. This series is not 
> going to receive updates anymore and is completely superseded by commons-lang 
> 3.x .
> Projects that use Tika are blocked from dropping commons-lang 2.x due to this 
> dependency.
> The link that I found was from tika-parsers to jackcess and then to 
> commons-lang 2.6
> {noformat}
> [INFO] +- com.healthmarketscience.jackcess:jackcess:jar:2.1.12:compile
> [INFO] |  \- commons-lang:commons-lang:jar:2.6:compile
> {noformat}
> If I understand correctly, this is the only commons-lang 2.x dependency from 
> the Tika runtime and it would be great to remove it.





tika-2.x-windows - Build # 334 - Failure

2018-10-17 Thread Apache Jenkins Server
The Apache Jenkins build system has built tika-2.x-windows (build #334)

Status: Failure

Check console output at https://builds.apache.org/job/tika-2.x-windows/334/ to 
view the results.

[jira] [Commented] (TIKA-2755) Allow Tika to skip extraction of tags in HTML

2018-10-17 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16653517#comment-16653517
 ] 

Tim Allison commented on TIKA-2755:
---

Doh. My fault, not yours.  tika-server uses the BoilerpipeContentHandler for 
the /tika endpoint.  As you observe, this handler includes the markup.

The /rmeta/text endpoint uses the ToTextContentHandler and returns the content 
without the markup.
{noformat}
curl -T TestForImageTag.html http://localhost:9998/rmeta/text
[{"Content-Encoding":"windows-1252","Content-Type":"text/html; 
charset\u003dwindows-1252","X-Parsed-By":["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.html.HtmlParser"],"X-TIKA:content":"\n\n\n\n\n\n\n\n\n\nThis
 is a test\n","X-TIKA:parse_time_millis":"9"}]
{noformat}

The downside is that you then have to parse the JSON and extract the content.
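
For readers who want to pull the text out of the /rmeta/text response without adding a JSON dependency, here is a minimal sketch using only the JDK. The class and method names (RmetaContent, extractString) are hypothetical and not part of Tika, and the scanner assumes the compact "key":"value" layout shown in the output above. In a real project a proper JSON library would be the safer choice.

```java
// Hypothetical helper, not part of Tika: extracts one string value from compact JSON.
final class RmetaContent {
    /**
     * Returns the unescaped string value stored under the given key,
     * or null if the key is not present. Assumes no whitespace around
     * the colon and that the key text does not occur inside another value.
     */
    static String extractString(String json, String key) {
        String marker = "\"" + key + "\":\"";
        int start = json.indexOf(marker);
        if (start < 0) {
            return null;
        }
        StringBuilder out = new StringBuilder();
        int i = start + marker.length();
        while (i < json.length()) {
            char c = json.charAt(i++);
            if (c == '"') {
                return out.toString(); // unescaped quote closes the value
            }
            if (c != '\\') {
                out.append(c);
                continue;
            }
            if (i >= json.length()) {
                break; // dangling backslash at end of input
            }
            char e = json.charAt(i++);
            switch (e) {
                case 'n': out.append('\n'); break;
                case 't': out.append('\t'); break;
                case 'r': out.append('\r'); break;
                case 'b': out.append('\b'); break;
                case 'f': out.append('\f'); break;
                case 'u':
                    // four hex digits follow a unicode escape
                    out.append((char) Integer.parseInt(json.substring(i, i + 4), 16));
                    i += 4;
                    break;
                default:
                    out.append(e); // escaped quote, backslash or slash
            }
        }
        throw new IllegalArgumentException("unterminated string for key " + key);
    }
}
```

With the response body in a String, RmetaContent.extractString(body, "X-TIKA:content") would return the extracted text with its newlines restored.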

Fellow devs, any idea why we use the BoilerpipeContentHandler in {{/tika}} and 
not the ToTextContentHandler?


> Allow Tika to skip extraction of  tags in HTML
> ---
>
> Key: TIKA-2755
> URL: https://issues.apache.org/jira/browse/TIKA-2755
> Project: Tika
>  Issue Type: Improvement
>  Components: server
>Affects Versions: 1.19.1
>Reporter: Harinder
>Priority: Major
> Attachments: TestForImageTag.html
>
>
> We are using Tika Server to extract text from HTML files. Tika extracts the 
> alt text of image tags present in HTML files as _[image: this is the alt text 
> of the image]_. This ends up in Solr and shows up in the results when we 
> generate document summaries at query time (via Solr’s highlight 
> functionality).
> If you PUT the attached HTML file to /tika, it will return the following 
> response
> {code:java}
> [image: Return to the homepage]
> This is a test{code}
> It would be nice to have just this instead
> {code:java}
> This is a test {code}





[jira] [Commented] (TIKA-2734) Tika addes extra characters at the end of text in extracting from excel file

2018-10-17 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16653483#comment-16653483
 ] 

Tim Allison commented on TIKA-2734:
---

The facade method of calling Tika doesn't include a ParseContext, so you have 
to go the old-fashioned way:
{noformat}
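import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.microsoft.OfficeParserConfig;
import org.apache.tika.sax.ToTextContentHandler;
import org.xml.sax.ContentHandler;

Parser p = new AutoDetectParser();

ContentHandler handler = new ToTextContentHandler();
Metadata m = new Metadata();
ParseContext parseContext = new ParseContext();
OfficeParserConfig config = new OfficeParserConfig();
config.setIncludeHeadersAndFooters(false);
parseContext.set(OfficeParserConfig.class, config);

try (TikaInputStream tis = TikaInputStream.get(bytes)) {
   // Parser.parse takes (stream, handler, metadata, context), in that order
   p.parse(tis, handler, m, parseContext);
}
String result = handler.toString();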
{noformat}

Notes:
* You don't need to detect first and do an "if" call to determine whether or 
not to add the OfficeParserConfig.  I'd add the OfficeParserConfig with each 
call.
* You can reuse the AutoDetectParser, and can use it in multiple threads
* Always, always use TikaInputStream.get() when possible.  If you are handling 
files, it is most efficient to use TikaInputStream.get(Path p, Metadata m).

> Tika addes extra characters at the end of text in extracting from excel file
> 
>
> Key: TIKA-2734
> URL: https://issues.apache.org/jira/browse/TIKA-2734
> Project: Tika
>  Issue Type: Bug
>  Components: handler
>Affects Versions: 1.18
>Reporter: feng ye
>Priority: Major
> Attachments: AIRPORTSOK.xls, extra_A_Page_P.png
>
>
> When extracting text from some relatively large Excel files (9000 rows or 
> so), I found that an extra string of " PAGE " is added to the end of the 
> resulting text when Tika.parseToString is called. Is it a known issue? Is 
> there any configuration that will let me opt out of outputting these 
> extra characters?
> I did not find a good answer on Google. 
> The input Excel spreadsheet is attached. 


