[jira] [Commented] (TIKA-3235) Build failure caused by timeouts in XMLReaderUtils

2020-11-24 Thread Kenneth William Krugler (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17238388#comment-17238388
 ] 

Kenneth William Krugler commented on TIKA-3235:
---

It still happens on occasion, but not every time. I'd say let's not block the 
release for this, but I think it would be useful to dump out more debugging 
info when a parser request times out, to figure out why none are available.
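
For illustration only, a minimal sketch of what such timeout diagnostics could look like. The class DiagnosticParserPool and its methods are hypothetical and are not Tika's XMLReaderUtils; the idea is just to record which thread borrowed each parser so that the timeout message can say why none are available.

```java
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;

// Hypothetical sketch, not Tika's XMLReaderUtils: a fixed-size parser pool
// that reports who holds the parsers when a request times out.
class DiagnosticParserPool {
    private final int size;
    private final BlockingQueue<SAXParser> idle;
    // Borrowed parser -> "thread name @ borrow time" of the current holder.
    private final Map<SAXParser, String> holders = new ConcurrentHashMap<>();

    DiagnosticParserPool(int size) {
        this.size = size;
        this.idle = new ArrayBlockingQueue<>(size);
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            for (int i = 0; i < size; i++) {
                idle.add(factory.newSAXParser());
            }
        } catch (ParserConfigurationException | SAXException e) {
            throw new IllegalStateException("Could not create SAXParsers", e);
        }
    }

    SAXParser acquire(long timeout, TimeUnit unit) {
        SAXParser parser;
        try {
            parser = idle.poll(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for a SAXParser", e);
        }
        if (parser == null) {
            // The extra debugging info: pool size, idle count, current holders.
            throw new IllegalStateException("Waited " + timeout + " " + unit
                    + " for a SAXParser; pool size=" + size
                    + ", idle=" + idle.size()
                    + ", held by=" + holders.values());
        }
        holders.put(parser, Thread.currentThread().getName()
                + " @ " + System.currentTimeMillis());
        return parser;
    }

    void release(SAXParser parser) {
        // Ignore parsers that were not borrowed from this pool.
        if (holders.remove(parser) != null) {
            idle.offer(parser);
        }
    }
}
```

With something along these lines, a timeout in a test would name the threads that never released their parser, which is the piece of information missing from the current message.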

> Build failure caused by timeouts in XMLReaderUtils
> --
>
> Key: TIKA-3235
> URL: https://issues.apache.org/jira/browse/TIKA-3235
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
> [~kkrugler] was not able to build 1.25-rc1 because of timeouts from 
> XMLReaderUtils.  Let's use this issue to figure out what's going wrong.
> {noformat}
> >> org.apache.tika.parser.microsoft.ooxml.OOXMLParserTest.testUnsupportedPowerPoint(OOXMLParserTest.java:341)
> >> Caused by: org.xml.sax.SAXException: Waited more than 5 minutes for a
> >> SAXParser; This could indicate that a parser has not correctly released its
> >> SAXParser. Please report this to the Tika team: dev@tika.apache.org
> >> 
> >>at
> >> org.apache.tika.parser.microsoft.ooxml.OOXMLParserTest.testUnsupportedPowerPoint(OOXMLParserTest.java:341)
> >> Caused by: org.apache.tika.exception.TikaException: Waited more than 5
> >> minutes for a SAXParser; This could indicate that a parser has not
> >> correctly released its SAXParser. Please report this to the Tika team:
> >> dev@tika.apache.org 
> >>at
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [EXTERNAL] Tika - Issues extracting Arabic script

2020-11-24 Thread Tim Allison
Cc’ing PDFBox

On Tue, Nov 24, 2020 at 1:18 PM Chris Mattmann  wrote:

> Christian, thank you for reaching out. I am copying dev@tika.apache.org, as
> I think your question is best directed there, since tika-python is
> downstream of the processing that happens there.
>
> Best of luck!
>
> Cheers,
> Chris
>
> From: Christian Faggionato 
> Date: Tuesday, November 24, 2020 at 10:10 AM
> To: "Mattmann, Chris A (US 1740)" 
> Subject: [EXTERNAL] Tika - Issues extracting Arabic script
>
>
>
> Dear Chris,
>
> I am Christian Faggionato, a research fellow at the School of Oriental and
> African Studies, University of London. At the moment I’m working on
> building a corpus of Uyghur texts, and some of the content comes from
> PDF files. I wrote a short Python script to scrape text from PDFs using
> tika-python. The writing system is Arabic script, and the output looks
> good, but there is one major problem: many spaces between words are
> missing, and I really do not know how to address this issue. Do you have
> any suggestions in this regard?
>
> I am attaching a PDF file and the script I wrote in case you would like to
> check them. Thanks in advance for your help,
>
> Best
>
> Christian.
>
> --
> PhD, Post-Doctoral Fellow
> Department of Religions and Philosophies
> Room 339
> SOAS University of London
> Thornhaugh Street
> London, WC1H 0XG
> c...@soas.ac.uk
>


Re: [EXTERNAL] Tika - Issues extracting Arabic script

2020-11-24 Thread Chris Mattmann
Christian, thank you for reaching out. I am copying dev@tika.apache.org, as
I think your question is best directed there, since tika-python is downstream
of the processing that happens there.

Best of luck!

Cheers,
Chris

From: Christian Faggionato 
Date: Tuesday, November 24, 2020 at 10:10 AM
To: "Mattmann, Chris A (US 1740)" 
Subject: [EXTERNAL] Tika - Issues extracting Arabic script

 

Dear Chris, 

I am Christian Faggionato, a research fellow at the School of Oriental and
African Studies, University of London. At the moment I’m working on building a
corpus of Uyghur texts, and some of the content comes from PDF files. I
wrote a short Python script to scrape text from PDFs using tika-python. The
writing system is Arabic script, and the output looks good, but there is one
major problem: many spaces between words are missing, and I really do not know
how to address this issue. Do you have any suggestions in this regard?

I am attaching a PDF file and the script I wrote in case you would like to
check them. Thanks in advance for your help,

Best

Christian.

-- 
PhD, Post-Doctoral Fellow
Department of Religions and Philosophies
Room 339
SOAS University of London
Thornhaugh Street
London, WC1H 0XG
c...@soas.ac.uk



[jira] [Commented] (TIKA-3235) Build failure caused by timeouts in XMLReaderUtils

2020-11-24 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17238233#comment-17238233
 ] 

Tim Allison commented on TIKA-3235:
---

The other bit of unhappiness I have around this part of the code base is that 
some folks will get different parses depending on their environment and which 
XML parsers are default/selected.  Part of me wants to lock down the XML parser 
to xerces2 or anything that we can rely on and is performant and secure.  This 
would improve consistency quite a bit.
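
As a rough sketch of what "locking down" could mean in JAXP terms: instead of letting the environment choose, the factory is requested by class name through the standard SAXParserFactory.newInstance(String, ClassLoader) call. The helper class PinnedSaxFactory is hypothetical. So that the example runs without extra jars, it pins the JDK's built-in Xerces-derived factory, which is an assumption made for the demo; with Xerces2 on the classpath the name would be org.apache.xerces.jaxp.SAXParserFactoryImpl.

```java
import javax.xml.XMLConstants;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of Tika: always returns the same factory
// implementation regardless of system properties or classpath providers.
class PinnedSaxFactory {
    // Assumption for this sketch: the JDK's internal Xerces-derived factory.
    // For Xerces2 itself: "org.apache.xerces.jaxp.SAXParserFactoryImpl".
    static final String JDK_INTERNAL_FACTORY =
            "com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl";

    static SAXParserFactory newFactory(String factoryClassName) {
        try {
            // Load exactly the named implementation; no provider lookup.
            SAXParserFactory factory = SAXParserFactory.newInstance(
                    factoryClassName, PinnedSaxFactory.class.getClassLoader());
            factory.setNamespaceAware(true);
            // Standard JAXP switch for secure processing limits.
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            return factory;
        } catch (ParserConfigurationException | SAXException e) {
            throw new IllegalStateException(
                    "Could not configure " + factoryClassName, e);
        }
    }
}
```

If the named class is missing, newInstance fails with a FactoryConfigurationError rather than silently falling back to another parser, which is the behaviour one would want for consistency.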



[jira] [Commented] (TIKA-3235) Build failure caused by timeouts in XMLReaderUtils

2020-11-24 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17238232#comment-17238232
 ] 

Tim Allison commented on TIKA-3235:
---

I'm really unhappy with the hideous workaround we had to do to handle a 
SAXParser that reset on the first call but then not after that...



[jira] [Comment Edited] (TIKA-3235) Build failure caused by timeouts in XMLReaderUtils

2020-11-24 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17238231#comment-17238231
 ] 

Tim Allison edited comment on TIKA-3235 at 11/24/20, 3:56 PM:
--

The impetus was Nutch/Sebastian Nagel (TIKA-2645/NUTCH-2578), who diagnosed a 
substantial thread lock issue when creating new SAXParsers (outside of Tika's 
code).  I'd be happy to get rid of the complexity we have in Tika, but I don't 
want users to have to add their own providers to avoid this surprising thread 
lock issue.

If we can simplify our code, I'm all for it.


was (Author: talli...@mitre.org):
The impetus was Nutch/Sebastian Nagel (TIKA-2645/NUTCH-2578) who diagnosed a 
substantial thread lock issue when creating new SAXParsers (outside of Tika's 
code).  I'd be happy to get rid of the complexity we have in Tika, but I don't 
want users to have add there own providers to avoid this surprising threading 
issue.

If we can simplify our code, I'm all for it.



[jira] [Updated] (TIKA-3222) TIKA generates not well formed structured text result for ODP files

2020-11-24 Thread Andreas Hirtzel (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Hirtzel updated TIKA-3222:
--
Summary: TIKA generates not well formed structured text result for ODP 
files  (was: TIKA generated not well formed structured text result for ODP 
files)

> TIKA generates not well formed structured text result for ODP files
> ---
>
> Key: TIKA-3222
> URL: https://issues.apache.org/jira/browse/TIKA-3222
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.24.1
> Reporter: Andreas Hirtzel
> Priority: Critical
> Attachments: Test.odp
>
>
> Tika generates content that is not well-formed in the body tag of the XHTML 
> output for ODP files. I already checked the content files inside the ODP 
> file; they all look good.
> The effect is very simple to reproduce: you just need to create a new, 
> simple ODP file. I have attached a sample file to the issue.





[jira] [Commented] (TIKA-3235) Build failure caused by timeouts in XMLReaderUtils

2020-11-24 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17238231#comment-17238231
 ] 

Tim Allison commented on TIKA-3235:
---

The impetus was Nutch/Sebastian Nagel (TIKA-2645/NUTCH-2578), who diagnosed a 
substantial thread lock issue when creating new SAXParsers (outside of Tika's 
code).  I'd be happy to get rid of the complexity we have in Tika, but I don't 
want users to have to add their own providers to avoid this surprising 
threading issue.

If we can simplify our code, I'm all for it.



[jira] [Commented] (TIKA-3235) Build failure caused by timeouts in XMLReaderUtils

2020-11-24 Thread Kenneth William Krugler (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17238215#comment-17238215
 ] 

Kenneth William Krugler commented on TIKA-3235:
---

I took a quick look at the XMLReaderUtils.java code. I'm not much of a fan of 
static class variables, since they often seem to lead to problems, but I didn't 
see anything that looked like an obvious problem.

I'm curious though why we wouldn't implement this via a "SAX parser provider" 
class that's part of the Tika context, where you could provide a caching 
implementation or something very simple.
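
To make the suggestion concrete, here is a minimal sketch of such a provider with both a very simple and a caching implementation. All names (SaxParserProvider, SimpleSaxParserProvider, CachingSaxParserProvider) are hypothetical and do not exist in Tika; how the provider would be registered in the Tika context is left out.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;

// Hypothetical interface: callers borrow a parser and hand it back.
interface SaxParserProvider {
    SAXParser acquire();
    void release(SAXParser parser);
}

// "Something very simple": a new parser per request, nothing to return.
class SimpleSaxParserProvider implements SaxParserProvider {
    private final SAXParserFactory factory = SAXParserFactory.newInstance();

    @Override
    public SAXParser acquire() {
        try {
            // Factories are not guaranteed to be thread-safe, so serialize access.
            synchronized (factory) {
                return factory.newSAXParser();
            }
        } catch (ParserConfigurationException | SAXException e) {
            throw new IllegalStateException("Could not create a SAXParser", e);
        }
    }

    @Override
    public void release(SAXParser parser) {
        // Nothing to do: the parser is simply dropped.
    }
}

// Caching variant: released parsers are reset and reused, and a new one is
// created only when the cache is empty, so acquire() never blocks.
class CachingSaxParserProvider implements SaxParserProvider {
    private final SaxParserProvider delegate = new SimpleSaxParserProvider();
    private final Queue<SAXParser> cache = new ConcurrentLinkedQueue<>();

    @Override
    public SAXParser acquire() {
        SAXParser cached = cache.poll();
        return cached != null ? cached : delegate.acquire();
    }

    @Override
    public void release(SAXParser parser) {
        // reset() is optional in JAXP; the JDK's built-in parser supports it.
        parser.reset();
        cache.offer(parser);
    }
}
```

Because the caching variant falls back to creating a parser instead of waiting, a caller that forgets to release costs some reuse but cannot cause a five-minute timeout.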



[jira] [Commented] (TIKA-3235) Build failure caused by timeouts in XMLReaderUtils

2020-11-24 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17238115#comment-17238115
 ] 

Tim Allison commented on TIKA-3235:
---

Let me know if there's further work on this before I roll 1.25-rc2.



[jira] [Created] (TIKA-3236) Upgrade cxf-core to 3.3.8

2020-11-24 Thread Jira
Jesper Håsteen created TIKA-3236:


             Summary: Upgrade cxf-core to 3.3.8
                 Key: TIKA-3236
                 URL: https://issues.apache.org/jira/browse/TIKA-3236
             Project: Tika
          Issue Type: Task
          Components: parser
    Affects Versions: 1.24.1
            Reporter: Jesper Håsteen


cxf-core has a known vulnerability in 3.3.6.

https://nvd.nist.gov/vuln/detail/CVE-2020-13954


