Re: Resource Sharing Tika Corpus with Any23

2018-11-30 Thread Lewis John McGibbney
Hi Tim,
Thanks for the reply. Answers inline.

On 2018/11/30 19:22:23, Tim Allison wrote:
> I think that'd be great.  Some questions:
> 
> 1) Would you use the same input docs that we're using or would you
> need/want a new TB drive for your input/output?  

The same docs, I suspect. We *could* also contribute the documents we use in
our test suite
https://github.com/apache/any23/tree/master/test-resources/src/test/resources
but this is not really necessary for us to run Any23. Any23 will only
attempt extractions on a small subset of the documents in the corpus.

> How much space will
> you need for your eval framework including outputs?

I wouldn't imagine any more than maybe 5GB of disk space in all. Any23 has the
ability to run Open Information Extraction (smart relationship extraction from
text), and this tends to generate more triples. If we decided to turn this on,
then it would probably get towards the 5GB mark. I wouldn't imagine any more
than that though, Tim.

> 2) Would you be willing to coordinate with us and PDFBox and POI
> around release times?

I think so, yes. If anything, this would be an excellent thing for Any23. I
think improved coordination and communication between the communities would be
a very positive step.

> 3) Would you be running your processing every so often (around your
> releases) or would it be constant aside from our releases? 

Most likely the former. I am aware that the service is billed to someone's
(your) card, so we would be looking to do only what is polite and acceptable.
Running prior to releases, e.g. during review of a release candidate, would be
really cool.

>  I ask
> because I'd like @Tobias Ospelt to have cycles for his fuzzing work
> when we're not getting ready for a release.
> 

That sounds fine to me. 
Thank you for the response. 


Re: 1.20?

2018-11-30 Thread loompa
Hi,
On Wed, 21 Nov 2018 at 13:00, Tim Allison wrote:

> Dave,
>   Should I try to get the Docker plugin working again?
>

That would be great. I think I may have gone down the wrong path by building
an image at package time, as there doesn't seem to be an easy way to
publish it under an Apache-labelled org on Docker Hub unless it builds from
source.

I have some time over the weekend, so I could update it to where I got to and
see what you think.

Cheers,
Dave


[jira] [Commented] (TIKA-2550) ToTextHandler includes <style> element content

2018-11-30 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16705393#comment-16705393
 ] 

Hudson commented on TIKA-2550:
--

SUCCESS: Integrated in Jenkins build Tika-trunk #1602 (See 
[https://builds.apache.org/job/Tika-trunk/1602/])
TIKA-2550 -- fix whitespace (tallison: 
[https://github.com/apache/tika/commit/4ae1a10ec3f44f5278a1b741f0ea795c3f664cb3])
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/code/SourceCodeParserTest.java
TIKA-2550 -- prevent content in style/script elements from being written 
(tallison: 
[https://github.com/apache/tika/commit/6b5dd8bbe09eb099ec75846ec02391cbd32351c4])
* (edit) tika-core/src/main/java/org/apache/tika/sax/ToTextContentHandler.java
* (edit) CHANGES.txt
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/code/SourceCodeParserTest.java


> ToTextHandler includes <style> element content
> ---
>
> Key: TIKA-2550
> URL: https://issues.apache.org/jira/browse/TIKA-2550
> Project: Tika
> Issue Type: Bug
> Reporter: Tim Allison
> Assignee: Tim Allison
> Priority: Trivial
> Fix For: 2.0.0, 1.20
>
>
> When using the ToTextHandler to process .java files, the <style> element
> content is included, e.g.:
> {noformat}
> testFile
> code {
> color: rgb(0,0,0); font-family: monospace; font-size: 12px; white-space: 
> nowrap;
> }
> .java_plain {
> color: rgb(0,0,0);
> }
> .java_keyword {
> color: rgb(0,0,0); font-weight: bold;
> }
> .java_javadoc_tag {
> color: rgb(147,147,147); background-color: rgb(247,247,247); font-style: 
> italic; font-weight: bold;
> }
> h1 {
> font-family: sans-serif; font-size: 16pt; font-weight: bold; color: 
> rgb(0,0,0); background: rgb(210,210,210); border: solid 1px black; padding: 
> 5px; text-align: center;
> }
> .java_type {
> color: rgb(0,44,221);
> }
> .java_literal {
> color: rgb(188,0,0);
> }
> .java_javadoc_comment {
> color: rgb(147,147,147); background-color: rgb(247,247,247); font-style: 
> italic;
> }
> .java_operator {
> color: rgb(0,124,31);
> }
> .java_separator {
> color: rgb(0,33,255);
> }
> .java_comment {
> color: rgb(147,147,147); background-color: rgb(247,247,247);
> }
> testFile/*
>  *  Compilation:  javac HelloWorld.java
>  *  Execution:    java HelloWorld
>  *
>  *  Prints "Hello, World". By tradition, this is everyone's first program.
>  *
>  */
> public class HelloWorld {
> public static void main(String[] args) {
> System.out.println("Hello, World");
> }
> }
> {noformat}
> Is this what we want as the default behavior?
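
The behavior described in the fix (text inside style/script elements is not
written) can be illustrated with a plain SAX handler. The following is a
minimal sketch under stated assumptions, not Tika's actual
ToTextContentHandler: the class name SkipStyleScriptTextHandler and its
extractText helper are hypothetical, and the code relies only on the JDK's
SAX API.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler for illustration only (not Tika code): it collects
// character data but ignores anything nested inside <style> or <script>.
class SkipStyleScriptTextHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int skipDepth = 0; // greater than 0 while inside style/script

    private static boolean isSkipped(String localName, String qName) {
        // Fall back to the qualified name when the parser is not namespace aware.
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        return "style".equalsIgnoreCase(name) || "script".equalsIgnoreCase(name);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (isSkipped(localName, qName)) {
            skipDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (isSkipped(localName, qName) && skipDepth > 0) {
            skipDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (skipDepth == 0) {
            text.append(ch, start, length);
        }
    }

    @Override
    public String toString() {
        return text.toString();
    }

    // Parses a well-formed XHTML string and returns the text outside style/script.
    static String extractText(String xhtml) {
        try {
            SkipStyleScriptTextHandler handler = new SkipStyleScriptTextHandler();
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }
}
```

The depth counter, rather than a boolean flag, keeps the handler correct if a
skipped element happens to contain nested markup.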



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TIKA-2550) ToTextHandler includes <style> element content

2018-11-30 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16705363#comment-16705363
 ] 

Hudson commented on TIKA-2550:
--

UNSTABLE: Integrated in Jenkins build tika-2.x-windows #355 (See 
[https://builds.apache.org/job/tika-2.x-windows/355/])
TIKA-2550 -- fix whitespace (tallison: rev 
4ae1a10ec3f44f5278a1b741f0ea795c3f664cb3)
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/code/SourceCodeParserTest.java
TIKA-2550 -- prevent content in style/script elements from being written 
(tallison: rev 6b5dd8bbe09eb099ec75846ec02391cbd32351c4)
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/code/SourceCodeParserTest.java
* (edit) CHANGES.txt
* (edit) tika-core/src/main/java/org/apache/tika/sax/ToTextContentHandler.java




[jira] [Commented] (TIKA-2550) ToTextHandler includes <style> element content

2018-11-30 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16705343#comment-16705343
 ] 

Hudson commented on TIKA-2550:
--

SUCCESS: Integrated in Jenkins build tika-branch-1x #134 (See 
[https://builds.apache.org/job/tika-branch-1x/134/])
TIKA-2550 -- prevent content from script/style elements to be written in 
(tallison: 
[https://github.com/apache/tika/commit/4d6bc01189abf40ab58c18428a01e06b076bb40a])
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/code/SourceCodeParserTest.java
* (edit) tika-core/src/main/java/org/apache/tika/sax/ToTextContentHandler.java
* (edit) CHANGES.txt




[jira] [Resolved] (TIKA-2550) ToTextHandler includes <style> element content

2018-11-30 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2550.
---
   Resolution: Fixed
     Assignee: Tim Allison
Fix Version/s: 1.20
               2.0.0



[jira] [Commented] (TIKA-2776) Tika server child restart

2018-11-30 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16705283#comment-16705283
 ] 

Hudson commented on TIKA-2776:
--

SUCCESS: Integrated in Jenkins build tika-branch-1x #133 (See 
[https://builds.apache.org/job/tika-branch-1x/133/])
TIKA-2776 -- improve documentation for -maxFiles (tallison: 
[https://github.com/apache/tika/commit/4141411773a321fe614167584d23e376c4dbcb3c])
* (edit) tika-server/src/main/java/org/apache/tika/server/TikaServerCli.java


> Tika server child restart
> -
>
> Key: TIKA-2776
> URL: https://issues.apache.org/jira/browse/TIKA-2776
> Project: Tika
> Issue Type: Bug
> Reporter: Mario Bisonti
> Assignee: Tim Allison
> Priority: Blocker
> Fix For: 2.0.0, 1.20
>
> Attachments: Log.zip, MCF_JOB.png, log4j.xml, log4j_child.xml, 
> log4j_child.xml, man_tika.zip, tikalogchild.log
>
>
> Hello.
> I run tika-server standalone, started with the option:
> java -jar /opt/tika/tika-server-1.19.1.jar -spawnChild
> I use ManifoldCF and Solr to index files through tika-server.
> Indexing keeps crashing because I get many errors like:
> Tika down, retrying: Connection reset
> etc.
> I suspect that, when the child process is restarted, the client crashes, as
> mentioned here:
> _If the child process is in the process of shutting down, and it gets a new 
> request it will return 503 -- Service Unavailable. If the server times out on 
> a file, the client will receive an IOException from the closed socket. Note 
> that all other files that are being processed will end with an IOException 
> from a closed socket when the child process shuts down; e.g. if you send 
> three files to tika-server concurrently, and one of them causes a 
> catastrophic problem requiring the child to shut down, you won't be able to 
> tell which file caused the problems. In the future, we may implement a 
> gentler shutdown than we currently have._
> as reported here: https://wiki.apache.org/tika/TikaJAXRS
> How can I work around it?
> Thanks a lot
> Mario





[jira] [Commented] (TIKA-2776) Tika server child restart

2018-11-30 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16705281#comment-16705281
 ] 

Hudson commented on TIKA-2776:
--

SUCCESS: Integrated in Jenkins build Tika-trunk #1601 (See 
[https://builds.apache.org/job/Tika-trunk/1601/])
TIKA-2776 -- improve documentation for -maxFiles (tallison: 
[https://github.com/apache/tika/commit/a477d73ac56c169075b5c9ea66bf57be1f3dc672])
* (edit) tika-server/src/main/java/org/apache/tika/server/TikaServerCli.java




[jira] [Commented] (TIKA-2776) Tika server child restart

2018-11-30 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16705251#comment-16705251
 ] 

Hudson commented on TIKA-2776:
--

UNSTABLE: Integrated in Jenkins build tika-2.x-windows #354 (See 
[https://builds.apache.org/job/tika-2.x-windows/354/])
TIKA-2776 -- improve documentation for -maxFiles (tallison: rev 
a477d73ac56c169075b5c9ea66bf57be1f3dc672)
* (edit) tika-server/src/main/java/org/apache/tika/server/TikaServerCli.java




[jira] [Commented] (TIKA-2776) Tika server child restart

2018-11-30 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16705208#comment-16705208
 ] 

Tim Allison commented on TIKA-2776:
---

This caught me by surprise...I thought that I had left the default as -1 (no 
max), but I clearly set it to 100,000.

I _think_ my reasoning was that this is good jvm hygiene given the craziness 
some of our parsers can do to a jvm.  I readily admit that I don't have good 
data to support my decision aside from the reasoning that "we've had memory 
leaks before because of caching; we'll have them again."  I'd be willing to 
bump the default to a higher value, but I wouldn't want to turn it off.

You can avoid the restarts caused by HIT_MAX by setting -maxFiles to -1 on the
command line.

I'll update the documentation on the wiki and in the code.  Thank you!
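
Independently of the -maxFiles setting, a client can be made more tolerant of
child restarts by retrying requests that fail with an IOException, which is
what the wiki text quoted in this issue says a client sees when the child
shuts down. The following is a rough sketch under stated assumptions: the
RetryingCaller class and callWithRetry method are hypothetical names (not part
of Tika or ManifoldCF), the request is assumed to be safe to resend, and a
caller that wants to retry on a 503 response would need to turn that status
into an IOException inside the request itself.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.concurrent.Callable;

// Hypothetical helper for illustration only; not part of Tika or ManifoldCF.
class RetryingCaller {

    // Runs the request, retrying on IOException with a doubling delay.
    static <T> T callWithRetry(Callable<T> request, int maxAttempts, long backoffMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        long delay = backoffMillis;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return request.call();
            } catch (IOException e) {
                last = e; // e.g. "Connection reset" while the child restarts
            } catch (Exception e) {
                throw new IllegalStateException("non-retryable failure", e);
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delay);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    break; // stop retrying if the caller was interrupted
                }
                delay *= 2;
            }
        }
        throw new UncheckedIOException("gave up after " + maxAttempts + " attempts", last);
    }
}
```

Because concurrent requests all fail when the child shuts down, a retry like
this cannot tell which file caused a crash; it only covers the restart window.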

  



Re: Resource Sharing Tika Corpus with Any23

2018-11-30 Thread Tim Allison
I think that'd be great.  Some questions:

1) Would you use the same input docs that we're using or would you
need/want a new TB drive for your input/output?  How much space will
you need for your eval framework including outputs?
2) Would you be willing to coordinate with us and PDFBox and POI
around release times?
3) Would you be running your processing every so often (around your
releases) or would it be constant aside from our releases?  I ask
because I'd like @Tobias Ospelt to have cycles for his fuzzing work
when we're not getting ready for a release.

Onward!

Cheers,

   Tim
On Fri, Nov 30, 2018 at 2:08 PM Lewis John Mcgibbney wrote:
>
> Hi dev@tika,
> Over at Any23 we have been discussing the prospect of running large scale
> jobs over a significant, challenging dataset, same as is done with Tika via
> Tika batch on the VM.
> Is there any possibility that a very small number of us from the Any23 team
> could access the VM and the dataset(s)? If the answer is yes, we will move
> ahead with building a test suite.
> Thank you for your consideration dev@tika,
> Lewis
>
> --
>
> *Lewis*
> Dr. Lewis J. McGibbney Ph.D, B.Sc
> *Skype*: lewis.john.mcgibbney


Resource Sharing Tika Corpus with Any23

2018-11-30 Thread Lewis John Mcgibbney
Hi dev@tika,
Over at Any23 we have been discussing the prospect of running large scale
jobs over a significant, challenging dataset, same as is done with Tika via
Tika batch on the VM.
Is there any possibility that a very small number of us from the Any23 team
could access the VM and the dataset(s)? If the answer is yes, we will move
ahead with building a test suite.
Thank you for your consideration dev@tika,
Lewis

-- 

*Lewis*
Dr. Lewis J. McGibbney Ph.D, B.Sc
*Skype*: lewis.john.mcgibbney


[jira] [Commented] (TIKA-2727) Parsing and detect mime type of XML file stuck in infinite loop

2018-11-30 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16705058#comment-16705058
 ] 

Tim Allison commented on TIKA-2727:
---

CVE-2018-11796

http://tika.apache.org/security.html

> Parsing and detect mime type of XML file stuck in infinite loop
> ---
>
> Key: TIKA-2727
> URL: https://issues.apache.org/jira/browse/TIKA-2727
> Project: Tika
> Issue Type: Bug
> Components: detector, parser
> Affects Versions: 1.17
> Reporter: Slava G
> Assignee: Tim Allison
> Priority: Blocker
> Fix For: 2.0.0, 1.19.1
>
> Attachments: 1_6e4b115e-7d2d-45f1-a842-35b5ad7ba559, 
> 1_e3e13f0e-7085-4000-a558-5d255ed7a944.xml
>
>
> Hi,
> I'm trying to parse (or even just detect the MIME type of) an XML file that
> is not large but is somewhat tricky, and my process hangs at:
> XMLStringBuffer.append(char[], int, int) line: not available
> XMLStringBuffer.append(XMLString) line: not available
> XMLNSDocumentScannerImpl(XMLScanner).scanAttributeValue(XMLString, XMLString, String, boolean, String) line: not available
> XMLNSDocumentScannerImpl.scanAttribute(XMLAttributesImpl) line: not available
> XMLNSDocumentScannerImpl.scanStartElement() line: not available
> XMLNSDocumentScannerImpl$NSContentDispatcher.scanRootElementHook() line: not available
> XMLNSDocumentScannerImpl$NSContentDispatcher(XMLDocumentFragmentScannerImpl$FragmentContentDispatcher).dispatch(boolean) line: not available
> XMLNSDocumentScannerImpl(XMLDocumentFragmentScannerImpl).scanDocument(boolean) line: not available
> XIncludeAwareParserConfiguration(XML11Configuration).parse(boolean) line: not available
> XIncludeAwareParserConfiguration(XML11Configuration).parse(XMLInputSource) line: not available
> SAXParserImpl$JAXPSAXParser(XMLParser).parse(XMLInputSource) line: not available
> SAXParserImpl$JAXPSAXParser(AbstractSAXParser).parse(InputSource) line: not available
> SAXParserImpl$JAXPSAXParser.parse(InputSource) line: not available
> SAXParserImpl.parse(InputSource, DefaultHandler) line: not available
> SAXParserImpl(SAXParser).parse(InputStream, DefaultHandler) line: 195
> XmlRootExtractor.extractRootElement(InputStream) line: 62
> XmlRootExtractor.extractRootElement(byte[]) line: 42
> MimeTypes.getMimeType(byte[]) line: 212
> MimeTypes.detect(InputStream, Metadata) line: 494
> DefaultDetector(CompositeDetector).detect(InputStream, Metadata) line: 84
>
> Please see attached XML file.
> Please advise.
> Thanks
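
On a version without the fix, one way a caller might limit how long it waits
on a hanging detection or parse call is to run it with a timeout. The
following is a sketch under stated assumptions, using only
java.util.concurrent: TimedCall and callWithTimeout are hypothetical names and
not part of Tika. Note that cancel(true) only interrupts the worker; a thread
spinning in a loop that ignores interrupts keeps running, so this bounds the
caller's wait and does not reclaim the CPU. Upgrading to a fixed release
(1.19.1 per the Fix For field above) remains the actual remedy.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper for illustration only; not part of Tika.
class TimedCall {

    // Runs the task on a daemon thread and returns the fallback on timeout.
    static <T> T callWithTimeout(Callable<T> task, long timeoutMillis, T fallback) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "timed-call");
            t.setDaemon(true); // a stuck worker must not keep the JVM alive
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // best effort: interrupts the worker only
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            throw new IllegalStateException("task failed", e.getCause());
        } finally {
            executor.shutdownNow();
        }
    }
}
```

A caller could wrap a detection call in the task and pass a generic type such
as "application/octet-stream" as the fallback.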





[jira] [Commented] (TIKA-2727) Parsing and detect mime type of XML file stuck in infinite loop

2018-11-30 Thread David Dillard (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16705055#comment-16705055
 ] 

David Dillard commented on TIKA-2727:
-

Any plans to get a CVE for this issue?  A hang sounds like a Denial of Service 
to me.


JDK 12 build 22 is now available at : - jdk.java.net/12/

2018-11-30 Thread Rory O'Donnell

Hi Tim,

*NOTE:* According to the JDK 12 schedule, rampdown phase 1 of the release is
coming up in a few weeks, on Dec. 13, 2018.

*JDK 12 Early Access build 22 is now available at jdk.java.net/12/*

 * Release Note updates since last email:
     o Build 21 - Deprecating the default keytool -keyalg value (JDK-8212003)
     o Build 21 - Change to X25519 and X448 encoded private key format
       (JDK-8213363)
     o Build 20 - New command-line flag for more extensive error reporting
       in crash logs (JDK-8211845)
     o Build 20 - Initial Value of user.timezone System Property Changed
       (JDK-8185496)

 * JEPs proposed for JDK 12:
     o JEP 189: Shenandoah: A Low-Pause-Time Garbage Collector (Experimental)
     o JEP 334: JVM Constants API
     o JEP 344: Abortable Mixed Collections for G1
     o JEP 346: Promptly Return Unused Committed Memory from G1

 * JEPs targeted to JDK 12, so far:
     o JEP 230: Microbenchmark Suite
     o JEP 325: Switch Expressions (Preview)
     o JEP 326: Raw String Literals (Preview)
     o JEP 340: One AArch64 Port, Not Two
     o JEP 341: Default CDS Archives

Rgds, Rory

--
Rgds, Rory O'Donnell
Quality Engineering Manager
Oracle EMEA , Dublin, Ireland



[jira] [Commented] (TIKA-2776) Tika server child restart

2018-11-30 Thread Mario Bisonti (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16704474#comment-16704474
 ] 

Mario Bisonti commented on TIKA-2776:
-

Hello Tim.

I observed restarts of the child process:

2018-11-30 01:21:01 INFO TikaServerWatchDog:104 - About to restart the child process
2018-11-30 01:21:02 INFO TikaServerWatchDog:106 - Successfully restarted child process -- 11 restarts so far)
2018-11-30 05:39:09 WARN TikaServerWatchDog:253 - Received status from child: HIT_MAX
2018-11-30 05:39:10 INFO TikaServerWatchDog:104 - About to restart the child process
2018-11-30 05:39:12 INFO TikaServerWatchDog:106 - Successfully restarted child process -- 13 restarts so far)
2018-11-30 08:38:03 WARN TikaServerWatchDog:253 - Received status from child: HIT_MAX
2018-11-30 08:38:03 INFO TikaServerWatchDog:104 - About to restart the child process
2018-11-30 08:38:04 INFO TikaServerWatchDog:106 - Successfully restarted child process -- 15 restarts so far)

Is this related to the parameter:

_{{-maxFiles}}: restart the child process after it has processed {{maxFiles}}.
If there is a slow building memory leak, this restart of the JVM should help._

I didn't set the parameter.

What is the default value of maxFiles?

Thanks

 

Mario

 
