date:20150401


 [ 
https://issues.apache.org/jira/browse/TIKA-1590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Gribov updated TIKA-1590:

Fix Version/s: 1.8

 A particular PDF seems to trigger an infinite loop when being converted to 
 HTML
 ---

 Key: TIKA-1590
 URL: https://issues.apache.org/jira/browse/TIKA-1590
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.6, 1.7
Reporter: Matt Sheppard
 Fix For: 1.8

 Attachments: National_Audit_tool_CTH_Audit_Report_PDF,_292_KB.pdf, 
 jstack.txt


 The PDF at 
 http://www.comcare.gov.au/__data/assets/pdf_file/0019/117244/National_Audit_tool_CTH_Audit_Report_PDF,_292_KB.pdf
  (which I'll also attach) appears to trigger an infinite loop (or at least is 
 exceedingly slow) when being filtered by Tika.
 {noformat}
 java -jar tika-app-1.7.jar National_Audit_tool_CTH_Audit_Report_PDF,_292_KB.pdf
 <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
 <head>
 <meta name="date" content="2015-02-05T04:48:30Z"/>
 <meta name="pdf:PDFVersion" content="1.6"/>
 <meta name="xmp:CreatorTool" content="Adobe InDesign CC 2014 (Macintosh)"/>
 <meta name="dc:description" content="Licensee Improvement"/>
 <meta name="Keywords" content="Licensee, Improvement, Program, NAT, CTH, Report&#13;&#10;"/>
 <meta name="subject" content="Licensee Improvement"/>
 <meta name="dc:creator" content="Comcare"/>
 <meta name="description" content="Licensee Improvement"/>
 <meta name="dcterms:created" content="2014-10-07T02:46:10Z"/>
 <meta name="Last-Modified" content="2015-02-05T04:48:30Z"/>
 <meta name="dcterms:modified" content="2015-02-05T04:48:30Z"/>
 <meta name="dc:format" content="application/pdf; version=1.6"/>
 <meta name="Last-Save-Date" content="2015-02-05T04:48:30Z"/>
 <meta name="meta:save-date" content="2015-02-05T04:48:30Z"/>
 <meta name="pdf:encrypted" content="false"/>
 <meta name="dc:title" content="Licensee Improvement Program NAT (CTH) audit report"/>
 <meta name="modified" content="2015-02-05T04:48:30Z"/>
 <meta name="cp:subject" content="Licensee Improvement"/>
 <meta name="Content-Length" content="299338"/>
 <meta name="Content-Type" content="application/pdf"/>
 <meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
 <meta name="X-Parsed-By" content="org.apache.tika.parser.pdf.PDFParser"/>
 <meta name="creator" content="Comcare"/>
 <meta name="meta:author" content="Comcare"/>
 <meta name="dc:subject" content="Licensee, Improvement, Program, NAT, CTH, Report&#13;&#10;"/>
 <meta name="trapped" content="False"/>
 <meta name="meta:creation-date" content="2014-10-07T02:46:10Z"/>
 <meta name="created" content="Tue Oct 07 13:46:10 AEDT 2014"/>
 <meta name="xmpTPg:NPages" content="72"/>
 <meta name="Creation-Date" content="2014-10-07T02:46:10Z"/>
 <meta name="resourceName" content="National_Audit_tool_CTH_Audit_Report_PDF,_292_KB.pdf"/>
 <meta name="meta:keyword" content="Licensee, Improvement, Program, NAT, CTH, Report&#13;&#10;"/>
 <meta name="Author" content="Comcare"/>
 <meta name="producer" content="Adobe PDF Library 11.0"/>
 <title>Licensee Improvement Program NAT (CTH) audit report</title>
 </head>
 <body><div class="page"><p/>
 <p>LICENSEE
 IMPROVEMENT
 PROGRAM
 [snip]
 </p>
 <p>Finding:
 </p>
 <p>Evidence:
 </p>
 <p>Comment:
 </p>
 <p>Observation:
 </p>
 <p>Non-conformance:
 </p>
 <p>
 [just appears to hang forever at this point]
 {noformat}
 The relevant thread's stack is something like...
 {noformat}
 main #1 prio=5 os_prio=31 tid=0x7fbd6900b000 nid=0xf07 runnable [0x00010fc18000]
    java.lang.Thread.State: RUNNABLE
    at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:184)
    at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
    at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
    at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
    at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
    at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
    at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
    at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
    at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
    at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
    at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
    at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
    at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
    at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:179)
    at org.apache.pdfbox.pdmodel.interactive.form.PDFieldFactory.isButton(PDFieldFactory.java:157)
   at

[jira] [Comment Edited] (TIKA-1590) A particular PDF seems to trigger an infinite loop when being converted to HTML


[ 
https://issues.apache.org/jira/browse/TIKA-1590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390170#comment-14390170
 ] 

Konstantin Gribov edited comment on TIKA-1590 at 4/1/15 7:53 AM:
-

Fixed in trunk by updating PDFBox to 1.8.9. See also TIKA-1575 and PDFBOX-2261.


was (Author: grossws):
Fixed in trunk by updating PDFBox to 1.8.9. See also TIKA-1575 and PDFBOX-2710.


[jira] [Commented] (TIKA-1590) A particular PDF seems to trigger an infinite loop when being converted to HTML


[ 
https://issues.apache.org/jira/browse/TIKA-1590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390177#comment-14390177
 ] 

Konstantin Gribov commented on TIKA-1590:
-

Thank you for the feedback, Matt. 

I think it's the same problem as in PDFBOX-2261. It's currently fixed in 
trunk.


Re: Access Control Allow Origin

2015-04-01 Thread Sergey Beryozkin


Hi Tyler

Sorry for the delay, I was off for the last few days.
The change you made looks fine; the filter can check the annotations or 
can be configured directly (which is what you did).
It might make sense to consider checking a (Java) properties resource as 
a possible future enhancement, as a CORS filter may have many properties.
Maybe if '-cors' is provided, check a well-known class resource 
where all of the CORS properties are set; if it is absent, default to 
'*', otherwise work with Properties...
The current approach works too; it might be tricky to extend it to support 
more properties, but it's great for a start.


Thanks, Sergey
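
A minimal sketch of the properties-resource idea above, not the patch discussed in this thread. The resource name (/tika-cors.properties), the key (cors.allowOrigin) and the class name are assumptions made up for illustration; none of them comes from tika-server. If the resource is absent, the origin defaults to '*', otherwise the loaded Properties are used as they are.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical helper for illustration only; not part of tika-server. */
final class CorsSettings {
    // Hypothetical resource and key names, chosen only for this sketch.
    static final String DEFAULT_RESOURCE = "/tika-cors.properties";
    static final String ALLOW_ORIGIN_KEY = "cors.allowOrigin";

    private CorsSettings() {}

    /**
     * Loads CORS properties from a classpath resource.
     * If the resource is absent, returns properties with the origin set to "*".
     */
    static Properties load(String resourceName) {
        Properties props = new Properties();
        // try-with-resources skips close() when the resource is null
        try (InputStream in = CorsSettings.class.getResourceAsStream(resourceName)) {
            if (in == null) {
                props.setProperty(ALLOW_ORIGIN_KEY, "*");
            } else {
                props.load(in);
            }
        } catch (IOException e) {
            throw new UncheckedIOException("Could not read " + resourceName, e);
        }
        return props;
    }
}
```

A caller would do something like CorsSettings.load(CorsSettings.DEFAULT_RESOURCE) once at startup and hand the result to whatever configures the filter.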




On 27/03/15 18:56, Tyler Palsulich wrote:

Thank you, Sergey! I didn't know about that feature. I am going to try to
work up a patch this weekend which enables CORS. I'll let you know if I run
into any issues.

Thanks again,
Tyler

On Thu, Mar 26, 2015 at 2:39 AM, Mattmann, Chris A (3980) 
chris.a.mattm...@jpl.nasa.gov wrote:




++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: Tyler Palsulich tpalsul...@gmail.com
Reply-To: dev@tika.apache.org dev@tika.apache.org
Date: Tuesday, March 24, 2015 at 3:41 PM
To: dev@tika.apache.org dev@tika.apache.org
Subject: Access Control Allow Origin


Hi Folks,

I took a stab at creating an example website to submit a file to the form
resource of our VM. See http://tpalsulich.github.io/TikaExamples/.

If I try to use AJAX to submit the request to make the page prettier (see
the script in the head of the page, with ev.preventDefault() commented
out), I get the following error:

XMLHttpRequest cannot load http://162.242.228.174:9998/tika/form. No
'Access-Control-Allow-Origin' header is present on the requested resource.
Origin 'http://tpalsulich.github.io' is therefore not allowed access. The
response had HTTP status code 400.

We can't allow the tika-server response header to accept * in general,
since that isn't secure. So, would there be interest in including this
sort of site on the VM? Then, the AJAX request won't be external and we
won't have this error.

The version button just takes you to the version resource on the VM
(doesn't do anything with the file).

Tyler
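
One alternative to answering every request with '*' is to echo the request's Origin back only when it is on an explicit allow-list. The sketch below shows just that decision. The class and method names are hypothetical and this is not how tika-server is implemented; wiring the result into a response header is left to whatever filter or server is in use.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;

/** Hypothetical allow-list check for illustration only; not part of tika-server. */
final class OriginPolicy {
    private final Set<String> allowedOrigins;

    OriginPolicy(Set<String> allowedOrigins) {
        // Defensive copy so later changes to the caller's set have no effect here.
        this.allowedOrigins = Collections.unmodifiableSet(new HashSet<>(allowedOrigins));
    }

    /**
     * Returns the value to send as Access-Control-Allow-Origin,
     * or empty if the header should be omitted for this request.
     */
    Optional<String> allowOriginHeaderFor(String requestOrigin) {
        if (requestOrigin != null && allowedOrigins.contains(requestOrigin)) {
            return Optional.of(requestOrigin);
        }
        return Optional.empty();
    }
}
```

With http://tpalsulich.github.io on the list, a request from that origin gets the header and any other origin does not.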

[jira] [Commented] (TIKA-1590) A particular PDF seems to trigger an infinite loop when being converted to HTML

2015-04-01 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390469#comment-14390469
 ] 

Tim Allison commented on TIKA-1590:
---

Not that this is needed, but I doubly confirmed that this file no longer causes 
a hang with Tika trunk and PDFBox 1.8.9.  Many thanks to [~tilman] and 
[~msahyoun] for fixing this!
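
The fix reported in this thread is the PDFBox upgrade. For callers who cannot upgrade yet, one generic guard against a parse that never returns is to run it on a worker thread with a timeout. This is a sketch under assumptions: the class and method names are hypothetical, and the task is whatever parse call the application already makes.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper for illustration only; not part of Tika or PDFBox. */
final class BoundedParse {
    private BoundedParse() {}

    /** Runs the task on a daemon worker thread; returns fallback if it does not finish in time. */
    static <T> T callWithTimeout(Callable<T> task, long timeoutMillis, T fallback) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "bounded-parse");
            t.setDaemon(true); // a stuck worker must not keep the JVM alive
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // only interrupts; a tight CPU loop may ignore it
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            throw new IllegalStateException("task failed", e.getCause());
        } finally {
            executor.shutdownNow();
        }
    }
}
```

Note the limitation: cancelling only interrupts the worker. Code that spins without checking for interruption keeps burning CPU in the background, so running the parse in a separate process is the more robust option when that matters.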


[jira] [Resolved] (TIKA-1590) A particular PDF seems to trigger an infinite loop when being converted to HTML


 [ 
https://issues.apache.org/jira/browse/TIKA-1590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Gribov resolved TIKA-1590.
-
Resolution: Duplicate

Fixed in trunk by updating PDFBox to 1.8.9. See also TIKA-1575 and PDFBOX-2710.


Re: Access Control Allow Origin

2015-04-01 Thread Tyler Palsulich

Thank you for the feedback!

I think there's an issue (don't remember the number) to be able to specify
a TikaConfig file for tika-server. So, I think that would be the ideal
place to put more complex CORS configuration.

Tyler


Re: Access Control Allow Origin

2015-04-01 Thread Tyler Palsulich

I'll change the option to -C right now. Just looked closer -- TIKA-1426 is
to provide a config for the server and app on the command line.

Tyler

On Wed, Apr 1, 2015 at 11:22 AM, Allison, Timothy B. talli...@mitre.org
wrote:

 Might be thinking of TIKA-944?

 Mind if we switch the CORS short option to -C and use -c for the tika
 config file?


RE: Access Control Allow Origin

2015-04-01 Thread Allison, Timothy B.

Might be thinking of TIKA-944?

Mind if we switch the CORS short option to -C and use -c for the tika config 
file?

-Original Message-
From: Tyler Palsulich [mailto:tpalsul...@gmail.com] 
Sent: Wednesday, April 01, 2015 11:13 AM
To: dev@tika.apache.org
Subject: Re: Access Control Allow Origin

Thank you for the feedback!

I think there's an issue (don't remember the number) to be able to specify
a TikaConfig file for tika-server. So, I think that would be the ideal
place to put more complex CORS configuration.

Tyler

On Wed, Apr 1, 2015 at 6:02 AM, Sergey Beryozkin sberyoz...@gmail.com
wrote:

 Hi Tyler

 Sorry for a delay, I was off for the last few days,
 The change you did looks fine, the filter can check the annotations or can
 be configured directly (which is what you did).
 It might make sense to consider checking a (Java) properties resource as a
 possible future enhancement, as a CORS filter may have many properties,
 May be if a '-cors' is provided then check a well-known class resource
 where all of the cors properties are set, if it is absent - default to '*'
 otherwise work with Properties...
 The current approach works too, might be tricky to extend it to support
 more properties but great for a start

 Thanks, Sergey

 On 27/03/15 18:56, Tyler Palsulich wrote:

 Thank you, Sergey! I didn't know about that feature. I am going to try to
 work up a patch this weekend which enables CORS. I'll let you know if I
 run
 into any issues.

 Thanks again,
 Tyler

 On Thu, Mar 26, 2015 at 2:39 AM, Mattmann, Chris A (3980) 
 chris.a.mattm...@jpl.nasa.gov wrote:

 ++
 Chris Mattmann, Ph.D.
 Chief Architect
 Instrument Software and Science Data Systems Section (398)
 NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
 Office: 168-519, Mailstop: 168-527
 Email: chris.a.mattm...@nasa.gov
 WWW:  http://sunset.usc.edu/~mattmann/
 ++
 Adjunct Associate Professor, Computer Science Department
 University of Southern California, Los Angeles, CA 90089 USA
 ++

 -Original Message-
 From: Tyler Palsulich tpalsul...@gmail.com
 Reply-To: dev@tika.apache.org dev@tika.apache.org
 Date: Tuesday, March 24, 2015 at 3:41 PM
 To: dev@tika.apache.org dev@tika.apache.org
 Subject: Access Control Allow Origin

  Hi Folks,

 I took a stab at creating an example website to submit a file to the
 form
 resource of our VM. See http://tpalsulich.github.io/TikaExamples/.

 If I try to use AJAX to submit the request to make the page prettier
 (see
 the script in the head of the page (with ev.preventDefault() commented
 out), I get the following error:

 XMLHttpRequest cannot load http://162.242.228.174:9998/tika/form. No
 'Access-Control-Allow-Origin' header is present on the requested
 resource.
 Origin 'http://tpalsulich.github.io' is therefore not allowed access.
 The
 response had HTTP status code 400.

 We can't allow the tika-server response header to accept * in general,
 since that isn't secure. So, would there be interest in including this
 sort
 of site on the VM? Then, the AJAX request won't be external and we won't
 have this error.

 The version button just takes you to the version resource on the VM
 (doesn't do anything with the file).

 Tyler

[jira] [Comment Edited] (TIKA-1585) Create Example Website with Form Submission

2015-04-01 Thread Tyler Palsulich (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390841#comment-14390841
 ] 

Tyler Palsulich edited comment on TIKA-1585 at 4/1/15 3:51 PM:
---

Done. It works. -I'll see if I can shut 9997 down right now.- Port 9997 is now 
closed.


was (Author: tpalsulich):
Done. It works. I'll see if I can shut 9997 down right now.

 Create Example Website with Form Submission
 ---

 Key: TIKA-1585
 URL: https://issues.apache.org/jira/browse/TIKA-1585
 Project: Tika
  Issue Type: New Feature
  Components: example, server
Reporter: Tyler Palsulich
Assignee: Tyler Palsulich

 It would be great to have a website where we can direct people who ask what 
 Tika can do for [filetype] without needing them to actually download Tika.
 Some initial work to do that is 
 [here|http://tpalsulich.github.io/TikaExamples/].
 I'm far from a design guru, but I imagine the site as having a form where you 
 can upload a file at the top, checkboxes for if you want metadata, content, 
 or both, and a submit button. The request should be sent with AJAX and the 
 result should populate a {{div}}.
 One issue with AJAX requests is that Tika Server doesn't currently allow 
 Cross-Origin-Resource-Sharing (CORS). So, we either need to maintain a 
 slightly updated tika-server, or update the server to allow configuration.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Re: [jira] [Commented] (TIKA-1330) Add robust tika-batch code

2015-04-01 Thread Tyler Palsulich

All tests are passing. Only issue I see is excessive logging. The Hudson
failure does just look like a hiccup.

Tyler

On Wed, Apr 1, 2015 at 2:55 PM, Allison, Timothy B. talli...@mitre.org
wrote:

 This looks like a Hudson hiccup.

 Tyler is seeing excessive logging:
 Running org.apache.tika.cli.TikaCLIBatchIntegrationTest
 INFO - about to start driver
 INFO - about to start driver

 Anyone else having problems building from a fresh trunk?


 -Original Message-
 From: Hudson (JIRA) [mailto:j...@apache.org]
 Sent: Wednesday, April 01, 2015 5:36 PM
 To: dev@tika.apache.org
 Subject: [jira] [Commented] (TIKA-1330) Add robust tika-batch code


 [
 https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391512#comment-14391512
 ]

 Hudson commented on TIKA-1330:
 --

 ABORTED: Integrated in tika-trunk-jdk1.7 #596 (See [
 https://builds.apache.org/job/tika-trunk-jdk1.7/596/])
 TIKA-1330 flush stacktrace writers (tallison:
 http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670756)
 *
 /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/FileResourceConsumer.java
 *
 /tika/trunk/tika-batch/src/main/java/org/apache/tika/util/TikaExceptionFilter.java
 TIKA-1330 clean up logging in tika-batch ant tika-app integration of
 tika-batch, take 2 (tallison:
 http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670751)
 *
 /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/FileResourceConsumer.java


  Add robust tika-batch code
  --
 
  Key: TIKA-1330
  URL: https://issues.apache.org/jira/browse/TIKA-1330
  Project: Tika
   Issue Type: Sub-task
   Components: cli, general, server
 Reporter: Tim Allison
 Assignee: Tim Allison
  Attachments: TIKA-1330v1-patch.zip
 
 
  In my current design plan, I see creating a separate component
 tika-batch that includes a small bit of configurable code to run Tika
 against a large batch of documents.  This code should be robust against OOM
 and hangs, and it should have fairly robust logging.



 --
 This message was sent by Atlassian JIRA
 (v6.3.4#6332)

[jira] [Resolved] (TIKA-1591) Tika Parsers uses wrong version of bouncycastle


 [ 
https://issues.apache.org/jira/browse/TIKA-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Gribov resolved TIKA-1591.
-
   Resolution: Fixed
Fix Version/s: 1.8

Updated in r1670802

 Tika Parsers uses wrong version of bouncycastle
 ---

 Key: TIKA-1591
 URL: https://issues.apache.org/jira/browse/TIKA-1591
 Project: Tika
  Issue Type: Bug
Reporter: Ben McCann
Assignee: Konstantin Gribov
 Fix For: 1.8


 Tika uses:
 <dependency>
   <groupId>org.bouncycastle</groupId>
   <artifactId>bcmail-jdk15</artifactId>
   <version>1.45</version>
 </dependency>
 It's not recommended to use bcmail-jdk15, which is the old artifact name. 
 Instead bcmail-jdk15on should be used (latest version is 1.52)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Comment Edited] (TIKA-1517) MIME type selection with probability

[
https://issues.apache.org/jira/browse/TIKA-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391476#comment-14391476
]

Luke sh edited comment on TIKA-1517 at 4/2/15 1:30 AM:
---

After some research, it looks like the algorithm design for probabilistic MIME
type selection causes some confusion when compared with Naive Bayes. The idea is
borrowed from Naive Bayes, but giving up some of its properties makes the
design less intuitive, even though it saves some computation.

One problem I have been considering is what to do when the magic test fails to
determine a type and returns byte-stream. The original design treats
byte-stream as a decision. So when the extension test returns a correct type
(e.g. GRB) but the magic test returns byte-stream, meaning it failed to detect
the type, the question is whether byte-stream should count as one of the
decisions. After thinking about this for some time this week, I have decided to
ignore a byte-stream prediction when taking the vote on which type to use.
e.g.
magic test: Byte-Stream (failed to detect the type)
extension test: GRB
meta hint: none (Byte-Stream)
In this case the original design is expected to return Byte-Stream as the
final decision when the magic test trust values (i.e. the presumed conditional
probabilities) are set high.

Secondly, I tend to think that giving up the prior might not be a good idea.
Dropping it simplifies the computation a bit, but it makes the design confusing
and less intuitive.
The intuition is that we can probably treat a detected type as a cause whose
prior probability of being correct is 50%, i.e. there is a 50% chance that the
detected type is correct. This seems more intuitive than ignoring the prior
completely.

After thinking about this problem for a while, it seems there is still some
room for the algorithm to be improved and optimized.
The original design is intertwined with computational considerations, which
seems to cause some confusion, when the causal reasoning and intuition may
actually be more important. I will also be optimizing and correcting some of
the factors that seem less appropriate in the original design.

I am working on the improvement and researching the pros and cons; if you have
any thoughts, please kindly let me know.
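
The two ideas in the comment above can be sketched in a few lines of Java. This is not the TIKA-1517 patch: the class name MimeVote, the detector names, the trust weights and the "application/x-grib" label for GRB are all assumptions made for illustration, and decide() is a plain weighted vote rather than the Naive-Bayes-style computation under discussion. It shows a detector that returns byte-stream being treated as abstaining, and a posterior computed with an explicit prior (0.5 in the example).

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration only; not part of Tika. */
class MimeVote {
    static final String OCTET_STREAM = "application/octet-stream";

    /**
     * Weighted vote over detector predictions.
     * predictions: detector name -> predicted type; trust: detector name -> weight.
     * A detector that predicts octet-stream (or nothing) abstains.
     */
    static String decide(Map<String, String> predictions, Map<String, Double> trust) {
        Map<String, Double> score = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : predictions.entrySet()) {
            String type = e.getValue();
            if (type == null || OCTET_STREAM.equals(type)) {
                continue; // failed detection: do not let it vote
            }
            Double w = trust.get(e.getKey());
            Double old = score.get(type);
            score.put(type, (old == null ? 0.0 : old) + (w == null ? 0.0 : w));
        }
        String best = OCTET_STREAM; // fallback when every detector abstained
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : score.entrySet()) {
            if (e.getValue() > bestScore) {
                best = e.getKey();
                bestScore = e.getValue();
            }
        }
        return best;
    }

    /** Bayes' rule: P(type correct | detector reported it), given a prior. */
    static double posterior(double prior, double pHitGivenCorrect, double pHitGivenWrong) {
        double numerator = pHitGivenCorrect * prior;
        return numerator / (numerator + pHitGivenWrong * (1.0 - prior));
    }

    public static void main(String[] args) {
        Map<String, String> predictions = new LinkedHashMap<>();
        predictions.put("magic", OCTET_STREAM);              // magic test failed
        predictions.put("extension", "application/x-grib");  // assumed label for GRB
        predictions.put("metadata", OCTET_STREAM);           // no hint supplied
        Map<String, Double> trust = new LinkedHashMap<>();
        trust.put("magic", 0.9);      // made-up trust values
        trust.put("extension", 0.6);
        trust.put("metadata", 0.5);
        System.out.println(decide(predictions, trust));
        System.out.println(posterior(0.5, 0.9, 0.1));
    }
}
```

With the byte-stream votes dropped, the extension test alone decides the GRB case, however high the magic test's trust value is set.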

was (Author: lukeliush):
After some research, it looks like the algorithm design for probabilistic MIME
type selection causes some confusion when compared with Naive Bayes. The idea is
borrowed from Naive Bayes, but giving up some of its properties makes the
design less intuitive, even though it saves some computation.

I am working on the improvement; if you have any thoughts, please kindly let me
know.

MIME type selection with probability

Key:

[jira] [Commented] (TIKA-1330) Add robust tika-batch code


[ 
https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392027#comment-14392027
 ] 

Hudson commented on TIKA-1330:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #599 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/599/])
TIKA-1330 fix logging in TikaCLI to avoid adding multiple appenders (tallison: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670804)
* /tika/trunk/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java
* /tika/trunk/tika-app/src/main/resources/log4j.properties


 Add robust tika-batch code
 --

 Key: TIKA-1330
 URL: https://issues.apache.org/jira/browse/TIKA-1330
 Project: Tika
  Issue Type: Sub-task
  Components: cli, general, server
Reporter: Tim Allison
Assignee: Tim Allison
 Attachments: TIKA-1330v1-patch.zip


 In my current design plan, I see creating a separate component tika-batch 
 that includes a small bit of configurable code to run Tika against a large 
 batch of documents.  This code should be robust against OOM and hangs, and it 
 should have fairly robust logging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (TIKA-1323) Improve exception reporting in JAX-RS server


[ 
https://issues.apache.org/jira/browse/TIKA-1323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392026#comment-14392026
 ] 

Hudson commented on TIKA-1323:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #599 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/599/])
TIKA-1323: flush writer when printing stack trace (tallison: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670807)
* 
/tika/trunk/tika-server/src/main/java/org/apache/tika/server/TikaServerParseExceptionMapper.java


 Improve exception reporting in JAX-RS server
 

 Key: TIKA-1323
 URL: https://issues.apache.org/jira/browse/TIKA-1323
 Project: Tika
  Issue Type: Improvement
  Components: server
Reporter: Tim Allison
Priority: Minor

 I'd like to use tika-server for TIKA-1302.  As part of that, I'd like to 
 record exception stacktraces per document.  I see two options: transmit the 
 info back to the client (assuming a doc didn't bring the server down :) ) 
 along with the current error code or log the document id and stacktrace via 
 the server.  Given my current design thoughts, I'd prefer the first option.
 Any objections or recommendations?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

RE: [jira] [Commented] (TIKA-1330) Add robust tika-batch code

2015-04-01 Thread Allison, Timothy B.

The duplication/triplication of
 INFO - about to start driver
 INFO - about to start driver

happened because main() adds a new appender with .configure(), so subsequent
calls to main() in the tests were adding more appenders.  I just fixed that in
r1670804.
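
A minimal sketch of the general pattern follows, written against java.util.logging from the JDK rather than log4j so that it runs without extra jars. It is not the change made in r1670804, and the class name LoggingSetup and the logger name "demo.cli" are hypothetical. The point is only that the handler (the log4j "appender") is attached once, no matter how many times the entry point runs in the same JVM.

```java
import java.util.logging.ConsoleHandler;
import java.util.logging.Logger;

/** Hypothetical stand-in for a CLI entry point that tests call repeatedly. */
class LoggingSetup {
    static final Logger LOG = Logger.getLogger("demo.cli");
    private static boolean configured = false;

    /** Attaches the console handler on the first call only. */
    static synchronized void configureOnce() {
        if (configured) {
            return;
        }
        LOG.setUseParentHandlers(false); // avoid a second copy via the root logger
        LOG.addHandler(new ConsoleHandler());
        configured = true;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 3; i++) {
            configureOnce(); // simulates three calls to main() from tests
        }
        LOG.info("about to start driver"); // emitted by a single handler
    }
}
```

Without the guard, each call would add another handler and every message would be printed once per handler, which matches the repeated INFO lines reported above.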

What I can't figure out is why you're seeing anything.  I've redirected both 
stdout and stderr in setup() (annotated @Before) to ByteArrayOutputStreams.

If setup() weren't being called, you'd get NPEs for each of the four tests, so 
setup() must be getting called.
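
For readers following along, the redirect described above can be sketched with the JDK alone. The helper name CapturedOutput is hypothetical and this is not the code in TikaCLIBatchIntegrationTest; it only shows swapping System.out for a PrintStream over a ByteArrayOutputStream and restoring it afterwards.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

/** Hypothetical helper; shows the redirect-and-restore pattern only. */
class CapturedOutput {
    /** Runs body with System.out redirected and returns what it printed. */
    static String captureStdout(Runnable body) {
        PrintStream original = System.out;
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        PrintStream capture = new PrintStream(buffer, true);
        System.setOut(capture);
        try {
            body.run();
        } finally {
            capture.flush();
            System.setOut(original); // always restore the real stream
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        String seen = captureStdout(new Runnable() {
            @Override
            public void run() {
                System.out.print("INFO - about to start driver");
            }
        });
        System.out.println("captured: " + seen);
    }
}
```

One limitation of this pattern is that System.setOut() only affects code that looks up System.out after the swap. Anything that stored a reference to the original stream earlier keeps writing to the console.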

-Original Message-
From: Tyler Palsulich [mailto:tpalsul...@gmail.com] 
Sent: Wednesday, April 01, 2015 7:39 PM
To: dev@tika.apache.org
Subject: Re: [jira] [Commented] (TIKA-1330) Add robust tika-batch code

All tests are passing. Only issue I see is excessive logging. The Hudson
failure does just look like a hiccup.

Tyler

On Wed, Apr 1, 2015 at 2:55 PM, Allison, Timothy B. talli...@mitre.org
wrote:

 This looks like a Hudson hiccup.

 Tyler is seeing excessive logging:
 Running org.apache.tika.cli.TikaCLIBatchIntegrationTest
 INFO - about to start driver
 INFO - about to start driver

 Anyone else having problems building from a fresh trunk?


 -Original Message-
 From: Hudson (JIRA) [mailto:j...@apache.org]
 Sent: Wednesday, April 01, 2015 5:36 PM
 To: dev@tika.apache.org
 Subject: [jira] [Commented] (TIKA-1330) Add robust tika-batch code


 [
 https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391512#comment-14391512
 ]

 Hudson commented on TIKA-1330:
 --

 ABORTED: Integrated in tika-trunk-jdk1.7 #596 (See [
 https://builds.apache.org/job/tika-trunk-jdk1.7/596/])
 TIKA-1330 flush stacktrace writers (tallison:
 http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670756)
 *
 /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/FileResourceConsumer.java
 *
 /tika/trunk/tika-batch/src/main/java/org/apache/tika/util/TikaExceptionFilter.java
 TIKA-1330 clean up logging in tika-batch ant tika-app integration of
 tika-batch, take 2 (tallison:
 http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670751)
 *
 /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/FileResourceConsumer.java


  Add robust tika-batch code
  --
 
  Key: TIKA-1330
  URL: https://issues.apache.org/jira/browse/TIKA-1330
  Project: Tika
   Issue Type: Sub-task
   Components: cli, general, server
 Reporter: Tim Allison
 Assignee: Tim Allison
  Attachments: TIKA-1330v1-patch.zip
 
 
  In my current design plan, I see creating a separate component
 tika-batch that includes a small bit of configurable code to run Tika
 against a large batch of documents.  This code should be robust against OOM
 and hangs, and it should have fairly robust logging.



 --
 This message was sent by Atlassian JIRA
 (v6.3.4#6332)

[jira] [Commented] (TIKA-1590) A particular PDF seems to trigger an infinite loop when being converted to HTML

2015-04-01 Thread Matt Sheppard (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392196#comment-14392196
 ] 

Matt Sheppard commented on TIKA-1590:
-

Great, thanks - Looking forward to 1.8!

 A particular PDF seems to trigger an infinite loop when being converted to 
 HTML
 ---

 Key: TIKA-1590
 URL: https://issues.apache.org/jira/browse/TIKA-1590
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.6, 1.7
Reporter: Matt Sheppard
 Fix For: 1.8

 Attachments: National_Audit_tool_CTH_Audit_Report_PDF,_292_KB.pdf, 
 jstack.txt


 The PDF at 
 http://www.comcare.gov.au/__data/assets/pdf_file/0019/117244/National_Audit_tool_CTH_Audit_Report_PDF,_292_KB.pdf
  (which I'll also attach) appears to trigger an infinite loop (or at least is 
 exceedingly slow) when being filtered by Tika.
 {noformat}
 java -jar tika-app-1.7.jar 
 National_Audit_tool_CTH_Audit_Report_PDF,_292_KB.pdf
 <?xml version="1.0" encoding="UTF-8"?><html
 xmlns="http://www.w3.org/1999/xhtml">
 <head>
 <meta name="date" content="2015-02-05T04:48:30Z"/>
 <meta name="pdf:PDFVersion" content="1.6"/>
 <meta name="xmp:CreatorTool" content="Adobe InDesign CC 2014 (Macintosh)"/>
 <meta name="dc:description" content="Licensee Improvement"/>
 <meta name="Keywords" content="Licensee, Improvement, Program, NAT, CTH,
 Report&#13;&#10;"/>
 <meta name="subject" content="Licensee Improvement"/>
 <meta name="dc:creator" content="Comcare"/>
 <meta name="description" content="Licensee Improvement"/>
 <meta name="dcterms:created" content="2014-10-07T02:46:10Z"/>
 <meta name="Last-Modified" content="2015-02-05T04:48:30Z"/>
 <meta name="dcterms:modified" content="2015-02-05T04:48:30Z"/>
 <meta name="dc:format" content="application/pdf; version=1.6"/>
 <meta name="Last-Save-Date" content="2015-02-05T04:48:30Z"/>
 <meta name="meta:save-date" content="2015-02-05T04:48:30Z"/>
 <meta name="pdf:encrypted" content="false"/>
 <meta name="dc:title" content="Licensee Improvement Program NAT (CTH) audit
 report"/>
 <meta name="modified" content="2015-02-05T04:48:30Z"/>
 <meta name="cp:subject" content="Licensee Improvement"/>
 <meta name="Content-Length" content="299338"/>
 <meta name="Content-Type" content="application/pdf"/>
 <meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
 <meta name="X-Parsed-By" content="org.apache.tika.parser.pdf.PDFParser"/>
 <meta name="creator" content="Comcare"/>
 <meta name="meta:author" content="Comcare"/>
 <meta name="dc:subject" content="Licensee, Improvement, Program, NAT, CTH,
 Report&#13;&#10;"/>
 <meta name="trapped" content="False"/>
 <meta name="meta:creation-date" content="2014-10-07T02:46:10Z"/>
 <meta name="created" content="Tue Oct 07 13:46:10 AEDT 2014"/>
 <meta name="xmpTPg:NPages" content="72"/>
 <meta name="Creation-Date" content="2014-10-07T02:46:10Z"/>
 <meta name="resourceName"
 content="National_Audit_tool_CTH_Audit_Report_PDF,_292_KB.pdf"/>
 <meta name="meta:keyword" content="Licensee, Improvement, Program, NAT, CTH,
 Report&#13;&#10;"/>
 <meta name="Author" content="Comcare"/>
 <meta name="producer" content="Adobe PDF Library 11.0"/>
 <title>Licensee Improvement Program NAT (CTH) audit report</title>
 </head>
 <body><div class="page"><p/>
 <p>LICENSEE
 IMPROVEMENT
 PROGRAM
 [snip]
 </p>
 <p>Finding:
 </p>
 <p>Evidence:
 </p>
 <p>Comment:
 </p>
 <p>Observation:
 </p>
 <p>Non-conformance:
 </p>
 <p>
 [just appears to hang forever at this point]
 {noformat}
 The relevant thread's stack is something like...
 {noformat}
 main #1 prio=5 os_prio=31 tid=0x7fbd6900b000 nid=0xf07 runnable 
 [0x00010fc18000]
java.lang.Thread.State: RUNNABLE
   at 
 org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:184)
   at 
 org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
   at 
 org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
   at 
 org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
   at 
 org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
   at 
 org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
   at 
 org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
   at 
 org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
   at 
 org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
   at 
 org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
   at 
 org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
   at 
 org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
   at 
 org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
   at 
 org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:179)
   at

[jira] [Commented] (TIKA-1330) Add robust tika-batch code


[ 
https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391255#comment-14391255
 ] 

Hudson commented on TIKA-1330:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #595 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/595/])
TIKA-1330 clean up logging in tika-batch ant tika-app integration of tika-batch 
(tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670749)
* 
/tika/trunk/tika-app/src/main/java/org/apache/tika/cli/BatchCommandLineBuilder.java
* /tika/trunk/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java
* /tika/trunk/tika-app/src/main/resources/log4j_batch_process.properties
* 
/tika/trunk/tika-app/src/test/java/org/apache/tika/cli/TikaCLIBatchIntegrationTest.java
* /tika/trunk/tika-app/src/test/java/org/apache/tika/cli/TikaCLITest.java
* /tika/trunk/tika-app/src/test/resources/log4j_batch_process_test.properties
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/BatchProcess.java
* 
/tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/BatchProcessDriverCLI.java
* 
/tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/FileResourceConsumer.java
* 
/tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/ParallelFileProcessingResult.java
* 
/tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/fs/AbstractFSConsumer.java
* 
/tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/fs/BasicTikaFSConsumer.java
* 
/tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/fs/FSBatchProcessCLI.java
* 
/tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/fs/RecursiveParserWrapperFSConsumer.java
* /tika/trunk/tika-batch/src/test/resources/log4j.properties
* /tika/trunk/tika-batch/src/test/resources/log4j_process.properties


 Add robust tika-batch code
 --

 Key: TIKA-1330
 URL: https://issues.apache.org/jira/browse/TIKA-1330
 Project: Tika
  Issue Type: Sub-task
  Components: cli, general, server
Reporter: Tim Allison
Assignee: Tim Allison
 Attachments: TIKA-1330v1-patch.zip


 In my current design plan, I see creating a separate component tika-batch 
 that includes a small bit of configurable code to run Tika against a large 
 batch of documents.  This code should be robust against OOM and hangs, and it 
 should have fairly robust logging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (TIKA-1586) Enable CORS on Tika Server


[ 
https://issues.apache.org/jira/browse/TIKA-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391050#comment-14391050
 ] 

Hudson commented on TIKA-1586:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #594 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/594/])
TIKA-1586. Change CORS short option to -C. (tpalsulich: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670683)
* 
/tika/trunk/tika-server/src/main/java/org/apache/tika/server/TikaServerCli.java


 Enable CORS on Tika Server
 --

 Key: TIKA-1586
 URL: https://issues.apache.org/jira/browse/TIKA-1586
 Project: Tika
  Issue Type: New Feature
  Components: server
Reporter: Tyler Palsulich
Assignee: Tyler Palsulich

 Tika Server should allow configuration of CORS requests (for uses like 
 TIKA-1585). See [this example|http://cxf.apache.org/docs/jax-rs-cors.html] 
 from CXF for how to add it.
 The only change from that site is that we will need to add a 
 {{CrossOriginResourceSharingFilter}} as a provider.
 Ideally, this is configurable (limit which resources have CORS, and which 
 origins are allowed). But, I'm not thinking of any general methods of how to 
 do that...
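
One way to get the configurability asked for here is the suggestion made elsewhere in this thread: read a well-known classpath properties resource and fall back to '*' when nothing is configured. The sketch below uses only the JDK. The class name CorsSettings, the resource name /tika-cors.properties and the key cors.allowOrigins are hypothetical and are not tika-server options; the step that hands the resulting list to CXF's CrossOriginResourceSharingFilter is left out.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

/** Hypothetical helper; not part of tika-server. */
class CorsSettings {
    // Hypothetical resource name and property key, chosen for illustration only.
    static final String RESOURCE = "/tika-cors.properties";
    static final String ORIGINS_KEY = "cors.allowOrigins";

    /** Returns the configured origins, or ["*"] when nothing is configured. */
    static List<String> allowedOrigins(Properties props) {
        List<String> origins = new ArrayList<>();
        String raw = (props == null) ? null : props.getProperty(ORIGINS_KEY);
        if (raw != null) {
            for (String part : raw.split(",")) {
                String origin = part.trim();
                if (!origin.isEmpty()) {
                    origins.add(origin);
                }
            }
        }
        if (origins.isEmpty()) {
            origins.add("*"); // default suggested in the thread
        }
        return origins;
    }

    /** Loads the well-known classpath resource; returns null if it is absent. */
    static Properties loadFromClasspath() throws IOException {
        try (InputStream in = CorsSettings.class.getResourceAsStream(RESOURCE)) {
            if (in == null) {
                return null;
            }
            Properties props = new Properties();
            props.load(in);
            return props;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(allowedOrigins(loadFromClasspath()));
    }
}
```

Since an unrestricted '*' was described earlier in the thread as not secure in general, a deployment would probably list explicit origins in the resource and rely on the default only for local testing.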



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

RE: [jira] [Commented] (TIKA-1330) Add robust tika-batch code

2015-04-01 Thread Allison, Timothy B.

This looks like a Hudson hiccup.

Tyler is seeing excessive logging:
Running org.apache.tika.cli.TikaCLIBatchIntegrationTest
INFO - about to start driver
INFO - about to start driver

Anyone else having problems building from a fresh trunk?


-Original Message-
From: Hudson (JIRA) [mailto:j...@apache.org] 
Sent: Wednesday, April 01, 2015 5:36 PM
To: dev@tika.apache.org
Subject: [jira] [Commented] (TIKA-1330) Add robust tika-batch code


[ 
https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391512#comment-14391512
 ] 

Hudson commented on TIKA-1330:
--

ABORTED: Integrated in tika-trunk-jdk1.7 #596 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/596/])
TIKA-1330 flush stacktrace writers (tallison: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670756)
* 
/tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/FileResourceConsumer.java
* 
/tika/trunk/tika-batch/src/main/java/org/apache/tika/util/TikaExceptionFilter.java
TIKA-1330 clean up logging in tika-batch ant tika-app integration of 
tika-batch, take 2 (tallison: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670751)
* 
/tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/FileResourceConsumer.java


 Add robust tika-batch code
 --

 Key: TIKA-1330
 URL: https://issues.apache.org/jira/browse/TIKA-1330
 Project: Tika
  Issue Type: Sub-task
  Components: cli, general, server
Reporter: Tim Allison
Assignee: Tim Allison
 Attachments: TIKA-1330v1-patch.zip


 In my current design plan, I see creating a separate component tika-batch 
 that includes a small bit of configurable code to run Tika against a large 
 batch of documents.  This code should be robust against OOM and hangs, and it 
 should have fairly robust logging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (TIKA-1517) MIME type selection with probability

[
https://issues.apache.org/jira/browse/TIKA-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391476#comment-14391476
]

Luke sh commented on TIKA-1517:
---

After some research, it looks like the algorithm design for probabilistic MIME
type selection causes some confusion when compared with Naive Bayes. The idea is
borrowed from Naive Bayes, but giving up some of its properties makes the
design less intuitive, even though it saves some computation.

One problem I have been considering is what to do when the magic test fails to
determine a type and returns byte-stream. The original design treats
byte-stream as a decision. So when the extension test returns a correct type
(e.g. GRB) but the magic test returns byte-stream, meaning it failed to detect
the type, the question is whether byte-stream should count as one of the
decisions. After thinking about this for some time this week, I have decided to
ignore a byte-stream prediction when taking the vote on which type to use.
e.g.
magic test: Byte-Stream
extension test: GRB
meta hint: none (Byte-Stream)
The original design is expected to return Byte-Stream as the final decision
when the magic test trust values (i.e. the presumed conditional probabilities)
are set high.

After thinking about this problem for a while, it seems there is still some
room for correction and optimization.
The original design is intertwined with computational considerations, which
seems to cause some confusion, when the causal reasoning and intuition may
actually be more important. I will also be optimizing and correcting some of
the factors that seem less appropriate in the original design.

I am working on the improvement; if you have any thoughts, please kindly let me
know.

MIME type selection with probability

Key: TIKA-1517
URL: https://issues.apache.org/jira/browse/TIKA-1517
Project: Tika
Issue Type: Improvement
Components: mime
Affects Versions: 0.1-incubating, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9,
0.10, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6
Reporter: Luke sh
Attachments: BaysianTest.java

Improvement and intuition
The original implementation of MIME type selection/detection is somewhat
inflexible by design, as it relies heavily on the outcome of magic-bytes MIME
type identification. For example, if magic bytes are applicable to a file, Tika
follows the file type detected by magic bytes. It may be better to give users
more control over which method is chosen.
The proposed approach loosely incorporates Bayes' theorem: users assign a
weight, expressed as a probability, to each identification method, which gives
them control over which of the methods available in Tika is preferred. There
are currently 3 methods for identifying MIME type in Tika (i.e. magic bytes,
file extension and metadata content-type hint). With weights on each method,
users can choose which method they trust most, although the magic-bytes method
is often trustworthy. The benefit is that in some situations file type
identification is sensitive: some users might want all of the identification
methods to agree on the same file type before they start processing those
files, because incorrect file type identification is less tolerable. The
current implementation seems less flexible for this purpose and relies heavily
on the magic-bytes method (although magic bytes is the most reliable of the
3).
Proposed design:
The idea is to attach a probability, as a weight, to each MIME type
identification method currently implemented in Tika (the magic-bytes approach,
file extension match and metadata content-type hint).
For example, as a user, I would probably like to assign a preference to each
method based on how much I trust it, and order the results if they don't
coincide.
Bayes' rule seems appropriate here to match the intuition.
The following is what is needed for a Bayes' rule implementation.
Prior probability
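
The weighting idea described above can be sketched roughly as follows. This is
only an illustration under stated assumptions, not the attached
BaysianTest.java and not Tika's API: the class WeightedMimeSelector, its
methods, the method names used as map keys and the uniform prior are all
hypothetical. Each method gets one trust value, read as the assumed
probability that the method is right, and each candidate type is scored as the
prior times the product of the per-method terms, in the spirit of Naive Bayes.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: class and method names are illustrative, not Tika API.
class WeightedMimeSelector {

    // votes: method name -> MIME type that the method reported
    // trust: method name -> assumed P(method is right), strictly between 0 and 1
    // returns: candidate MIME type -> normalized score (scores sum to 1)
    static Map<String, Double> score(Map<String, String> votes,
                                     Map<String, Double> trust) {
        Set<String> candidates = new LinkedHashSet<>(votes.values());
        Map<String, Double> scores = new LinkedHashMap<>();
        double total = 0.0;
        for (String candidate : candidates) {
            double s = 1.0 / candidates.size(); // uniform prior, an assumption
            for (Map.Entry<String, String> vote : votes.entrySet()) {
                double t = trust.get(vote.getKey());
                // A method that agrees with the candidate contributes its
                // trust; a method that disagrees contributes the complement.
                s *= vote.getValue().equals(candidate) ? t : 1.0 - t;
            }
            scores.put(candidate, s);
            total += s;
        }
        // Normalize so the scores can be compared as probabilities.
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            e.setValue(e.getValue() / total);
        }
        return scores;
    }

    // Returns the candidate with the highest score, or null if there is none.
    static String best(Map<String, Double> scores) {
        String bestType = null;
        double bestScore = -1.0;
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            if (e.getValue() > bestScore) {
                bestScore = e.getValue();
                bestType = e.getKey();
            }
        }
        return bestType;
    }

    // Strict mode: true only when every method reported the same type.
    static boolean allAgree(Map<String, String> votes) {
        return new HashSet<>(votes.values()).size() == 1;
    }
}
```

With trust values of 0.9 for magic bytes and 0.6 for the other two methods, a
file where magic bytes reports application/pdf and the other two report
text/plain scores 0.8 for application/pdf and 0.2 for text/plain, so the
magic-bytes result still wins. Lowering the magic-bytes trust or raising the
others changes that outcome, which is the kind of control the proposal asks
for.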

[jira] [Commented] (TIKA-1330) Add robust tika-batch code


[ 
https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391512#comment-14391512
 ] 

Hudson commented on TIKA-1330:
--

ABORTED: Integrated in tika-trunk-jdk1.7 #596 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/596/])
TIKA-1330 flush stacktrace writers (tallison: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670756)
* 
/tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/FileResourceConsumer.java
* 
/tika/trunk/tika-batch/src/main/java/org/apache/tika/util/TikaExceptionFilter.java
TIKA-1330 clean up logging in tika-batch ant tika-app integration of 
tika-batch, take 2 (tallison: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670751)
* 
/tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/FileResourceConsumer.java


 Add robust tika-batch code
 --

 Key: TIKA-1330
 URL: https://issues.apache.org/jira/browse/TIKA-1330
 Project: Tika
  Issue Type: Sub-task
  Components: cli, general, server
Reporter: Tim Allison
Assignee: Tim Allison
 Attachments: TIKA-1330v1-patch.zip


 In my current design plan, I see creating a separate component tika-batch 
 that includes a small bit of configurable code to run Tika against a large 
 batch of documents.  This code should be robust against OOM and hangs, and it 
 should have fairly robust logging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Created] (TIKA-1591) Tika Parsers uses wrong version of bouncycastle

2015-04-01 Thread Ben McCann (JIRA)

Ben McCann created TIKA-1591:


 Summary: Tika Parsers uses wrong version of bouncycastle
 Key: TIKA-1591
 URL: https://issues.apache.org/jira/browse/TIKA-1591
 Project: Tika
  Issue Type: Bug
Reporter: Ben McCann


Tika uses:

<dependency>
  <groupId>org.bouncycastle</groupId>
  <artifactId>bcmail-jdk15</artifactId>
  <version>1.45</version>
</dependency>

It's not recommended to use bcmail-jdk15, which is the old artifact name. 
Instead bcmail-jdk15on should be used (latest version is 1.52)




[jira] [Comment Edited] (TIKA-1517) MIME type selection with probability

[
https://issues.apache.org/jira/browse/TIKA-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391476#comment-14391476
]

Luke sh edited comment on TIKA-1517 at 4/1/15 10:38 PM:

One of the problems I have been considering is that when the magic test fails
to determine a type, it returns byte-stream. The original design treats that
byte-stream result as a decision. So when the extension test returns a correct
type (e.g. GRB) but the magic test returns byte-stream, meaning it failed to
detect the type, the question is whether we should count byte-stream as one of
the decisions. After thinking about this for some time this week, I have
decided to ignore a byte-stream result from any test when voting to decide
which type should be used.
e.g.
magic test: Byte-Stream
extension test: GRB
meta hint: none (Byte-Stream)
With the original design, the system is expected to return Byte-Stream as the
final decision for type detection when the magic test trust values (i.e. the
presumed conditional probabilities) are set high.
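
The decision to ignore byte-stream results could look roughly like the sketch
below. It is an illustration only: it uses a plain trust-weighted vote rather
than the full Bayesian computation, and the class OctetStreamAwareVote, its
decide method, the map keys and the MIME type string used for GRB in the usage
note are all assumptions, not Tika's API. A test that reports
application/octet-stream (taken here to be what "byte-stream" means) or
nothing at all is treated as abstaining.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: names are illustrative, not Tika API.
class OctetStreamAwareVote {

    static final String OCTET_STREAM = "application/octet-stream";

    // votes: test name -> MIME type it reported (null means no answer)
    // trust: test name -> weight given to that test
    // returns the type with the largest total weight among the tests that
    // actually detected something, or octet-stream if none of them did
    static String decide(Map<String, String> votes, Map<String, Double> trust) {
        Map<String, Double> tally = new LinkedHashMap<>();
        for (Map.Entry<String, String> vote : votes.entrySet()) {
            String type = vote.getValue();
            if (type == null || OCTET_STREAM.equals(type)) {
                continue; // a failed detection abstains instead of voting
            }
            tally.merge(type, trust.get(vote.getKey()), Double::sum);
        }
        String best = OCTET_STREAM;
        double bestWeight = 0.0;
        for (Map.Entry<String, Double> e : tally.entrySet()) {
            if (e.getValue() > bestWeight) {
                bestWeight = e.getValue();
                best = e.getKey();
            }
        }
        return best;
    }
}
```

For the example above, with the magic test and the meta hint both reporting
octet-stream and the extension test reporting a GRB type (written as
"application/x-grib" for illustration), decide returns the GRB type even when
the magic test carries the highest trust value.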

I am working on the improvement; if you have any thoughts, please let me know.

was (Author: lukeliush):

One of the problems i have been considering is that when magic test fails to
determine a type, it returns byte-stream, in the original design, the
original design approaches this by taking byte-stream as a decision, so e.g.
when the extension test returns a correct type (e.g. GRB) but the magic test
returns the byte-stream meaning it fails to detect the type, the question is
should we take byte-stream as part of the decisions. After thinking about this
this week for some time, i decide to ignore this byte-stream predicted by a
test when taking vote for decision on which type should be used.
e.g.
magic test : Byte-Stream
extension test: GRB
meta hint: none(Byte-Stream)
The original design of the system is expected to return Byte-Stream as the
final decision for the type detection when magic test trust values(i.e. the
presumed conditional probabilities) are set to high.

I am working on the improvement; if any thoughts, please kindly let me know.

MIME type selection with probability

Key: TIKA-1517
URL: https://issues.apache.org/jira/browse/TIKA-1517
Project: Tika
Issue Type: Improvement
Components:

[jira] [Comment Edited] (TIKA-1517) MIME type selection with probability

[
https://issues.apache.org/jira/browse/TIKA-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391476#comment-14391476
]

Luke sh edited comment on TIKA-1517 at 4/1/15 10:37 PM:

One of the problems I have been considering is that when the magic test fails
to determine a type, it returns byte-stream. The original design treats that
byte-stream result as a decision. So when the extension test returns a correct
type (e.g. GRB) but the magic test returns byte-stream, meaning it failed to
detect the type, the question is whether we should count byte-stream as one of
the decisions. After thinking about this for some time this week, I have
decided to ignore a byte-stream result from any test when voting to decide
which type should be used.
e.g.
magic test: Byte-Stream
extension test: GRB
meta hint: none (Byte-Stream)
With the original design, the system is expected to return Byte-Stream as the
final decision for type detection when the magic test trust values (i.e. the
presumed conditional probabilities) are set high.

I am working on the improvement; if you have any thoughts, please let me know.

was (Author: lukeliush):
After some research, it looks like the design of the probabilistic MIME type
selection algorithm causes some confusion when compared with Naive Bayes. The
idea is borrowed from Naive Bayes, but giving up some of the properties of
Naive Bayes seems to cause confusion and makes the approach less intuitive,
although some computation is saved.

One of the problems i have been considering is that when magic test fails to
determine a type, it returns byte-stream, in the original design, the
original design approaches this by taking byte-stream as a decision, so e.g.
when the extension test returns a correct type (e.g. GRB) but the magic test
returns the byte-stream meaning it fails to detect the type, the question is
should we take byte-stream as part of the decisions. After thinking about this
this week for some time, i decide to ignore this byte-stream predicted by a
test when taking vote for decision on which type should be used.
e.g.
magic test : Byte-Stream
extension test: GRB
meta hint: none(Byte-Stream)
The original design of the system is expected to return Byte-Stream as the
final decision for the type detection when magic test trust values(i.e. the
presumed conditional probabilities) are set to high.

I am working on the improvement; if any thoughts, please kindly let me know.

MIME type selection with probability

Key: TIKA-1517
URL: https://issues.apache.org/jira/browse/TIKA-1517
Project: Tika
Issue Type: Improvement
Components: mime

[jira] [Comment Edited] (TIKA-1517) MIME type selection with probability

[
https://issues.apache.org/jira/browse/TIKA-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391476#comment-14391476
]

Luke sh edited comment on TIKA-1517 at 4/1/15 10:39 PM:

One of the problems I have been considering is that when the magic test fails
to determine a type, it returns byte-stream. The original design treats that
byte-stream result as a decision. So when the extension test returns a correct
type (e.g. GRB) but the magic test returns byte-stream, meaning it failed to
detect the type, the question is whether we should count byte-stream as one of
the decisions. After thinking about this for some time this week, I have
decided to ignore a byte-stream result from any test when taking the vote for
which type should be used.
e.g.
magic test: Byte-Stream
extension test: GRB
meta hint: none (Byte-Stream)
With the original design, the system is expected to return Byte-Stream as the
final decision for type detection when the magic test trust values (i.e. the
presumed conditional probabilities) are set high.

I am working on the improvement; if you have any thoughts, please let me know.

was (Author: lukeliush):

One of the problems i have been considering is that when magic test fails to
determine a type, it returns byte-stream, in the original design, the
original design approaches this by taking byte-stream as a decision, so e.g.
when the extension test returns a correct type (e.g. GRB) but the magic test
returns the byte-stream meaning it fails to detect the type, the question is
whether we should take byte-stream as part of the decisions. After thinking
about this this week for some time, i decide to ignore this byte-stream
predicted by a test when taking vote for decision on which type should be used.
e.g.
magic test : Byte-Stream
extension test: GRB
meta hint: none(Byte-Stream)
The original design of the system is expected to return Byte-Stream as the
final decision for the type detection when magic test trust values(i.e. the
presumed conditional probabilities) are set to high.

I am working on the improvement; if any thoughts, please kindly let me know.

MIME type selection with probability

Key: TIKA-1517
URL: https://issues.apache.org/jira/browse/TIKA-1517
Project: Tika
Issue Type: Improvement
Components:

[jira] [Comment Edited] (TIKA-1517) MIME type selection with probability

[
https://issues.apache.org/jira/browse/TIKA-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391476#comment-14391476
]

Luke sh edited comment on TIKA-1517 at 4/1/15 10:39 PM:

One of the problems I have been considering is that when the magic test fails
to determine a type, it returns byte-stream. The original design treats that
byte-stream result as a decision. So when the extension test returns a correct
type (e.g. GRB) but the magic test returns byte-stream, meaning it failed to
detect the type, the question is whether we should count byte-stream as one of
the decisions. After thinking about this for some time this week, I have
decided to ignore a byte-stream result from any test when taking the vote for
which type should be used.
e.g.
magic test: Byte-Stream (failed to detect the type)
extension test: GRB
meta hint: none (Byte-Stream)
With the original design, the system is expected to return Byte-Stream as the
final decision for type detection when the magic test trust values (i.e. the
presumed conditional probabilities) are set high.

I am working on the improvement; if you have any thoughts, please let me know.

was (Author: lukeliush):

One of the problems i have been considering is that when magic test fails to
determine a type, it returns byte-stream, in the original design, the
original design approaches this by taking byte-stream as a decision, so e.g.
when the extension test returns a correct type (e.g. GRB) but the magic test
returns the byte-stream meaning it fails to detect the type, the question is
whether we should take byte-stream as part of the decisions. After thinking
about this this week for some time, i decide to ignore this byte-stream
predicted by a test when taking the vote for which type should be used.
e.g.
magic test : Byte-Stream
extension test: GRB
meta hint: none(Byte-Stream)
The original design of the system is expected to return Byte-Stream as the
final decision for the type detection when magic test trust values(i.e. the
presumed conditional probabilities) are set to high.

I am working on the improvement; if any thoughts, please kindly let me know.

MIME type selection with probability

Key: TIKA-1517
URL: https://issues.apache.org/jira/browse/TIKA-1517
Project: Tika
Issue Type: Improvement

[jira] [Comment Edited] (TIKA-1517) MIME type selection with probability

[
https://issues.apache.org/jira/browse/TIKA-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391476#comment-14391476
]

Luke sh edited comment on TIKA-1517 at 4/1/15 10:41 PM:

I am working on the improvement; if you have any thoughts, please let me know.

MIME type selection with probability

Key: TIKA-1517
URL:

[jira] [Comment Edited] (TIKA-1517) MIME type selection with probability

[
https://issues.apache.org/jira/browse/TIKA-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391476#comment-14391476
]

Luke sh edited comment on TIKA-1517 at 4/1/15 10:39 PM:

I am working on the improvement; if you have any thoughts, please let me know.

One of the problems I have been considering is that when the magic test fails
to determine a type, it returns byte-stream. The original design treats that
byte-stream result as a decision. So when the extension test returns a correct
type (e.g. GRB) but the magic test returns byte-stream, meaning it failed to
detect the type, the question is whether we should count byte-stream as one of
the decisions. After thinking about this for some time this week, I have
decided to ignore a byte-stream result from any test when taking the vote for
which type should be used.
e.g.
magic test: Byte-Stream (failed to detect the type)
extension test: GRB
meta hint: none (Byte-Stream)
With the original design, the system is expected to return Byte-Stream as the
final decision for type detection when the magic test trust values (i.e. the
presumed conditional probabilities) are set high.

I am working on the improvement; if you have any thoughts, please let me know.

MIME type selection with probability

Key: TIKA-1517
URL: https://issues.apache.org/jira/browse/TIKA-1517
Project:

[jira] [Updated] (TIKA-1517) MIME type selection with probability