[jira] [Updated] (TIKA-1590) A particular PDF seems to trigger an infinite loop when being converted to HTML

2015-04-01 Thread Konstantin Gribov (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Gribov updated TIKA-1590:

Fix Version/s: 1.8

 A particular PDF seems to trigger an infinite loop when being converted to 
 HTML
 ---

 Key: TIKA-1590
 URL: https://issues.apache.org/jira/browse/TIKA-1590
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.6, 1.7
Reporter: Matt Sheppard
 Fix For: 1.8

 Attachments: National_Audit_tool_CTH_Audit_Report_PDF,_292_KB.pdf, 
 jstack.txt


 The PDF at 
 http://www.comcare.gov.au/__data/assets/pdf_file/0019/117244/National_Audit_tool_CTH_Audit_Report_PDF,_292_KB.pdf
  (which I'll also attach) appears to trigger an infinite loop (or at least is 
 exceedingly slow) when being filtered by Tika.
 {noformat}
 java -jar tika-app-1.7.jar National_Audit_tool_CTH_Audit_Report_PDF,_292_KB.pdf
 <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
 <head>
 <meta name="date" content="2015-02-05T04:48:30Z"/>
 <meta name="pdf:PDFVersion" content="1.6"/>
 <meta name="xmp:CreatorTool" content="Adobe InDesign CC 2014 (Macintosh)"/>
 <meta name="dc:description" content="Licensee Improvement"/>
 <meta name="Keywords" content="Licensee, Improvement, Program, NAT, CTH, Report&#13;&#10;"/>
 <meta name="subject" content="Licensee Improvement"/>
 <meta name="dc:creator" content="Comcare"/>
 <meta name="description" content="Licensee Improvement"/>
 <meta name="dcterms:created" content="2014-10-07T02:46:10Z"/>
 <meta name="Last-Modified" content="2015-02-05T04:48:30Z"/>
 <meta name="dcterms:modified" content="2015-02-05T04:48:30Z"/>
 <meta name="dc:format" content="application/pdf; version=1.6"/>
 <meta name="Last-Save-Date" content="2015-02-05T04:48:30Z"/>
 <meta name="meta:save-date" content="2015-02-05T04:48:30Z"/>
 <meta name="pdf:encrypted" content="false"/>
 <meta name="dc:title" content="Licensee Improvement Program NAT (CTH) audit report"/>
 <meta name="modified" content="2015-02-05T04:48:30Z"/>
 <meta name="cp:subject" content="Licensee Improvement"/>
 <meta name="Content-Length" content="299338"/>
 <meta name="Content-Type" content="application/pdf"/>
 <meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
 <meta name="X-Parsed-By" content="org.apache.tika.parser.pdf.PDFParser"/>
 <meta name="creator" content="Comcare"/>
 <meta name="meta:author" content="Comcare"/>
 <meta name="dc:subject" content="Licensee, Improvement, Program, NAT, CTH, Report&#13;&#10;"/>
 <meta name="trapped" content="False"/>
 <meta name="meta:creation-date" content="2014-10-07T02:46:10Z"/>
 <meta name="created" content="Tue Oct 07 13:46:10 AEDT 2014"/>
 <meta name="xmpTPg:NPages" content="72"/>
 <meta name="Creation-Date" content="2014-10-07T02:46:10Z"/>
 <meta name="resourceName" content="National_Audit_tool_CTH_Audit_Report_PDF,_292_KB.pdf"/>
 <meta name="meta:keyword" content="Licensee, Improvement, Program, NAT, CTH, Report&#13;&#10;"/>
 <meta name="Author" content="Comcare"/>
 <meta name="producer" content="Adobe PDF Library 11.0"/>
 <title>Licensee Improvement Program NAT (CTH) audit report</title>
 </head>
 <body><div class="page"><p/>
 <p>LICENSEE
 IMPROVEMENT
 PROGRAM
 [snip]
 </p>
 <p>Finding:
 </p>
 <p>Evidence:
 </p>
 <p>Comment:
 </p>
 <p>Observation:
 </p>
 <p>Non-conformance:
 </p>
 <p>
 [just appears to hang forever at this point]
 {noformat}
 The relevant thread's stack is something like...
 {noformat}
 main #1 prio=5 os_prio=31 tid=0x7fbd6900b000 nid=0xf07 runnable [0x00010fc18000]
    java.lang.Thread.State: RUNNABLE
    at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:184)
    at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
    at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
    at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
    at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
    at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
    at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
    at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
    at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
    at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
    at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
    at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
    at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
    at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:179)
    at org.apache.pdfbox.pdmodel.interactive.form.PDFieldFactory.isButton(PDFieldFactory.java:157)
   at 

[jira] [Comment Edited] (TIKA-1590) A particular PDF seems to trigger an infinite loop when being converted to HTML

2015-04-01 Thread Konstantin Gribov (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390170#comment-14390170
 ] 

Konstantin Gribov edited comment on TIKA-1590 at 4/1/15 7:53 AM:
-

Fixed in trunk by updating PDFBox to 1.8.9. See also TIKA-1575 and PDFBOX-2261.


was (Author: grossws):
Fixed in trunk by update of pdfbox to 1.8.9. See also TIKA-1575 and PDFBOX-2710.


[jira] [Commented] (TIKA-1590) A particular PDF seems to trigger an infinite loop when being converted to HTML

2015-04-01 Thread Konstantin Gribov (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390177#comment-14390177
 ] 

Konstantin Gribov commented on TIKA-1590:
-

Thank you for the feedback, Matt. 

I think it's the same problem as in PDFBOX-2261. It's currently fixed in
trunk.
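
For readers following the stack trace in the issue description: it shows PDField.findFieldType calling itself repeatedly. The thread does not say what exactly goes wrong inside PDFBox, so the following is only a generic sketch of how a lookup that inherits a value from parent nodes can be bounded with a visited set, so that a malformed, self-referencing hierarchy cannot keep it running forever. FieldNode and FieldTypeLookup are hypothetical names invented for this illustration; they are not PDFBox classes and this is not the PDFBox fix.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-in for a form field; NOT the PDFBox PDField class. */
final class FieldNode {
    final String name;
    final String fieldType; // null means "inherit from the parent"
    FieldNode parent;       // null at the root

    FieldNode(String name, String fieldType) {
        this.name = name;
        this.fieldType = fieldType;
    }
}

final class FieldTypeLookup {
    /**
     * Walks up the parent chain until some node declares a field type.
     * Returns null if no ancestor declares one, or if the chain loops
     * back on itself (assumed here to mean a malformed document).
     */
    static String findFieldType(FieldNode start) {
        Set<FieldNode> seen = new HashSet<>(); // identity-based: FieldNode does not override equals
        for (FieldNode node = start; node != null; node = node.parent) {
            if (!seen.add(node)) {
                return null; // already visited: stop instead of looping forever
            }
            if (node.fieldType != null) {
                return node.fieldType;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        FieldNode root = new FieldNode("root", "Btn");
        FieldNode child = new FieldNode("child", null);
        child.parent = root;
        System.out.println(findFieldType(child)); // prints Btn

        FieldNode a = new FieldNode("a", null);
        FieldNode b = new FieldNode("b", null);
        a.parent = b;
        b.parent = a; // deliberately cyclic
        System.out.println(findFieldType(a)); // prints null
    }
}
```

The iterative loop also avoids deep recursion, which is a separate concern from the cycle check.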


Re: Access Control Allow Origin

2015-04-01 Thread Sergey Beryozkin

Hi Tyler

Sorry for the delay, I was off for the last few days.
The change you made looks fine; the filter can check the annotations or 
can be configured directly (which is what you did).
It might make sense to consider checking a (Java) properties resource as 
a possible future enhancement, as a CORS filter may have many properties.
Maybe if '-cors' is provided, check a well-known class resource 
where all of the CORS properties are set; if it is absent, default to 
'*', otherwise work with the Properties.
The current approach works too. It might be tricky to extend it to support 
more properties, but it is great for a start.
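
The properties-resource idea above could look roughly like the sketch below. It is only an illustration: the resource name /tika-server-cors.properties and the key cors.allowOrigins are made up for this example, and nothing here reflects how tika-server or the CORS filter is actually configured.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

final class CorsSettings {
    // Hypothetical resource name and property key, chosen for this sketch only.
    static final String RESOURCE = "/tika-server-cors.properties";
    static final String ALLOW_ORIGINS_KEY = "cors.allowOrigins";

    /** Loads the properties from the classpath, or returns null if the resource is absent. */
    static Properties loadFromClasspath() throws IOException {
        try (InputStream in = CorsSettings.class.getResourceAsStream(RESOURCE)) {
            if (in == null) {
                return null;
            }
            Properties props = new Properties();
            props.load(in);
            return props;
        }
    }

    /** Falls back to "*" when there are no properties or the key is missing or blank. */
    static String allowedOrigins(Properties props) {
        if (props == null) {
            return "*";
        }
        String value = props.getProperty(ALLOW_ORIGINS_KEY);
        return (value == null || value.trim().isEmpty()) ? "*" : value.trim();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(allowedOrigins(loadFromClasspath()));
    }
}
```

Keeping the fallback in one small method means further CORS properties could be added later without touching the command-line handling.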


Thanks, Sergey




On 27/03/15 18:56, Tyler Palsulich wrote:

Thank you, Sergey! I didn't know about that feature. I am going to try to
work up a patch this weekend which enables CORS. I'll let you know if I run
into any issues.

Thanks again,
Tyler

On Thu, Mar 26, 2015 at 2:39 AM, Mattmann, Chris A (3980) 
chris.a.mattm...@jpl.nasa.gov wrote:




++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: Tyler Palsulich tpalsul...@gmail.com
Reply-To: dev@tika.apache.org dev@tika.apache.org
Date: Tuesday, March 24, 2015 at 3:41 PM
To: dev@tika.apache.org dev@tika.apache.org
Subject: Access Control Allow Origin


Hi Folks,

I took a stab at creating an example website to submit a file to the form
resource of our VM. See http://tpalsulich.github.io/TikaExamples/.

If I try to use AJAX to submit the request to make the page prettier (see
the script in the head of the page, with ev.preventDefault() commented
out), I get the following error:

XMLHttpRequest cannot load http://162.242.228.174:9998/tika/form. No
'Access-Control-Allow-Origin' header is present on the requested resource.
Origin 'http://tpalsulich.github.io' is therefore not allowed access. The
response had HTTP status code 400.

We can't allow the tika-server response header to accept * in general,
since that isn't secure. So, would there be interest in including this
sort of site on the VM? Then, the AJAX request won't be external and we
won't have this error.
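
A middle ground between no header and '*' is to answer only origins from an explicit allow-list. The sketch below shows just that decision, in plain Java with no server attached. OriginCheck and its method are hypothetical names for this illustration and are not part of tika-server.

```java
import java.util.Collections;
import java.util.Set;

final class OriginCheck {
    /**
     * Returns the value to send in Access-Control-Allow-Origin,
     * or null when the header should be omitted.
     */
    static String allowOriginHeader(String requestOrigin, Set<String> allowedOrigins) {
        if (requestOrigin == null) {
            return null; // no Origin header on the request: nothing to echo back
        }
        return allowedOrigins.contains(requestOrigin) ? requestOrigin : null;
    }

    public static void main(String[] args) {
        Set<String> allowed = Collections.singleton("http://tpalsulich.github.io");
        System.out.println(allowOriginHeader("http://tpalsulich.github.io", allowed)); // prints http://tpalsulich.github.io
        System.out.println(allowOriginHeader("http://example.org", allowed));          // prints null
    }
}
```

The comparison is an exact string match on scheme, host and port, since all three are part of an origin.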

The version button just takes you to the version resource on the VM
(doesn't do anything with the file).

Tyler









[jira] [Commented] (TIKA-1590) A particular PDF seems to trigger an infinite loop when being converted to HTML

2015-04-01 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390469#comment-14390469
 ] 

Tim Allison commented on TIKA-1590:
---

Not that this is needed, but I doubly confirmed that this file no longer causes 
a hang with Tika trunk and PDFBox 1.8.9.  Many thanks to [~tilman] and 
[~msahyoun] for fixing this!
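
For anyone who wants to check a file for this kind of hang without watching a terminal, one option is to run the parse under a time limit. The sketch below uses only java.util.concurrent. HangCheck is a hypothetical helper, and the Callable stands in for whatever parse call is being tested; no Tika API is assumed.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class HangCheck {
    /** Returns true if the task completed within the limit, false if it timed out. */
    static boolean finishesWithin(Callable<?> task, long timeoutMillis) {
        // Daemon thread, so a task that ignores interruption cannot keep the JVM alive.
        ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "hang-check");
            t.setDaemon(true);
            return t;
        });
        Future<?> future = pool.submit(task);
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            future.cancel(true); // interrupts the worker; CPU-bound code may not notice
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (ExecutionException e) {
            throw new IllegalStateException("Task failed rather than hung", e.getCause());
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(finishesWithin(() -> "parsed", 1000)); // prints true
        System.out.println(finishesWithin(() -> {
            Thread.sleep(5000); // simulated hang
            return null;
        }, 100)); // prints false
    }
}
```

A timeout only reports that the work did not finish in time; it cannot tell a true infinite loop from a very slow parse, which is the same ambiguity the reporter noted.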


[jira] [Resolved] (TIKA-1590) A particular PDF seems to trigger an infinite loop when being converted to HTML

2015-04-01 Thread Konstantin Gribov (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Gribov resolved TIKA-1590.
-
Resolution: Duplicate

Fixed in trunk by updating PDFBox to 1.8.9. See also TIKA-1575 and PDFBOX-2710.


Re: Access Control Allow Origin

2015-04-01 Thread Tyler Palsulich
Thank you for the feedback!

I think there's an issue (don't remember the number) to be able to specify
a TikaConfig file for tika-server. So, I think that would be the ideal
place to put more complex CORS configuration.

Tyler

On Wed, Apr 1, 2015 at 6:02 AM, Sergey Beryozkin sberyoz...@gmail.com
wrote:








Re: Access Control Allow Origin

2015-04-01 Thread Tyler Palsulich
I'll change the option to -C right now. Just looked closer -- TIKA-1426 is
to provide a config for the server and app on the command line.
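
To make the -C / -c split concrete, here is a rough sketch of the two short options side by side. It is written in plain Java on purpose and does not reproduce the option handling that tika-server actually uses. ServerOptions and the map keys "cors" and "config" are hypothetical names for this illustration.

```java
import java.util.HashMap;
import java.util.Map;

final class ServerOptions {
    /** Parses "-C <origin>" (CORS) and "-c <file>" (Tika config) into a map. */
    static Map<String, String> parse(String[] args) {
        Map<String, String> options = new HashMap<>();
        for (int i = 0; i < args.length; i++) {
            String key;
            switch (args[i]) {
                case "-C":
                    key = "cors";   // upper case: allowed origin for CORS
                    break;
                case "-c":
                    key = "config"; // lower case: path to a Tika config file
                    break;
                default:
                    throw new IllegalArgumentException("Unknown option: " + args[i]);
            }
            if (i + 1 >= args.length) {
                throw new IllegalArgumentException("Missing value for " + args[i]);
            }
            options.put(key, args[++i]); // consume the option's value
        }
        return options;
    }

    public static void main(String[] args) {
        Map<String, String> options = parse(new String[] {"-C", "*", "-c", "tika-config.xml"});
        System.out.println(options.get("cors"));   // prints *
        System.out.println(options.get("config")); // prints tika-config.xml
    }
}
```

Because the two flags differ only by case, the sketch rejects unknown options outright instead of guessing.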

Tyler

On Wed, Apr 1, 2015 at 11:22 AM, Allison, Timothy B. talli...@mitre.org
wrote:

 Might be thinking of TIKA-944?

 Mind if we switch the CORS short option to -C and use -c for the tika
 config file?

 -Original Message-
 From: Tyler Palsulich [mailto:tpalsul...@gmail.com]
 Sent: Wednesday, April 01, 2015 11:13 AM
 To: dev@tika.apache.org
 Subject: Re: Access Control Allow Origin

 Thank you for the feedback!

 I think there's an issue (don't remember the number) to be able to specify
 a TikaConfig file for tika-server. So, I think that would be the ideal
 place to put more complex CORS configuration.

 Tyler

 On Wed, Apr 1, 2015 at 6:02 AM, Sergey Beryozkin sberyoz...@gmail.com
 wrote:

  Hi Tyler
 
  Sorry for a delay, I was off for the last few days,
  The change you did looks fine, the filter can check the annotations or
 can
  be configured directly (which is what you did).
  It might make sense to consider checking a (Java) properties resource as
 a
  possible future enhancement, as a CORS filter may have many properties,
  May be if a '-cors' is provided then check a well-known class resource
  where all of the cors properties are set, if it is absent - default to
 '*'
  otherwise work with Properties...
  The current approach works too, might be tricky to extend it to support
  more properties but great for a start
 
  Thanks, Sergey
 
 
 
 
 
  On 27/03/15 18:56, Tyler Palsulich wrote:
 
  Thank you, Sergey! I didn't know about that feature. I am going to try
 to
  work up a patch this weekend which enables CORS. I'll let you know if I
  run
  into any issues.
 
  Thanks again,
  Tyler
 
  On Thu, Mar 26, 2015 at 2:39 AM, Mattmann, Chris A (3980) 
  chris.a.mattm...@jpl.nasa.gov wrote:
 
 
 
  ++
  Chris Mattmann, Ph.D.
  Chief Architect
  Instrument Software and Science Data Systems Section (398)
  NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
  Office: 168-519, Mailstop: 168-527
  Email: chris.a.mattm...@nasa.gov
  WWW:  http://sunset.usc.edu/~mattmann/
  ++
  Adjunct Associate Professor, Computer Science Department
  University of Southern California, Los Angeles, CA 90089 USA
  ++
 
 
 
 
 
 
  -Original Message-
  From: Tyler Palsulich tpalsul...@gmail.com
  Reply-To: dev@tika.apache.org dev@tika.apache.org
  Date: Tuesday, March 24, 2015 at 3:41 PM
  To: dev@tika.apache.org dev@tika.apache.org
  Subject: Access Control Allow Origin
 
   Hi Folks,
 
  I took a stab at creating an example website to submit a file to the
  form
  resource of our VM. See http://tpalsulich.github.io/TikaExamples/.
 
  If I try to use AJAX to submit the request to make the page prettier (see
  the script in the head of the page, with ev.preventDefault() commented
  out), I get the following error:
 
  XMLHttpRequest cannot load http://162.242.228.174:9998/tika/form. No
  'Access-Control-Allow-Origin' header is present on the requested
  resource.
  Origin 'http://tpalsulich.github.io' is therefore not allowed access.
  The
  response had HTTP status code 400.
 
  We can't allow the tika-server response header to accept * in
 general,
  since that isn't secure. So, would there be interest in including this
  sort
  of site on the VM? Then, the AJAX request won't be external and we
 won't
  have this error.
 
  The version button just takes you to the version resource on the VM
  (doesn't do anything with the file).
 
  Tyler
 
 
 
 
 
 



RE: Access Control Allow Origin

2015-04-01 Thread Allison, Timothy B.
Might be thinking of TIKA-944?

Mind if we switch the CORS short option to -C and use -c for the tika config 
file?

-Original Message-
From: Tyler Palsulich [mailto:tpalsul...@gmail.com] 
Sent: Wednesday, April 01, 2015 11:13 AM
To: dev@tika.apache.org
Subject: Re: Access Control Allow Origin

Thank you for the feedback!

I think there's an issue (don't remember the number) to be able to specify
a TikaConfig file for tika-server. So, I think that would be the ideal
place to put more complex CORS configuration.

Tyler

On Wed, Apr 1, 2015 at 6:02 AM, Sergey Beryozkin sberyoz...@gmail.com
wrote:

 Hi Tyler

 Sorry for the delay, I was off for the last few days.
 The change you made looks fine; the filter can check the annotations or can
 be configured directly (which is what you did).
 It might make sense to consider checking a (Java) properties resource as a
 possible future enhancement, as a CORS filter may have many properties.
 Maybe, if '-cors' is provided, check a well-known class resource
 where all of the CORS properties are set; if it is absent, default to '*',
 otherwise work with the Properties...
 The current approach works too; it might be tricky to extend it to support
 more properties, but it is great for a start.

 Thanks, Sergey





 On 27/03/15 18:56, Tyler Palsulich wrote:

 Thank you, Sergey! I didn't know about that feature. I am going to try to
 work up a patch this weekend which enables CORS. I'll let you know if I
 run
 into any issues.

 Thanks again,
 Tyler

 On Thu, Mar 26, 2015 at 2:39 AM, Mattmann, Chris A (3980) 
 chris.a.mattm...@jpl.nasa.gov wrote:



 ++
 Chris Mattmann, Ph.D.
 Chief Architect
 Instrument Software and Science Data Systems Section (398)
 NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
 Office: 168-519, Mailstop: 168-527
 Email: chris.a.mattm...@nasa.gov
 WWW:  http://sunset.usc.edu/~mattmann/
 ++
 Adjunct Associate Professor, Computer Science Department
 University of Southern California, Los Angeles, CA 90089 USA
 ++






 -Original Message-
 From: Tyler Palsulich tpalsul...@gmail.com
 Reply-To: dev@tika.apache.org dev@tika.apache.org
 Date: Tuesday, March 24, 2015 at 3:41 PM
 To: dev@tika.apache.org dev@tika.apache.org
 Subject: Access Control Allow Origin

  Hi Folks,

 I took a stab at creating an example website to submit a file to the
 form
 resource of our VM. See http://tpalsulich.github.io/TikaExamples/.

 If I try to use AJAX to submit the request to make the page prettier (see
 the script in the head of the page, with ev.preventDefault() commented
 out), I get the following error:

 XMLHttpRequest cannot load http://162.242.228.174:9998/tika/form. No
 'Access-Control-Allow-Origin' header is present on the requested
 resource.
 Origin 'http://tpalsulich.github.io' is therefore not allowed access.
 The
 response had HTTP status code 400.

 We can't allow the tika-server response header to accept * in general,
 since that isn't secure. So, would there be interest in including this
 sort
 of site on the VM? Then, the AJAX request won't be external and we won't
 have this error.

 The version button just takes you to the version resource on the VM
 (doesn't do anything with the file).

 Tyler








[jira] [Comment Edited] (TIKA-1585) Create Example Website with Form Submission

2015-04-01 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390841#comment-14390841
 ] 

Tyler Palsulich edited comment on TIKA-1585 at 4/1/15 3:51 PM:
---

Done. It works. -I'll see if I can shut 9997 down right now.- Port 9997 is now 
closed.


was (Author: tpalsulich):
Done. It works. I'll see if I can shut 9997 down right now.

 Create Example Website with Form Submission
 ---

 Key: TIKA-1585
 URL: https://issues.apache.org/jira/browse/TIKA-1585
 Project: Tika
  Issue Type: New Feature
  Components: example, server
Reporter: Tyler Palsulich
Assignee: Tyler Palsulich

 It would be great to have a website where we can direct people who ask what 
 Tika can do for [filetype] without needing them to actually download Tika.
 Some initial work to do that is 
 [here|http://tpalsulich.github.io/TikaExamples/].
 I'm far from a design guru, but I imagine the site as having a form where you 
 can upload a file at the top, checkboxes for if you want metadata, content, 
 or both, and a submit button. The request should be sent with AJAX and the 
 result should populate a {{div}}.
 One issue with AJAX requests is that Tika Server doesn't currently allow 
 Cross-Origin-Resource-Sharing (CORS). So, we either need to maintain a 
 slightly updated tika-server, or update the server to allow configuration.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [jira] [Commented] (TIKA-1330) Add robust tika-batch code

2015-04-01 Thread Tyler Palsulich
All tests are passing. Only issue I see is excessive logging. The Hudson
failure does just look like a hiccup.

Tyler

On Wed, Apr 1, 2015 at 2:55 PM, Allison, Timothy B. talli...@mitre.org
wrote:

 This looks like a Hudson hiccup.

 Tyler is seeing excessive logging:
 Running org.apache.tika.cli.TikaCLIBatchIntegrationTest
 INFO - about to start driver
 INFO - about to start driver

 Anyone else having problems building from a fresh trunk?


 -Original Message-
 From: Hudson (JIRA) [mailto:j...@apache.org]
 Sent: Wednesday, April 01, 2015 5:36 PM
 To: dev@tika.apache.org
 Subject: [jira] [Commented] (TIKA-1330) Add robust tika-batch code


 [
 https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391512#comment-14391512
 ]

 Hudson commented on TIKA-1330:
 --

 ABORTED: Integrated in tika-trunk-jdk1.7 #596 (See [
 https://builds.apache.org/job/tika-trunk-jdk1.7/596/])
 TIKA-1330 flush stacktrace writers (tallison:
 http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670756)
 *
 /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/FileResourceConsumer.java
 *
 /tika/trunk/tika-batch/src/main/java/org/apache/tika/util/TikaExceptionFilter.java
 TIKA-1330 clean up logging in tika-batch ant tika-app integration of
 tika-batch, take 2 (tallison:
 http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670751)
 *
 /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/FileResourceConsumer.java


  Add robust tika-batch code
  --
 
  Key: TIKA-1330
  URL: https://issues.apache.org/jira/browse/TIKA-1330
  Project: Tika
   Issue Type: Sub-task
   Components: cli, general, server
 Reporter: Tim Allison
 Assignee: Tim Allison
  Attachments: TIKA-1330v1-patch.zip
 
 
  In my current design plan, I see creating a separate component
 tika-batch that includes a small bit of configurable code to run Tika
 against a large batch of documents.  This code should be robust against OOM
 and hangs, and it should have fairly robust logging.






[jira] [Resolved] (TIKA-1591) Tika Parsers uses wrong version of bouncycastle

2015-04-01 Thread Konstantin Gribov (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Gribov resolved TIKA-1591.
-
   Resolution: Fixed
Fix Version/s: 1.8

Updated in r1670802

 Tika Parsers uses wrong version of bouncycastle
 ---

 Key: TIKA-1591
 URL: https://issues.apache.org/jira/browse/TIKA-1591
 Project: Tika
  Issue Type: Bug
Reporter: Ben McCann
Assignee: Konstantin Gribov
 Fix For: 1.8


 Tika uses:
 <dependency>
   <groupId>org.bouncycastle</groupId>
   <artifactId>bcmail-jdk15</artifactId>
   <version>1.45</version>
 </dependency>
 It's not recommended to use bcmail-jdk15, which is the old artifact name. 
 Instead bcmail-jdk15on should be used (latest version is 1.52)





[jira] [Comment Edited] (TIKA-1517) MIME type selection with probability

2015-04-01 Thread Luke sh (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391476#comment-14391476
 ] 

Luke sh edited comment on TIKA-1517 at 4/2/15 1:30 AM:
---

After some research, it looks like the design of the probabilistic MIME type 
selection algorithm causes some confusion when compared with Naive Bayes. The 
idea is borrowed from Naive Bayes, but giving up some of its properties makes 
the algorithm less intuitive, even though it saves some computation.


One problem I have been considering is what to do when the magic test fails to 
determine a type and returns byte-stream. The original design treats 
byte-stream as a decision like any other. So when the extension test returns 
a correct type (e.g. GRB) but the magic test returns byte-stream, meaning it 
failed to detect the type, the question is whether byte-stream should count as 
one of the decisions. After thinking about this for some time this week, I have 
decided to ignore a byte-stream prediction from a test when voting on which 
type should be used.
e.g. 
magic test: Byte-Stream (failed to detect the type)
extension test: GRB
meta hint: none (Byte-Stream)
In this case the original design is expected to return Byte-Stream as the final 
decision for the type detection when the magic test trust values (i.e. the 
presumed conditional probabilities) are set high.

Secondly, I now tend to think that giving up the prior might not be a good 
idea: it simplifies the computation a bit, but it makes the approach more 
confusing and less intuitive.
The intuition is that we can probably treat a detected type as a cause whose 
prior probability of being correct is 50%, i.e. there is a 50% chance that the 
detected type is correct. This seems more intuitive than ignoring the prior 
completely.

It seems there is still some room for the algorithm to be improved and 
optimized.
The original design is intertwined with computational considerations, which 
seems to cause some confusion, whereas the causal reasoning and intuition are 
arguably more important. I will also be optimizing and correcting some of the 
factors that seem less appropriate in the original design.

I am working on the improvement and researching the pros and cons; if you have 
any thoughts, please kindly let me know.
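
To make the voting rule in this comment concrete, here is a minimal sketch. It is not the attached BaysianTest.java and not Tika's detector; the class name, the trust values, and the scoring formula (prior times trust for agreeing tests, times one minus trust for disagreeing tests) are assumptions made for illustration. It also assumes that "byte-stream" means application/octet-stream and uses application/x-grib as a stand-in for GRB.

```java
import java.util.LinkedHashSet;
import java.util.Set;

final class MimeVoteSketch {
    // Assumption: "byte-stream" in the comment corresponds to this fallback type.
    static final String UNKNOWN = "application/octet-stream";
    // Assumption: every detected type starts with a 50% prior of being correct.
    static final double PRIOR = 0.5;

    // A test that returned nothing or the fallback type has failed to detect a type.
    static boolean isUnknown(String vote) {
        return vote == null || UNKNOWN.equals(vote);
    }

    /**
     * votes[i] is the type reported by test i (magic, extension, meta hint);
     * trust[i] is the presumed probability that test i reports the correct type.
     */
    static String select(String[] votes, double[] trust) {
        Set<String> candidates = new LinkedHashSet<String>();
        for (String vote : votes) {
            if (!isUnknown(vote)) {
                candidates.add(vote);
            }
        }
        String best = UNKNOWN;
        double bestScore = 0.0;
        for (String candidate : candidates) {
            double score = PRIOR;
            for (int i = 0; i < votes.length; i++) {
                if (isUnknown(votes[i])) {
                    continue; // failed tests do not take part in the vote
                }
                score *= votes[i].equals(candidate) ? trust[i] : 1.0 - trust[i];
            }
            if (score > bestScore) {
                bestScore = score;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // The example from the comment: magic and meta hint fail, extension says GRB.
        String[] votes = {UNKNOWN, "application/x-grib", UNKNOWN};
        double[] trust = {0.9, 0.6, 0.7};
        System.out.println(select(votes, trust));
    }
}
```

With these rules a high trust value on the magic test no longer forces the fallback type when the magic test simply failed, while a confident magic result still outweighs a conflicting extension.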


   


was (Author: lukeliush):
After some research, it looks like the design of the probabilistic MIME type 
selection algorithm causes some confusion when compared with Naive Bayes. The 
idea is borrowed from Naive Bayes, but giving up some of its properties makes 
the algorithm less intuitive, even though it saves some computation.


One problem I have been considering is what to do when the magic test fails to 
determine a type and returns byte-stream. The original design treats 
byte-stream as a decision like any other. So when the extension test returns 
a correct type (e.g. GRB) but the magic test returns byte-stream, meaning it 
failed to detect the type, the question is whether byte-stream should count as 
one of the decisions. After thinking about this for some time this week, I have 
decided to ignore a byte-stream prediction from a test when voting on which 
type should be used.
e.g. 
magic test: Byte-Stream (failed to detect the type)
extension test: GRB
meta hint: none (Byte-Stream)
In this case the original design is expected to return Byte-Stream as the final 
decision for the type detection when the magic test trust values (i.e. the 
presumed conditional probabilities) are set high.

Secondly, I now tend to think that giving up the prior might not be a good 
idea: it simplifies the computation a bit, but it makes the approach more 
confusing and less intuitive.
The intuition is that we can probably treat a detected type as a cause whose 
prior probability of being correct is 50%, i.e. there is a 50% chance that the 
detected type is correct. This seems more intuitive than ignoring the prior 
completely.

It seems there is still some room for the algorithm to be improved and 
optimized.
The original design is intertwined with computational considerations, which 
seems to cause some confusion, whereas the causal reasoning and intuition are 
arguably more important. I will also be optimizing and correcting some of the 
factors that seem less appropriate in the original design.

I am working on the improvement; if you have any thoughts, please kindly let 
me know.


   

 MIME type selection with probability
 

 Key: 

[jira] [Commented] (TIKA-1330) Add robust tika-batch code

2015-04-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392027#comment-14392027
 ] 

Hudson commented on TIKA-1330:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #599 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/599/])
TIKA-1330 fix logging in TikaCLI to avoid adding multiple appenders (tallison: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670804)
* /tika/trunk/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java
* /tika/trunk/tika-app/src/main/resources/log4j.properties


 Add robust tika-batch code
 --

 Key: TIKA-1330
 URL: https://issues.apache.org/jira/browse/TIKA-1330
 Project: Tika
  Issue Type: Sub-task
  Components: cli, general, server
Reporter: Tim Allison
Assignee: Tim Allison
 Attachments: TIKA-1330v1-patch.zip


 In my current design plan, I see creating a separate component tika-batch 
 that includes a small bit of configurable code to run Tika against a large 
 batch of documents.  This code should be robust against OOM and hangs, and it 
 should have fairly robust logging.





[jira] [Commented] (TIKA-1323) Improve exception reporting in JAX-RS server

2015-04-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392026#comment-14392026
 ] 

Hudson commented on TIKA-1323:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #599 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/599/])
TIKA-1323: flush writer when printing stack trace (tallison: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670807)
* 
/tika/trunk/tika-server/src/main/java/org/apache/tika/server/TikaServerParseExceptionMapper.java


 Improve exception reporting in JAX-RS server
 

 Key: TIKA-1323
 URL: https://issues.apache.org/jira/browse/TIKA-1323
 Project: Tika
  Issue Type: Improvement
  Components: server
Reporter: Tim Allison
Priority: Minor

 I'd like to use tika-server for TIKA-1302.  As part of that, I'd like to 
 record exception stacktraces per document.  I see two options: transmit the 
 info back to the client (assuming a doc didn't bring the server down :) ) 
 along with the current error code or log the document id and stacktrace via 
 the server.  Given my current design thoughts, I'd prefer the first option.
 Any objections or recommendations?
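
As a rough illustration of the first option (sending the stack trace back to the client), the sketch below renders a Throwable into a string that could serve as a response body. It is not the code in TikaServerParseExceptionMapper; the class and method names are hypothetical. It does show why the commit above mentions flushing: a writer layered over a byte stream buffers its output, so the text only reaches the underlying stream after flush().

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;

final class StackTraceText {
    /** Renders the full stack trace of t as UTF-8 text (hypothetical helper). */
    static String render(Throwable t) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        PrintWriter writer =
                new PrintWriter(new OutputStreamWriter(bytes, StandardCharsets.UTF_8));
        t.printStackTrace(writer);
        // Without this flush the encoder's buffer may still hold the text,
        // and the byte array could come back empty or truncated.
        writer.flush();
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(render(new IllegalStateException("parse failed")));
    }
}
```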





RE: [jira] [Commented] (TIKA-1330) Add robust tika-batch code

2015-04-01 Thread Allison, Timothy B.
The duplication/triplication of 
 INFO - about to start driver
 INFO - about to start driver

happened because main() adds a new appender with .configure()...so subsequent 
calls to main() in the tests were adding more appenders.  I just fixed that in 
r1670804.

What I can't figure out is why you're seeing anything.  I've redirected both 
stdout and stderr in setup() (annotated @Before) to ByteArrayOutputStreams.

If setup() weren't being called, you'd get NPEs for each of the four tests, so 
setup() must be getting called.
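
For anyone following along, the failure mode described above can be sketched like this. The actual change in r1670804 is not shown in this thread, and tika-app uses log4j; the sketch swaps in java.util.logging so it runs without dependencies, and the class and logger names are made up. The point is only that logging setup inside main() needs a guard when tests call main() repeatedly.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.logging.ConsoleHandler;
import java.util.logging.Logger;

final class LoggingSetup {
    private static final AtomicBoolean CONFIGURED = new AtomicBoolean(false);
    // Held in a static field so the logger and its handlers stay alive between calls.
    static final Logger LOG = Logger.getLogger("example.cli");

    /** Attaches the console handler at most once, however often it is called. */
    static void configureOnce() {
        if (CONFIGURED.compareAndSet(false, true)) {
            LOG.setUseParentHandlers(false);
            LOG.addHandler(new ConsoleHandler());
        }
    }

    public static void main(String[] args) {
        configureOnce(); // an unguarded addHandler here would duplicate every message
        LOG.info("about to start driver");
    }
}
```

Calling main() three times from a test leaves a single handler attached, so each message is printed once.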

-Original Message-
From: Tyler Palsulich [mailto:tpalsul...@gmail.com] 
Sent: Wednesday, April 01, 2015 7:39 PM
To: dev@tika.apache.org
Subject: Re: [jira] [Commented] (TIKA-1330) Add robust tika-batch code

All tests are passing. Only issue I see is excessive logging. The Hudson
failure does just look like a hiccup.

Tyler

On Wed, Apr 1, 2015 at 2:55 PM, Allison, Timothy B. talli...@mitre.org
wrote:

 This looks like a Hudson hiccup.

 Tyler is seeing excessive logging:
 Running org.apache.tika.cli.TikaCLIBatchIntegrationTest
 INFO - about to start driver
 INFO - about to start driver

 Anyone else having problems building from a fresh trunk?


 -Original Message-
 From: Hudson (JIRA) [mailto:j...@apache.org]
 Sent: Wednesday, April 01, 2015 5:36 PM
 To: dev@tika.apache.org
 Subject: [jira] [Commented] (TIKA-1330) Add robust tika-batch code


 [
 https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391512#comment-14391512
 ]

 Hudson commented on TIKA-1330:
 --

 ABORTED: Integrated in tika-trunk-jdk1.7 #596 (See [
 https://builds.apache.org/job/tika-trunk-jdk1.7/596/])
 TIKA-1330 flush stacktrace writers (tallison:
 http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670756)
 *
 /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/FileResourceConsumer.java
 *
 /tika/trunk/tika-batch/src/main/java/org/apache/tika/util/TikaExceptionFilter.java
 TIKA-1330 clean up logging in tika-batch ant tika-app integration of
 tika-batch, take 2 (tallison:
 http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670751)
 *
 /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/FileResourceConsumer.java


  Add robust tika-batch code
  --
 
  Key: TIKA-1330
  URL: https://issues.apache.org/jira/browse/TIKA-1330
  Project: Tika
   Issue Type: Sub-task
   Components: cli, general, server
 Reporter: Tim Allison
 Assignee: Tim Allison
  Attachments: TIKA-1330v1-patch.zip
 
 
  In my current design plan, I see creating a separate component
 tika-batch that includes a small bit of configurable code to run Tika
 against a large batch of documents.  This code should be robust against OOM
 and hangs, and it should have fairly robust logging.






[jira] [Commented] (TIKA-1590) A particular PDF seems to trigger an infinite loop when being converted to HTML

2015-04-01 Thread Matt Sheppard (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392196#comment-14392196
 ] 

Matt Sheppard commented on TIKA-1590:
-

Great, thanks - Looking forward to 1.8!

 A particular PDF seems to trigger an infinite loop when being converted to 
 HTML
 ---

 Key: TIKA-1590
 URL: https://issues.apache.org/jira/browse/TIKA-1590
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.6, 1.7
Reporter: Matt Sheppard
 Fix For: 1.8

 Attachments: National_Audit_tool_CTH_Audit_Report_PDF,_292_KB.pdf, 
 jstack.txt


 The PDF at 
 http://www.comcare.gov.au/__data/assets/pdf_file/0019/117244/National_Audit_tool_CTH_Audit_Report_PDF,_292_KB.pdf
  (which I'll also attach) appears to trigger an infinite loop (or at least is 
 exceedingly slow) when being filtered by Tika.
 {noformat}
 java -jar tika-app-1.7.jar 
 National_Audit_tool_CTH_Audit_Report_PDF,_292_KB.pdf
 <?xml version="1.0" encoding="UTF-8"?><html 
 xmlns="http://www.w3.org/1999/xhtml">
 <head>
 <meta name="date" content="2015-02-05T04:48:30Z"/>
 <meta name="pdf:PDFVersion" content="1.6"/>
 <meta name="xmp:CreatorTool" content="Adobe InDesign CC 2014 (Macintosh)"/>
 <meta name="dc:description" content="Licensee Improvement"/>
 <meta name="Keywords" content="Licensee, Improvement, Program, NAT, CTH, 
 Report&#13;&#10;"/>
 <meta name="subject" content="Licensee Improvement"/>
 <meta name="dc:creator" content="Comcare"/>
 <meta name="description" content="Licensee Improvement"/>
 <meta name="dcterms:created" content="2014-10-07T02:46:10Z"/>
 <meta name="Last-Modified" content="2015-02-05T04:48:30Z"/>
 <meta name="dcterms:modified" content="2015-02-05T04:48:30Z"/>
 <meta name="dc:format" content="application/pdf; version=1.6"/>
 <meta name="Last-Save-Date" content="2015-02-05T04:48:30Z"/>
 <meta name="meta:save-date" content="2015-02-05T04:48:30Z"/>
 <meta name="pdf:encrypted" content="false"/>
 <meta name="dc:title" content="Licensee Improvement Program NAT (CTH) audit 
 report"/>
 <meta name="modified" content="2015-02-05T04:48:30Z"/>
 <meta name="cp:subject" content="Licensee Improvement"/>
 <meta name="Content-Length" content="299338"/>
 <meta name="Content-Type" content="application/pdf"/>
 <meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
 <meta name="X-Parsed-By" content="org.apache.tika.parser.pdf.PDFParser"/>
 <meta name="creator" content="Comcare"/>
 <meta name="meta:author" content="Comcare"/>
 <meta name="dc:subject" content="Licensee, Improvement, Program, NAT, CTH, 
 Report&#13;&#10;"/>
 <meta name="trapped" content="False"/>
 <meta name="meta:creation-date" content="2014-10-07T02:46:10Z"/>
 <meta name="created" content="Tue Oct 07 13:46:10 AEDT 2014"/>
 <meta name="xmpTPg:NPages" content="72"/>
 <meta name="Creation-Date" content="2014-10-07T02:46:10Z"/>
 <meta name="resourceName" 
 content="National_Audit_tool_CTH_Audit_Report_PDF,_292_KB.pdf"/>
 <meta name="meta:keyword" content="Licensee, Improvement, Program, NAT, CTH, 
 Report&#13;&#10;"/>
 <meta name="Author" content="Comcare"/>
 <meta name="producer" content="Adobe PDF Library 11.0"/>
 <title>Licensee Improvement Program NAT (CTH) audit report</title>
 </head>
 <body><div class="page"><p/>
 <p>LICENSEE
 IMPROVEMENT
 PROGRAM
 [snip]
 </p>
 <p>Finding:
 </p>
 <p>Evidence:
 </p>
 <p>Comment:
 </p>
 <p>Observation:
 </p>
 <p>Non-conformance:
 </p>
 <p>
 [just appears to hang forever at this point]
 {noformat}
 The relevant thread's stack is something like...
 {noformat}
 main #1 prio=5 os_prio=31 tid=0x7fbd6900b000 nid=0xf07 runnable 
 [0x00010fc18000]
java.lang.Thread.State: RUNNABLE
   at 
 org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:184)
   at 
 org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
   at 
 org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
   at 
 org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
   at 
 org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
   at 
 org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
   at 
 org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
   at 
 org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
   at 
 org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
   at 
 org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
   at 
 org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
   at 
 org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
   at 
 org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
   at 
 org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:179)
   at 
 

[jira] [Commented] (TIKA-1330) Add robust tika-batch code

2015-04-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391255#comment-14391255
 ] 

Hudson commented on TIKA-1330:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #595 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/595/])
TIKA-1330 clean up logging in tika-batch ant tika-app integration of tika-batch 
(tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670749)
* 
/tika/trunk/tika-app/src/main/java/org/apache/tika/cli/BatchCommandLineBuilder.java
* /tika/trunk/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java
* /tika/trunk/tika-app/src/main/resources/log4j_batch_process.properties
* 
/tika/trunk/tika-app/src/test/java/org/apache/tika/cli/TikaCLIBatchIntegrationTest.java
* /tika/trunk/tika-app/src/test/java/org/apache/tika/cli/TikaCLITest.java
* /tika/trunk/tika-app/src/test/resources/log4j_batch_process_test.properties
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/BatchProcess.java
* 
/tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/BatchProcessDriverCLI.java
* 
/tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/FileResourceConsumer.java
* 
/tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/ParallelFileProcessingResult.java
* 
/tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/fs/AbstractFSConsumer.java
* 
/tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/fs/BasicTikaFSConsumer.java
* 
/tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/fs/FSBatchProcessCLI.java
* 
/tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/fs/RecursiveParserWrapperFSConsumer.java
* /tika/trunk/tika-batch/src/test/resources/log4j.properties
* /tika/trunk/tika-batch/src/test/resources/log4j_process.properties


 Add robust tika-batch code
 --

 Key: TIKA-1330
 URL: https://issues.apache.org/jira/browse/TIKA-1330
 Project: Tika
  Issue Type: Sub-task
  Components: cli, general, server
Reporter: Tim Allison
Assignee: Tim Allison
 Attachments: TIKA-1330v1-patch.zip


 In my current design plan, I see creating a separate component tika-batch 
 that includes a small bit of configurable code to run Tika against a large 
 batch of documents.  This code should be robust against OOM and hangs, and it 
 should have fairly robust logging.





[jira] [Commented] (TIKA-1586) Enable CORS on Tika Server

2015-04-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391050#comment-14391050
 ] 

Hudson commented on TIKA-1586:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #594 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/594/])
TIKA-1586. Change CORS short option to -C. (tpalsulich: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670683)
* 
/tika/trunk/tika-server/src/main/java/org/apache/tika/server/TikaServerCli.java


 Enable CORS on Tika Server
 --

 Key: TIKA-1586
 URL: https://issues.apache.org/jira/browse/TIKA-1586
 Project: Tika
  Issue Type: New Feature
  Components: server
Reporter: Tyler Palsulich
Assignee: Tyler Palsulich

 Tika Server should allow configuration of CORS requests (for uses like 
 TIKA-1585). See [this example|http://cxf.apache.org/docs/jax-rs-cors.html] 
 from CXF for how to add it.
 The only change from that site is that we will need to add a 
 {{CrossOriginResourceSharingFilter}} as a provider.
 Ideally, this is configurable (limit which resources have CORS, and which 
 origins are allowed). But, I'm not thinking of any general methods of how to 
 do that...
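
One possible shape for the "which origins are allowed" part, along the lines of the properties resource suggested in the CORS thread, is sketched below. This is not what tika-server does: the class name and the property key cors.allowOrigins are invented for illustration, and wiring the resulting list into a filter such as CXF's CrossOriginResourceSharingFilter is left out. The sketch only covers reading a comma-separated origin list and falling back to '*' when nothing is configured.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Properties;

final class CorsSettings {
    // Hypothetical property key; not an existing Tika or CXF setting.
    static final String ALLOW_ORIGINS_KEY = "cors.allowOrigins";

    /**
     * Parses java.util.Properties text and returns the allowed origins.
     * A null text or a missing key yields the permissive default "*".
     */
    static List<String> allowedOrigins(String propertiesText) {
        Properties props = new Properties();
        if (propertiesText != null) {
            try {
                props.load(new StringReader(propertiesText));
            } catch (IOException e) {
                throw new IllegalStateException("could not read CORS properties", e);
            }
        }
        String raw = props.getProperty(ALLOW_ORIGINS_KEY, "*");
        List<String> origins = new ArrayList<String>();
        for (String part : raw.split(",")) {
            String origin = part.trim();
            if (!origin.isEmpty()) {
                origins.add(origin);
            }
        }
        if (origins.isEmpty()) {
            origins.add("*"); // an empty value is treated like a missing one
        }
        return Collections.unmodifiableList(origins);
    }

    public static void main(String[] args) {
        System.out.println(allowedOrigins("cors.allowOrigins=http://tpalsulich.github.io"));
    }
}
```

Limiting which resources have CORS would still need something beyond a flat origin list, which is the harder part raised above.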





RE: [jira] [Commented] (TIKA-1330) Add robust tika-batch code

2015-04-01 Thread Allison, Timothy B.
This looks like a Hudson hiccup.

Tyler is seeing excessive logging:
Running org.apache.tika.cli.TikaCLIBatchIntegrationTest
INFO - about to start driver
INFO - about to start driver

Anyone else having problems building from a fresh trunk?


-Original Message-
From: Hudson (JIRA) [mailto:j...@apache.org] 
Sent: Wednesday, April 01, 2015 5:36 PM
To: dev@tika.apache.org
Subject: [jira] [Commented] (TIKA-1330) Add robust tika-batch code


[ 
https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391512#comment-14391512
 ] 

Hudson commented on TIKA-1330:
--

ABORTED: Integrated in tika-trunk-jdk1.7 #596 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/596/])
TIKA-1330 flush stacktrace writers (tallison: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670756)
* 
/tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/FileResourceConsumer.java
* 
/tika/trunk/tika-batch/src/main/java/org/apache/tika/util/TikaExceptionFilter.java
TIKA-1330 clean up logging in tika-batch ant tika-app integration of 
tika-batch, take 2 (tallison: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670751)
* 
/tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/FileResourceConsumer.java


 Add robust tika-batch code
 --

 Key: TIKA-1330
 URL: https://issues.apache.org/jira/browse/TIKA-1330
 Project: Tika
  Issue Type: Sub-task
  Components: cli, general, server
Reporter: Tim Allison
Assignee: Tim Allison
 Attachments: TIKA-1330v1-patch.zip


 In my current design plan, I see creating a separate component tika-batch 
 that includes a small bit of configurable code to run Tika against a large 
 batch of documents.  This code should be robust against OOM and hangs, and it 
 should have fairly robust logging.





[jira] [Commented] (TIKA-1517) MIME type selection with probability

2015-04-01 Thread Luke sh (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391476#comment-14391476
 ] 

Luke sh commented on TIKA-1517:
---

After some research, it looks like the algorithm design with probabilistic mime 
type selection seems to be cause some confusion vs Naive Bayesian, the idea is 
borrowed from Naive Bayesian, but it turns out giving up on some properties of 
Naive Bayesian seems to cause some confusions and therefore less intuitive 
although some computations are skimped.


One of the problems I have been considering is that when the magic test fails
to determine a type, it returns byte-stream, and the original design treats
that byte-stream as a decision. So, for example, when the extension test
returns a correct type (e.g. GRB) but the magic test returns byte-stream,
meaning it failed to detect the type, the question is whether we should take
byte-stream as part of the decision. After thinking about this for some time
this week, I have decided to ignore a byte-stream prediction from a test when
taking the vote on which type should be used.
e.g.
magic test: Byte-Stream
extension test: GRB
meta hint: none (Byte-Stream)
In the original design, the system is expected to return Byte-Stream as the
final decision for the type detection when the magic test trust values (i.e.
the presumed conditional probabilities) are set high.
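
The voting rule described above can be sketched in Java as follows. This is only an illustration under stated assumptions, not the code from the attached file: the class `VoteSketch`, the detector keys (`magic`, `extension`, `metaHint`), the trust values, and the `application/x-grib` string used for GRB are all hypothetical, and summing trust values is a simplification standing in for the real probability computation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class VoteSketch {
    static final String OCTET_STREAM = "application/octet-stream";

    // Sums the trust of every detector that voted for a concrete type and
    // returns the type with the highest total. Byte-stream votes are skipped.
    static String vote(Map<String, String> predictions, Map<String, Double> trust) {
        Map<String, Double> scores = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : predictions.entrySet()) {
            String type = e.getValue();
            if (type == null || OCTET_STREAM.equals(type)) {
                continue; // "failed to detect" is not counted as a decision
            }
            scores.merge(type, trust.getOrDefault(e.getKey(), 0.0), Double::sum);
        }
        String best = OCTET_STREAM; // fallback when no detector found a type
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            if (e.getValue() > bestScore) {
                best = e.getKey();
                bestScore = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, String> predictions = new LinkedHashMap<>();
        predictions.put("magic", OCTET_STREAM);             // magic test failed
        predictions.put("extension", "application/x-grib"); // hypothetical GRB type
        predictions.put("metaHint", OCTET_STREAM);          // no hint supplied

        Map<String, Double> trust = new LinkedHashMap<>();
        trust.put("magic", 0.9);
        trust.put("extension", 0.8);
        trust.put("metaHint", 0.8);

        System.out.println(vote(predictions, trust)); // prints application/x-grib
    }
}
```

With byte-stream votes skipped, the extension test decides the outcome here even though the magic test carries the highest trust value.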

Secondly, I tend to think that giving up on the prior might not be a good
idea: it simplifies the computation a bit, but it causes some confusion and is
less intuitive.
The intuition is that we can probably treat a detected type as a cause whose
prior probability of being correct is 50%, i.e. there is a 50% chance that the
detected type is correct. This seems more intuitive than completely ignoring
the prior.
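
To make the role of the 50% prior concrete, here is a minimal Naive-Bayes-style sketch. It is an assumption-laden illustration rather than the proposed implementation: the class `PriorSketch`, the method `posterior`, and the example likelihoods are hypothetical. It treats "the detected type is correct" as a binary hypothesis and multiplies per-test likelihoods under the usual independence assumption.

```java
class PriorSketch {
    // prior:          P(detected type is correct) before looking at any test
    // pGivenCorrect:  for each test, P(test reports this type | type is correct)
    // pGivenWrong:    for each test, P(test reports this type | type is wrong)
    // Returns P(type is correct | all tests reported it).
    static double posterior(double prior, double[] pGivenCorrect, double[] pGivenWrong) {
        double correct = prior;
        double wrong = 1.0 - prior;
        for (int i = 0; i < pGivenCorrect.length; i++) {
            correct *= pGivenCorrect[i];
            wrong *= pGivenWrong[i];
        }
        return correct / (correct + wrong); // normalize over both hypotheses
    }

    public static void main(String[] args) {
        // One test with trust 0.9 and a 50% prior: posterior is about 0.9.
        System.out.println(posterior(0.5, new double[] {0.9}, new double[] {0.1}));
        // Two agreeing tests push the posterior higher, to about 0.973.
        System.out.println(posterior(0.5, new double[] {0.9, 0.8}, new double[] {0.1, 0.2}));
    }
}
```

With a 50% prior the prior terms cancel in the ratio, so the result depends only on the likelihoods; keeping the prior explicit still lets a different value be plugged in later.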

After thinking about this problem for a while, it seems there is still some
room for correction and optimization.
The original design is intertwined with computational considerations, which
seems to cause some confusion, but the causal reasoning and intuition may
actually be more important. I will also be optimizing and correcting some of
the factors that seem less appropriate in the original design.

I am working on the improvement; if you have any thoughts, please kindly let
me know.


   

 MIME type selection with probability
 

 Key: TIKA-1517
 URL: https://issues.apache.org/jira/browse/TIKA-1517
 Project: Tika
  Issue Type: Improvement
  Components: mime
Affects Versions: 0.1-incubating, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 
 0.10, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6
Reporter: Luke sh
 Attachments: BaysianTest.java


 Improvement and intuition
 The original implementation of MIME type selection/detection is somewhat
 inflexible by design, as it relies heavily on the outcome of magic-bytes MIME
 type identification. For example, if magic-bytes detection is applicable to a
 file, Tika will follow the file type detected by magic bytes. It may be better
 to provide more control over the choice of method.
 The proposed approach loosely incorporates Bayes' theorem: users can assign a
 weight, expressed as a probability, to each approach, giving them control
 over which of the MIME type identification methods available in Tika they
 prefer. There are currently 3 methods for identifying the MIME type in Tika
 (magic bytes, file extension, and the metadata content-type hint). With
 weights on each approach, users can choose which method they trust most,
 although the magic-bytes method is often the most trustworthy. The benefit is
 that in some situations file type identification must be strict: some users
 might want all of the MIME type identification methods to agree on the same
 file type before they start processing files, because incorrect file type
 identification is less tolerable. The current implementation seems too
 inflexible for this purpose and relies heavily on the magic-bytes
 identification method (although magic bytes is the most reliable of the 3).
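
The strict case mentioned above, where every identification method must agree before a file is processed, could be checked along these lines. This is a hypothetical sketch: `AgreementSketch` and `agreedType` are illustrative names and not part of Tika.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;

class AgreementSketch {
    // Returns the common type if every method reported the same one,
    // or an empty Optional if the methods disagree (or none reported).
    static Optional<String> agreedType(Collection<String> detected) {
        Set<String> distinct = new HashSet<>(detected);
        if (distinct.size() != 1) {
            return Optional.empty();
        }
        return Optional.ofNullable(distinct.iterator().next());
    }

    public static void main(String[] args) {
        System.out.println(agreedType(
                Arrays.asList("application/pdf", "application/pdf", "application/pdf")).isPresent());
        System.out.println(agreedType(
                Arrays.asList("application/pdf", "text/plain", "application/pdf")).isPresent());
    }
}
```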
 Proposed design:
 The idea is to attach a probability as a weight to each MIME type
 identification method currently implemented in Tika (the magic-bytes
 approach, file extension match, and metadata content-type hint).
 For example, as a user, I would probably like to assign a preference to each
 method based on how much I trust it, and to order the results if they don't
 coincide.
 Bayes' rule seems appropriate here to match that intuition.
 The following is what is needed to implement Bayes' rule:
  Prior probability 

[jira] [Commented] (TIKA-1330) Add robust tika-batch code

2015-04-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391512#comment-14391512
 ] 

Hudson commented on TIKA-1330:
--

ABORTED: Integrated in tika-trunk-jdk1.7 #596 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/596/])
TIKA-1330 flush stacktrace writers (tallison: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670756)
* 
/tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/FileResourceConsumer.java
* 
/tika/trunk/tika-batch/src/main/java/org/apache/tika/util/TikaExceptionFilter.java
TIKA-1330 clean up logging in tika-batch ant tika-app integration of 
tika-batch, take 2 (tallison: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670751)
* 
/tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/FileResourceConsumer.java


 Add robust tika-batch code
 --

 Key: TIKA-1330
 URL: https://issues.apache.org/jira/browse/TIKA-1330
 Project: Tika
  Issue Type: Sub-task
  Components: cli, general, server
Reporter: Tim Allison
Assignee: Tim Allison
 Attachments: TIKA-1330v1-patch.zip


 In my current design plan, I see creating a separate component tika-batch 
 that includes a small bit of configurable code to run Tika against a large 
 batch of documents.  This code should be robust against OOM and hangs, and it 
 should have fairly robust logging.





[jira] [Created] (TIKA-1591) Tika Parsers uses wrong version of bouncycastle

2015-04-01 Thread Ben McCann (JIRA)
Ben McCann created TIKA-1591:


 Summary: Tika Parsers uses wrong version of bouncycastle
 Key: TIKA-1591
 URL: https://issues.apache.org/jira/browse/TIKA-1591
 Project: Tika
  Issue Type: Bug
Reporter: Ben McCann


Tika uses:

<dependency>
  <groupId>org.bouncycastle</groupId>
  <artifactId>bcmail-jdk15</artifactId>
  <version>1.45</version>
</dependency>

It's not recommended to use bcmail-jdk15, which is the old artifact name. 
Instead bcmail-jdk15on should be used (latest version is 1.52)





[jira] [Comment Edited] (TIKA-1517) MIME type selection with probability

2015-04-01 Thread Luke sh (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391476#comment-14391476
 ] 

Luke sh edited comment on TIKA-1517 at 4/1/15 10:38 PM:


After some research, it looks like the algorithm design for probabilistic MIME
type selection causes some confusion when compared with Naive Bayes. The idea
is borrowed from Naive Bayes, but giving up some of its properties makes the
design confusing and less intuitive, even though it saves some computation.


One of the problems I have been considering is that when the magic test fails
to determine a type, it returns byte-stream, and the original design treats
that byte-stream as a decision. So, for example, when the extension test
returns a correct type (e.g. GRB) but the magic test returns byte-stream,
meaning it failed to detect the type, the question is whether we should take
byte-stream as part of the decision. After thinking about this for some time
this week, I have decided to ignore a byte-stream prediction from a test when
taking the vote on which type should be used.
e.g.
magic test: Byte-Stream
extension test: GRB
meta hint: none (Byte-Stream)
In the original design, the system is expected to return Byte-Stream as the
final decision for the type detection when the magic test trust values (i.e.
the presumed conditional probabilities) are set high.

Secondly, I tend to think that giving up on the prior might not be a good
idea: it simplifies the computation a bit, but it causes some confusion and is
less intuitive.
The intuition is that we can probably treat a detected type as a cause whose
prior probability of being correct is 50%, i.e. there is a 50% chance that the
detected type is correct. This seems more intuitive than completely ignoring
the prior.

After thinking about this problem for a while, it seems there is still some
room for correction and optimization.
The original design is intertwined with computational considerations, which
seems to cause some confusion, but the causal reasoning and intuition may
actually be more important. I will also be optimizing and correcting some of
the factors that seem less appropriate in the original design.

I am working on the improvement; if you have any thoughts, please kindly let
me know.


   



 MIME type selection with probability
 

 Key: TIKA-1517
 URL: https://issues.apache.org/jira/browse/TIKA-1517
 Project: Tika
  Issue Type: Improvement
  Components: 

[jira] [Comment Edited] (TIKA-1517) MIME type selection with probability

2015-04-01 Thread Luke sh (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391476#comment-14391476
 ] 

Luke sh edited comment on TIKA-1517 at 4/1/15 10:37 PM:


After some research, it looks like the algorithm design for probabilistic MIME
type selection causes some confusion when compared with Naive Bayes. The idea
is borrowed from Naive Bayes, but giving up some of its properties makes the
design confusing and less intuitive, even though it saves some computation.


One of the problems I have been considering is that when the magic test fails
to determine a type, it returns byte-stream, and the original design treats
that byte-stream as a decision. So, for example, when the extension test
returns a correct type (e.g. GRB) but the magic test returns byte-stream,
meaning it failed to detect the type, the question is whether we should take
byte-stream as part of the decision. After thinking about this for some time
this week, I have decided to ignore a byte-stream prediction from a test when
taking the vote on which type should be used.
e.g.
magic test: Byte-Stream
extension test: GRB
meta hint: none (Byte-Stream)
In the original design, the system is expected to return Byte-Stream as the
final decision for the type detection when the magic test trust values (i.e.
the presumed conditional probabilities) are set high.

Secondly, I tend to think that giving up on the prior might not be a good
idea: it simplifies the computation a bit, but it causes some confusion and is
less intuitive.
The intuition is that we can probably treat a detected type as a cause whose
prior probability of being correct is 50%, i.e. there is a 50% chance that the
detected type is correct. This seems more intuitive than completely ignoring
the prior.

After thinking about this problem for a while, it seems there is still some
room for correction and optimization.
The original design is intertwined with computational considerations, which
seems to cause some confusion, but the causal reasoning and intuition may
actually be more important. I will also be optimizing and correcting some of
the factors that seem less appropriate in the original design.

I am working on the improvement; if you have any thoughts, please kindly let
me know.


   



 MIME type selection with probability
 

 Key: TIKA-1517
 URL: https://issues.apache.org/jira/browse/TIKA-1517
 Project: Tika
  Issue Type: Improvement
  Components: mime

[jira] [Comment Edited] (TIKA-1517) MIME type selection with probability

2015-04-01 Thread Luke sh (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391476#comment-14391476
 ] 

Luke sh edited comment on TIKA-1517 at 4/1/15 10:39 PM:


After some research, it looks like the algorithm design for probabilistic MIME
type selection causes some confusion when compared with Naive Bayes. The idea
is borrowed from Naive Bayes, but giving up some of its properties makes the
design confusing and less intuitive, even though it saves some computation.


One of the problems I have been considering is that when the magic test fails
to determine a type, it returns byte-stream, and the original design treats
that byte-stream as a decision. So, for example, when the extension test
returns a correct type (e.g. GRB) but the magic test returns byte-stream,
meaning it failed to detect the type, the question is whether we should take
byte-stream as part of the decision. After thinking about this for some time
this week, I have decided to ignore a byte-stream prediction from a test when
taking the vote on which type should be used.
e.g.
magic test: Byte-Stream
extension test: GRB
meta hint: none (Byte-Stream)
In the original design, the system is expected to return Byte-Stream as the
final decision for the type detection when the magic test trust values (i.e.
the presumed conditional probabilities) are set high.

Secondly, I tend to think that giving up on the prior might not be a good
idea: it simplifies the computation a bit, but it causes some confusion and is
less intuitive.
The intuition is that we can probably treat a detected type as a cause whose
prior probability of being correct is 50%, i.e. there is a 50% chance that the
detected type is correct. This seems more intuitive than completely ignoring
the prior.

After thinking about this problem for a while, it seems there is still some
room for correction and optimization.
The original design is intertwined with computational considerations, which
seems to cause some confusion, but the causal reasoning and intuition may
actually be more important. I will also be optimizing and correcting some of
the factors that seem less appropriate in the original design.

I am working on the improvement; if you have any thoughts, please kindly let
me know.


   



 MIME type selection with probability
 

 Key: TIKA-1517
 URL: https://issues.apache.org/jira/browse/TIKA-1517
 Project: Tika
  Issue Type: Improvement
  Components: 

[jira] [Comment Edited] (TIKA-1517) MIME type selection with probability

2015-04-01 Thread Luke sh (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391476#comment-14391476
 ] 

Luke sh edited comment on TIKA-1517 at 4/1/15 10:39 PM:


After some research, it looks like the algorithm design for probabilistic MIME
type selection causes some confusion when compared with Naive Bayes. The idea
is borrowed from Naive Bayes, but giving up some of its properties makes the
design confusing and less intuitive, even though it saves some computation.


One of the problems I have been considering is that when the magic test fails
to determine a type, it returns byte-stream, and the original design treats
that byte-stream as a decision. So, for example, when the extension test
returns a correct type (e.g. GRB) but the magic test returns byte-stream,
meaning it failed to detect the type, the question is whether we should take
byte-stream as part of the decision. After thinking about this for some time
this week, I have decided to ignore a byte-stream prediction from a test when
taking the vote on which type should be used.
e.g.
magic test: Byte-Stream (failed to detect the type)
extension test: GRB
meta hint: none (Byte-Stream)
In the original design, the system is expected to return Byte-Stream as the
final decision for the type detection when the magic test trust values (i.e.
the presumed conditional probabilities) are set high.

Secondly, I tend to think that giving up on the prior might not be a good
idea: it simplifies the computation a bit, but it causes some confusion and is
less intuitive.
The intuition is that we can probably treat a detected type as a cause whose
prior probability of being correct is 50%, i.e. there is a 50% chance that the
detected type is correct. This seems more intuitive than completely ignoring
the prior.

After thinking about this problem for a while, it seems there is still some
room for correction and optimization.
The original design is intertwined with computational considerations, which
seems to cause some confusion, but the causal reasoning and intuition may
actually be more important. I will also be optimizing and correcting some of
the factors that seem less appropriate in the original design.

I am working on the improvement; if you have any thoughts, please kindly let
me know.


   



 MIME type selection with probability
 

 Key: TIKA-1517
 URL: https://issues.apache.org/jira/browse/TIKA-1517
 Project: Tika
  Issue Type: Improvement
  

[jira] [Comment Edited] (TIKA-1517) MIME type selection with probability

2015-04-01 Thread Luke sh (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391476#comment-14391476
 ] 

Luke sh edited comment on TIKA-1517 at 4/1/15 10:41 PM:


After some research, it looks like the algorithm design for probabilistic MIME
type selection causes some confusion when compared with Naive Bayes. The idea
is borrowed from Naive Bayes, but giving up some of its properties makes the
design confusing and less intuitive, even though it saves some computation.


One of the problems I have been considering is that when the magic test fails
to determine a type, it returns byte-stream, and the original design treats
that byte-stream as a decision. So, for example, when the extension test
returns a correct type (e.g. GRB) but the magic test returns byte-stream,
meaning it failed to detect the type, the question is whether we should take
byte-stream as part of the decision. After thinking about this for some time
this week, I have decided to ignore a byte-stream prediction from a test when
taking the vote on which type should be used.
e.g.
magic test: Byte-Stream (failed to detect the type)
extension test: GRB
meta hint: none (Byte-Stream)
In the original design, the system is expected in this case to return
Byte-Stream as the final decision for the type detection when the magic test
trust values (i.e. the presumed conditional probabilities) are set high.

Secondly, I tend to think that giving up on the prior might not be a good
idea: it simplifies the computation a bit, but it causes some confusion and is
less intuitive.
The intuition is that we can probably treat a detected type as a cause whose
prior probability of being correct is 50%, i.e. there is a 50% chance that the
detected type is correct. This seems more intuitive than completely ignoring
the prior.

After thinking about this problem for a while, it seems there is still some
room where the algorithm can be improved and optimized.
The original design is intertwined with computational considerations, which
seems to cause some confusion, but the causal reasoning and intuition may
actually be more important. I will also be optimizing and correcting some of
the factors that seem less appropriate in the original design.

I am working on the improvement; if you have any thoughts, please kindly let
me know.


   



 MIME type selection with probability
 

 Key: TIKA-1517
 URL: 

[jira] [Comment Edited] (TIKA-1517) MIME type selection with probability

2015-04-01 Thread Luke sh (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391476#comment-14391476
 ] 

Luke sh edited comment on TIKA-1517 at 4/1/15 10:39 PM:


After some research, it looks like the algorithm design with probabilistic mime 
type selection seems to cause some confusion vs Naive Bayesian, the idea is 
borrowed from Naive Bayesian, but it turns out giving up on some properties of 
Naive Bayesian seems to cause some confusions and therefore less intuitive 
although some computations are skimped.


One of the problems i have been considering is that when magic test fails to 
determine a type, it returns byte-stream, in the original design, the 
original design approaches this by taking byte-stream as a decision, so e.g. 
when the extension test returns a correct type (e.g. GRB) but the magic test 
returns the byte-stream meaning it fails to detect the type, the question is 
whether we should take byte-stream as part of the decisions. After thinking 
about this this week for some time, i decide to ignore this byte-stream 
predicted by a test when taking the vote for which type should be used.
e.g. 
magic test : Byte-Stream (failed to detect the type)
extension test: GRB
meta hint: none(Byte-Stream)
The original design of the system in this case is expected to return 
Byte-Stream as the final decision for the type detection when magic test trust 
values(i.e. the presumed conditional probabilities) are set to high.

Secondly, I tend to think that giving up on the prior might not be a good 
idea. Dropping it simplifies the computation a bit, but it also makes the 
design more confusing and less intuitive.
The intuition is that we can probably treat a detected type as a cause whose 
prior probability of being correct is 50%, i.e. there is a 50% chance that the 
detected type is correct. This seems more intuitive than ignoring the prior 
completely.
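
As a worked illustration of the 50% prior: assume the tests are conditionally 
independent and that a test with trust value t reports the detected type when 
it is actually wrong with probability 1 - t. Both are assumptions made for 
this example, not part of the original design. The trust values 0.75 and 0.65 
are the defaults proposed in the issue description. The posterior for a 
detected type T is then:

```latex
% H: the detected type T is correct, with prior P(H) = 0.5.
% t_i: trust value of test i, read as P(test_i reports T | H).
% Assumption for this example: P(test_i reports T | not H) = 1 - t_i.
P(H \mid \text{tests}) =
  \frac{P(H)\prod_i t_i}{P(H)\prod_i t_i + (1 - P(H))\prod_i (1 - t_i)}

% One test with t_1 = 0.75:
P(H \mid \text{test}_1)
  = \frac{0.5 \cdot 0.75}{0.5 \cdot 0.75 + 0.5 \cdot 0.25} = 0.75

% Two agreeing tests with t_1 = 0.75 and t_2 = 0.65:
P(H \mid \text{test}_1, \text{test}_2)
  = \frac{0.5 \cdot 0.75 \cdot 0.65}
         {0.5 \cdot 0.75 \cdot 0.65 + 0.5 \cdot 0.25 \cdot 0.35}
  = \frac{0.24375}{0.2875} \approx 0.848
```

Under these assumptions a single test leaves the posterior equal to its trust 
value, and agreement between tests raises it.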

After thinking about this problem for a while, it seems there is still some 
room for correction and optimization.
The original design is intertwined with considerations about computation, 
which seems to cause some confusion, but the causal reasoning and intuition 
are probably more important. I will also be optimizing and correcting some of 
the factors that seem less appropriate in the original design.

I am working on the improvement; if you have any thoughts, please let me know.


   


was (Author: lukeliush):
After some research, it looks like the algorithm design for probabilistic MIME 
type selection causes some confusion when compared with Naive Bayes. The idea 
is borrowed from Naive Bayes, but giving up some properties of Naive Bayes 
makes the design confusing and less intuitive, even though it saves some 
computation.


One of the problems I have been considering is that when the magic test fails 
to determine a type, it returns byte-stream. The original design treats 
byte-stream as a decision like any other. So when, for example, the extension 
test returns a correct type (e.g. GRB) but the magic test returns byte-stream, 
meaning it failed to detect the type, the question is whether byte-stream 
should count as one of the decisions. After thinking about this for some time 
this week, I have decided to ignore a byte-stream result predicted by a test 
when taking the vote for which type should be used.
e.g. 
magic test: Byte-Stream (failed to detect the type)
extension test: GRB
meta hint: none (Byte-Stream)
The original design is expected to return Byte-Stream as the final decision 
for the type detection when the magic test trust values (i.e. the presumed 
conditional probabilities) are set high.

Secondly, I tend to think that giving up on the prior might not be a good 
idea. Dropping it simplifies the computation a bit, but it also makes the 
design more confusing and less intuitive.
The intuition is that we can probably treat a detected type as a cause whose 
prior probability of being correct is 50%, i.e. there is a 50% chance that the 
detected type is correct. This seems more intuitive than ignoring the prior 
completely.

After thinking about this problem for a while, it seems there is still some 
room for correction and optimization.
The original design is intertwined with considerations about computation, 
which seems to cause some confusion, but the causal reasoning and intuition 
are probably more important. I will also be optimizing and correcting some of 
the factors that seem less appropriate in the original design.

I am working on the improvement; if you have any thoughts, please let me know.


   

 MIME type selection with probability
 

 Key: TIKA-1517
 URL: https://issues.apache.org/jira/browse/TIKA-1517
 Project: 

[jira] [Updated] (TIKA-1517) MIME type selection with probability

2015-04-01 Thread Luke sh (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke sh updated TIKA-1517:
--
Priority: Trivial  (was: Major)

 MIME type selection with probability
 

 Key: TIKA-1517
 URL: https://issues.apache.org/jira/browse/TIKA-1517
 Project: Tika
  Issue Type: Improvement
  Components: mime
Affects Versions: 0.1-incubating, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 
 0.10, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6
Reporter: Luke sh
Priority: Trivial
 Attachments: BaysianTest.java


 Improvement and intuition
 The original implementation of MIME type selection/detection is inflexible 
 by design, as it relies heavily on the outcome of magic-bytes MIME type 
 identification. If magic-bytes detection is applicable to a file, Tika 
 follows the file type it detects. It may be better to give users more 
 control over which method is used.
 The proposed approach borrows from Bayes' theorem: users assign a weight, 
 expressed as a probability, to each identification method, which gives them 
 control over which of the MIME type identification methods available in Tika 
 they prefer. Currently there are 3 methods for identifying the MIME type in 
 Tika (i.e. magic bytes, file extension and the metadata content-type hint). 
 With these weights users can choose which method they trust most, although 
 the magic-bytes method is often the most trustworthy. The benefit is that in 
 some situations file type identification must be strict: some users might 
 want all of the identification methods to agree on the same file type before 
 they start processing files, because incorrect identification is less 
 tolerable for them. The current implementation is not flexible enough for 
 this purpose and relies heavily on the magic-bytes method (although magic 
 bytes is the most reliable of the three).
 Proposed design:
 The idea is to attach a probability, used as a weight, to each MIME type 
 identification method currently implemented in Tika (the magic-bytes 
 approach, file extension match and metadata content-type hint).
 For example, as a user I would probably like to assign a preference to each 
 method based on how much I trust it, and to order the results when they 
 don't coincide.
 Bayes' rule seems appropriate here to match this intuition.
 The following are needed for an implementation of Bayes' rule.
  Prior probability P(file_type), e.g. P(pdf). In theory this is computed 
  from samples, and it depends on the domain or use case. For example, if we 
  happen to crawl a website that contains mostly pdf files, we can collect 
  some samples and compute the prior; if the samples show that 90% of the 
  docs are pdf, the prior is P(pdf) = 0.9. Here we propose to make the prior 
  a configurable param for users, and by default we leave the prior as not 
  applicable. Alternatively, we could define the prior for each file type to 
  be 1/[number of supported file types in Tika]; I think that would be 
  approximately 1/1157, and using this number seems fairer. The reason for 
  avoiding it is that this prior is the same for every type, and in the end 
  we care more about the order of the results than about the actual numbers. 
  A fixed prior scales every result by the same factor, so bringing 1/1157 
  into the Bayesian equation cannot affect the order and only burdens the 
  implementation with extra computation. We therefore leave it as not 
  applicable, which means we assign 1 to it, as if it did not exist. The 
  param remains configurable, and we believe this provides a lot of 
  flexibility in some use cases.
  Conditional probability of a positive test given a file type, P(test | 
  file_type), e.g. P(test1 = pdf | pdf). This probability also depends on a 
  collection of samples and on the domain or use case, so we leave it 
  configurable. Based on our intuition that test1 (i.e. the magic-bytes 
  method) is the most trustworthy, the default value for P(test1 = 
  a_file_type | a_file_type) is 0.75; that is, given a file whose type is 
  a_file_type, the probability that test1 predicts a_file_type is 0.75. 
  Since we trust test1 most, we next propose to use 0.7 for test3 and 0.65 
  for test2.
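
The two ingredients above can be sketched as a small ranking routine. This is 
an illustration under stated assumptions, not the attached BaysianTest.java: 
the class and method names are hypothetical, the tests are treated as 
independent, and a test that reports a different type than the candidate is 
assumed to contribute (1 - trust). The trust values are the defaults quoted 
above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class BayesRankSketch {
    // Default trust values quoted above: test1 = 0.75, test2 = 0.65, test3 = 0.7.
    static final double[] TRUST = {0.75, 0.65, 0.70};

    // Unnormalized score per candidate type:
    //   prior * product over tests of P(test_i result | candidate).
    // Assumption: a test reporting another type contributes (1 - trust_i).
    static Map<String, Double> scores(String[] predictions, double prior) {
        Map<String, Double> out = new LinkedHashMap<>();
        for (String candidate : predictions) {
            if (out.containsKey(candidate)) {
                continue; // already scored this type
            }
            double s = prior;
            for (int i = 0; i < predictions.length; i++) {
                s *= candidate.equals(predictions[i]) ? TRUST[i] : 1.0 - TRUST[i];
            }
            out.put(candidate, s);
        }
        return out;
    }

    // Candidate types ordered from highest to lowest score.
    static List<String> ranking(Map<String, Double> scores) {
        List<String> types = new ArrayList<>(scores.keySet());
        types.sort((a, b) -> Double.compare(scores.get(b), scores.get(a)));
        return types;
    }
}
```

Calling ranking(scores(p, 1.0)) and ranking(scores(p, 1.0 / 1157)) on the same 
predictions p gives the same order, since the uniform prior multiplies every 
score by the same factor; that is the reason given above for leaving it out 
by default.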
 (note again, test1 = magic-bytes, test2