[jira] [Updated] (TIKA-1590) A particular PDF seems to trigger an infinite loop when being converted to HTML
[ https://issues.apache.org/jira/browse/TIKA-1590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konstantin Gribov updated TIKA-1590:

Fix Version/s: 1.8

A particular PDF seems to trigger an infinite loop when being converted to HTML
---

Key: TIKA-1590
URL: https://issues.apache.org/jira/browse/TIKA-1590
Project: Tika
Issue Type: Bug
Affects Versions: 1.6, 1.7
Reporter: Matt Sheppard
Fix For: 1.8
Attachments: National_Audit_tool_CTH_Audit_Report_PDF,_292_KB.pdf, jstack.txt

The PDF at http://www.comcare.gov.au/__data/assets/pdf_file/0019/117244/National_Audit_tool_CTH_Audit_Report_PDF,_292_KB.pdf (which I'll also attach) appears to trigger an infinite loop (or at least is exceedingly slow) when being filtered by Tika.

{noformat}
java -jar tika-app-1.7.jar National_Audit_tool_CTH_Audit_Report_PDF,_292_KB.pdf
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="date" content="2015-02-05T04:48:30Z"/>
<meta name="pdf:PDFVersion" content="1.6"/>
<meta name="xmp:CreatorTool" content="Adobe InDesign CC 2014 (Macintosh)"/>
<meta name="dc:description" content="Licensee Improvement"/>
<meta name="Keywords" content="Licensee, Improvement, Program, NAT, CTH, Report&#13;&#10;"/>
<meta name="subject" content="Licensee Improvement"/>
<meta name="dc:creator" content="Comcare"/>
<meta name="description" content="Licensee Improvement"/>
<meta name="dcterms:created" content="2014-10-07T02:46:10Z"/>
<meta name="Last-Modified" content="2015-02-05T04:48:30Z"/>
<meta name="dcterms:modified" content="2015-02-05T04:48:30Z"/>
<meta name="dc:format" content="application/pdf; version=1.6"/>
<meta name="Last-Save-Date" content="2015-02-05T04:48:30Z"/>
<meta name="meta:save-date" content="2015-02-05T04:48:30Z"/>
<meta name="pdf:encrypted" content="false"/>
<meta name="dc:title" content="Licensee Improvement Program NAT (CTH) audit report"/>
<meta name="modified" content="2015-02-05T04:48:30Z"/>
<meta name="cp:subject" content="Licensee Improvement"/>
<meta name="Content-Length" content="299338"/>
<meta name="Content-Type" content="application/pdf"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.pdf.PDFParser"/>
<meta name="creator" content="Comcare"/>
<meta name="meta:author" content="Comcare"/>
<meta name="dc:subject" content="Licensee, Improvement, Program, NAT, CTH, Report&#13;&#10;"/>
<meta name="trapped" content="False"/>
<meta name="meta:creation-date" content="2014-10-07T02:46:10Z"/>
<meta name="created" content="Tue Oct 07 13:46:10 AEDT 2014"/>
<meta name="xmpTPg:NPages" content="72"/>
<meta name="Creation-Date" content="2014-10-07T02:46:10Z"/>
<meta name="resourceName" content="National_Audit_tool_CTH_Audit_Report_PDF,_292_KB.pdf"/>
<meta name="meta:keyword" content="Licensee, Improvement, Program, NAT, CTH, Report&#13;&#10;"/>
<meta name="Author" content="Comcare"/>
<meta name="producer" content="Adobe PDF Library 11.0"/>
<title>Licensee Improvement Program NAT (CTH) audit report</title>
</head>
<body><div class="page"><p/>
<p>LICENSEE IMPROVEMENT PROGRAM
[snip]
</p>
<p>Finding: </p>
<p>Evidence: </p>
<p>Comment: </p>
<p>Observation: </p>
<p>Non-conformance: </p>
<p>
[just appears to hang forever at this point]
{noformat}

The relevant thread's stack is something like...

{noformat}
"main" #1 prio=5 os_prio=31 tid=0x7fbd6900b000 nid=0xf07 runnable [0x00010fc18000]
   java.lang.Thread.State: RUNNABLE
	at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:184)
	at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
	at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
	at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
	at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
	at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
	at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
	at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
	at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
	at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
	at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
	at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
	at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
	at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:179)
	at org.apache.pdfbox.pdmodel.interactive.form.PDFieldFactory.isButton(PDFieldFactory.java:157)
	at ...
{noformat}
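For anyone who has to process untrusted PDFs and cannot rule out a hang like the one reported above, one generic caller-side mitigation is to bound the parse with a timeout. The following is only a sketch under stated assumptions, not a fix for the underlying PDFBox behaviour: `BoundedParse` and `runWithTimeout` are hypothetical names, and the `Callable` is a stand-in for whatever code actually invokes the parser.

```java
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper (not part of Tika or PDFBox) that bounds a parse call. */
final class BoundedParse {
    private BoundedParse() {}

    /**
     * Runs the task on a daemon worker thread and waits at most timeoutMillis.
     * Returns an empty Optional on timeout (or if the task itself returns null).
     */
    static <T> Optional<T> runWithTimeout(Callable<T> task, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor(runnable -> {
            Thread worker = new Thread(runnable, "bounded-parse");
            worker.setDaemon(true); // a stuck worker must not keep the JVM alive
            return worker;
        });
        Future<T> future = executor.submit(task);
        try {
            return Optional.ofNullable(future.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            future.cancel(true); // only interrupts; CPU-bound code may ignore it
            return Optional.empty();
        } catch (ExecutionException e) {
            throw new IllegalStateException("Task failed", e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new IllegalStateException("Interrupted while waiting", e);
        } finally {
            executor.shutdownNow();
        }
    }
}
```

Note that cancelling a Java task only sets the interrupt flag. A tight loop or recursion that never checks for interruption keeps burning CPU in the background, so a truly robust setup would run the parse in a separate process that can be killed.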
[jira] [Comment Edited] (TIKA-1590) A particular PDF seems to trigger an infinite loop when being converted to HTML
[ https://issues.apache.org/jira/browse/TIKA-1590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390170#comment-14390170 ]

Konstantin Gribov edited comment on TIKA-1590 at 4/1/15 7:53 AM:

Fixed in trunk by update of pdfbox to 1.8.9. See also TIKA-1575 and PDFBOX-2261.

was (Author: grossws):
Fixed in trunk by update of pdfbox to 1.8.9. See also TIKA-1575 and PDFBOX-2710.
[jira] [Commented] (TIKA-1590) A particular PDF seems to trigger an infinite loop when being converted to HTML
[ https://issues.apache.org/jira/browse/TIKA-1590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390177#comment-14390177 ]

Konstantin Gribov commented on TIKA-1590:

Thank you for the feedback, Matt. I think it's the same problem as in PDFBOX-2261. It's currently fixed in trunk.
Re: Access Control Allow Origin
Hi Tyler

Sorry for the delay, I was off for the last few days.

The change you made looks fine; the filter can check the annotations or can be configured directly (which is what you did). It might make sense to consider checking a (Java) properties resource as a possible future enhancement, as a CORS filter may have many properties. Maybe if '-cors' is provided, check a well-known class resource where all of the CORS properties are set; if it is absent, default to '*', otherwise work with the Properties...

The current approach works too; it might be tricky to extend it to support more properties, but it is great for a start.

Thanks, Sergey

On 27/03/15 18:56, Tyler Palsulich wrote:

Thank you, Sergey! I didn't know about that feature. I am going to try to work up a patch this weekend which enables CORS. I'll let you know if I run into any issues.

Thanks again,
Tyler

On Thu, Mar 26, 2015 at 2:39 AM, Mattmann, Chris A (3980) <chris.a.mattm...@jpl.nasa.gov> wrote:

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++

-Original Message-
From: Tyler Palsulich <tpalsul...@gmail.com>
Reply-To: dev@tika.apache.org
Date: Tuesday, March 24, 2015 at 3:41 PM
To: dev@tika.apache.org
Subject: Access Control Allow Origin

Hi Folks,

I took a stab at creating an example website to submit a file to the form resource of our VM. See http://tpalsulich.github.io/TikaExamples/. If I try to use AJAX to submit the request to make the page prettier (see the script in the head of the page, with ev.preventDefault() commented out), I get the following error:

XMLHttpRequest cannot load http://162.242.228.174:9998/tika/form. No 'Access-Control-Allow-Origin' header is present on the requested resource. Origin 'http://tpalsulich.github.io' is therefore not allowed access. The response had HTTP status code 400.

We can't allow the tika-server response header to accept * in general, since that isn't secure. So, would there be interest in including this sort of site on the VM? Then, the AJAX request won't be external and we won't have this error.

The version button just takes you to the version resource on the VM (doesn't do anything with the file).

Tyler
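The concern in the original message (a blanket `*` is not acceptable, but the example site's origin should be let through) can be illustrated with an allow-list check. This is a minimal sketch in plain Java, not the CXF filter discussed in the thread and not tika-server code; `CorsAllowList` and `allowOriginHeader` are hypothetical names introduced here for illustration.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;

/** Hypothetical allow-list; decides what to send in Access-Control-Allow-Origin. */
final class CorsAllowList {
    private final Set<String> allowedOrigins;

    CorsAllowList(Set<String> allowedOrigins) {
        // Defensive copy so later changes to the caller's set have no effect.
        this.allowedOrigins = Collections.unmodifiableSet(new HashSet<>(allowedOrigins));
    }

    /**
     * Returns the header value to echo back for the given request Origin,
     * or empty if the header should be omitted (origin missing or not allowed).
     */
    Optional<String> allowOriginHeader(String requestOrigin) {
        if (requestOrigin != null && allowedOrigins.contains(requestOrigin)) {
            return Optional.of(requestOrigin);
        }
        return Optional.empty();
    }
}
```

With `http://tpalsulich.github.io` in the set, a request from that origin would get its own origin echoed back, while any other origin would get no header and the browser would keep blocking the response. A server that echoes origins this way would normally also send `Vary: Origin` so caches do not mix responses.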
[jira] [Commented] (TIKA-1590) A particular PDF seems to trigger an infinite loop when being converted to HTML
[ https://issues.apache.org/jira/browse/TIKA-1590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390469#comment-14390469 ]

Tim Allison commented on TIKA-1590:

Not that this is needed, but I doubly confirmed that this file no longer causes a hang with Tika trunk and PDFBox 1.8.9. Many thanks to [~tilman] and [~msahyoun] for fixing this!
[jira] [Resolved] (TIKA-1590) A particular PDF seems to trigger an infinite loop when being converted to HTML
[ https://issues.apache.org/jira/browse/TIKA-1590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konstantin Gribov resolved TIKA-1590.

Resolution: Duplicate

Fixed in trunk by update of pdfbox to 1.8.9. See also TIKA-1575 and PDFBOX-2710.
Re: Access Control Allow Origin
Thank you for the feedback! I think there's an issue (don't remember the number) to be able to specify a TikaConfig file for tika-server. So, I think that would be the ideal place to put more complex CORS configuration.

Tyler

On Wed, Apr 1, 2015 at 6:02 AM, Sergey Beryozkin <sberyoz...@gmail.com> wrote:
[snip]
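The properties-resource idea raised in this thread (look for a well-known classpath resource, fall back to '*' when it is absent) might look roughly like the sketch below. Everything named here is an assumption made for illustration: the resource name `/tika-server-cors.properties`, the key `cors.allowOrigins`, and the class `CorsSettings` do not come from Tika or CXF. The fallback to '*' when the key is missing from an existing file is also an assumption.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical loader for CORS settings kept in a classpath properties file. */
final class CorsSettings {
    static final String RESOURCE = "/tika-server-cors.properties"; // assumed name
    static final String ALLOW_ORIGINS_KEY = "cors.allowOrigins";   // assumed key
    static final String ANY_ORIGIN = "*";

    private CorsSettings() {}

    /** Loads the resource from the classpath; returns null when it is absent. */
    static Properties loadFromClasspath() {
        try (InputStream in = CorsSettings.class.getResourceAsStream(RESOURCE)) {
            if (in == null) {
                return null; // no resource on the classpath
            }
            Properties props = new Properties();
            props.load(in);
            return props;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Falls back to '*' when there is no resource or the key is not set. */
    static String allowOrigins(Properties propsOrNull) {
        if (propsOrNull == null) {
            return ANY_ORIGIN;
        }
        return propsOrNull.getProperty(ALLOW_ORIGINS_KEY, ANY_ORIGIN);
    }
}
```

A caller would combine the two as `CorsSettings.allowOrigins(CorsSettings.loadFromClasspath())`, and only after the CORS option has been switched on explicitly, since defaulting to '*' is the behaviour the original message describes as not secure in general.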
Re: Access Control Allow Origin
I'll change the option to -C right now. Just looked closer -- TIKA-1426 is to provide a config for the server and app on the command line.

Tyler

On Wed, Apr 1, 2015 at 11:22 AM, Allison, Timothy B. <talli...@mitre.org> wrote:

Might be thinking of TIKA-944? Mind if we switch the CORS short option to -C and use -c for the tika config file?

[snip]
RE: Access Control Allow Origin
Might be thinking of TIKA-944? Mind if we switch the CORS short option to -C and use -c for the tika config file? -Original Message- From: Tyler Palsulich [mailto:tpalsul...@gmail.com] Sent: Wednesday, April 01, 2015 11:13 AM To: dev@tika.apache.org Subject: Re: Access Control Allow Origin Thank you for the feedback! I think there's an issue (don't remember the number) to be able to specify a TikaConfig file for tika-server. So, I think that would be the ideal place to put more complex CORS configuration. Tyler On Wed, Apr 1, 2015 at 6:02 AM, Sergey Beryozkin sberyoz...@gmail.com wrote: Hi Tyler Sorry for a delay, I was off for the last few days, The change you did looks fine, the filter can check the annotations or can be configured directly (which is what you did). It might make sense to consider checking a (Java) properties resource as a possible future enhancement, as a CORS filter may have many properties, May be if a '-cors' is provided then check a well-known class resource where all of the cors properties are set, if it is absent - default to '*' otherwise work with Properties... The current approach works too, might be tricky to extend it to support more properties but great for a start Thanks, Sergey On 27/03/15 18:56, Tyler Palsulich wrote: Thank you, Sergey! I didn't know about that feature. I am going to try to work up a patch this weekend which enables CORS. I'll let you know if I run into any issues. Thanks again, Tyler On Thu, Mar 26, 2015 at 2:39 AM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: ++ Chris Mattmann, Ph.D. 
Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop: 168-527 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Associate Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -Original Message- From: Tyler Palsulich tpalsul...@gmail.com Reply-To: dev@tika.apache.org dev@tika.apache.org Date: Tuesday, March 24, 2015 at 3:41 PM To: dev@tika.apache.org dev@tika.apache.org Subject: Access Control Allow Origin Hi Folks, I took a stab at creating an example website to submit a file to the form resource of our VM. See http://tpalsulich.github.io/TikaExamples/. If I try to use AJAX to submit the request to make the page prettier (see the script in the head of the page (with ev.preventDefault() commented out), I get the following error: XMLHttpRequest cannot load http://162.242.228.174:9998/tika/form. No 'Access-Control-Allow-Origin' header is present on the requested resource. Origin 'http://tpalsulich.github.io' is therefore not allowed access. The response had HTTP status code 400. We can't allow the tika-server response header to accept * in general, since that isn't secure. So, would there be interest in including this sort of site on the VM? Then, the AJAX request won't be external and we won't have this error. The version button just takes you to the version resource on the VM (doesn't do anything with the file). Tyler
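Sergey's suggestion in this thread (when '-cors' is given, look for a well-known properties resource and fall back to '*' if it is absent) could be sketched roughly as below. This is only an illustration, not what tika-server does: the class name CorsSettings, the resource name /cors.properties and the key allowedOrigins are all made up for the example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

final class CorsSettings {
    // Hypothetical resource name and key; neither is defined by Tika or CXF.
    static final String RESOURCE = "/cors.properties";
    static final String ORIGINS_KEY = "allowedOrigins";
    static final String DEFAULT_ORIGINS = "*";

    /** Returns the configured origins, or "*" when props is null or the key is missing/blank. */
    static String allowedOrigins(Properties props) {
        if (props == null) {
            return DEFAULT_ORIGINS;
        }
        String value = props.getProperty(ORIGINS_KEY);
        return (value == null || value.trim().isEmpty()) ? DEFAULT_ORIGINS : value.trim();
    }

    /** Loads the class-path resource if it exists; returns null when it is absent. */
    static Properties loadFromClasspath() throws IOException {
        try (InputStream in = CorsSettings.class.getResourceAsStream(RESOURCE)) {
            if (in == null) {
                return null;
            }
            Properties props = new Properties();
            props.load(in);
            return props;
        }
    }

    public static void main(String[] args) throws IOException {
        // With no cors.properties on the class path this falls back to the default.
        System.out.println(allowedOrigins(loadFromClasspath()));
    }
}
```

Keeping the fallback in one small method means further CORS properties could be added later without touching the command-line parsing.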
[jira] [Comment Edited] (TIKA-1585) Create Example Website with Form Submission
[ https://issues.apache.org/jira/browse/TIKA-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390841#comment-14390841 ] Tyler Palsulich edited comment on TIKA-1585 at 4/1/15 3:51 PM: --- Done. It works. -I'll see if I can shut 9997 down right now.- Port 9997 is now closed. was (Author: tpalsulich): Done. It works. I'll see if I can shut 9997 down right now. Create Example Website with Form Submission --- Key: TIKA-1585 URL: https://issues.apache.org/jira/browse/TIKA-1585 Project: Tika Issue Type: New Feature Components: example, server Reporter: Tyler Palsulich Assignee: Tyler Palsulich It would be great to have a website where we can direct people who ask what Tika can do for [filetype] without needing them to actually download Tika. Some initial work to do that is [here|http://tpalsulich.github.io/TikaExamples/]. I'm far from a design guru, but I imagine the site as having a form where you can upload a file at the top, checkboxes for if you want metadata, content, or both, and a submit button. The request should be sent with AJAX and the result should populate a {{div}}. One issue with AJAX requests is that Tika Server doesn't currently allow Cross-Origin-Resource-Sharing (CORS). So, we either need to maintain a slightly updated tika-server, or update the server to allow configuration. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: [jira] [Commented] (TIKA-1330) Add robust tika-batch code
All tests are passing. Only issue I see is excessive logging. The Hudson failure does just look like a hiccup. Tyler On Wed, Apr 1, 2015 at 2:55 PM, Allison, Timothy B. talli...@mitre.org wrote: This looks like a Hudson hiccup. Tyler is seeing excessive logging: Running org.apache.tika.cli.TikaCLIBatchIntegrationTest INFO - about to start driver INFO - about to start driver Anyone else having problems building from a fresh trunk? -Original Message- From: Hudson (JIRA) [mailto:j...@apache.org] Sent: Wednesday, April 01, 2015 5:36 PM To: dev@tika.apache.org Subject: [jira] [Commented] (TIKA-1330) Add robust tika-batch code [ https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391512#comment-14391512 ] Hudson commented on TIKA-1330: -- ABORTED: Integrated in tika-trunk-jdk1.7 #596 (See [ https://builds.apache.org/job/tika-trunk-jdk1.7/596/]) TIKA-1330 flush stacktrace writers (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670756) * /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/FileResourceConsumer.java * /tika/trunk/tika-batch/src/main/java/org/apache/tika/util/TikaExceptionFilter.java TIKA-1330 clean up logging in tika-batch ant tika-app integration of tika-batch, take 2 (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670751) * /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/FileResourceConsumer.java Add robust tika-batch code -- Key: TIKA-1330 URL: https://issues.apache.org/jira/browse/TIKA-1330 Project: Tika Issue Type: Sub-task Components: cli, general, server Reporter: Tim Allison Assignee: Tim Allison Attachments: TIKA-1330v1-patch.zip In my current design plan, I see creating a separate component tika-batch that includes a small bit of configurable code to run Tika against a large batch of documents. This code should be robust against OOM and hangs, and it should have fairly robust logging. 
[jira] [Resolved] (TIKA-1591) Tika Parsers uses wrong version of bouncycastle
[ https://issues.apache.org/jira/browse/TIKA-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Gribov resolved TIKA-1591. - Resolution: Fixed Fix Version/s: 1.8 Updated in r1670802 Tika Parsers uses wrong version of bouncycastle --- Key: TIKA-1591 URL: https://issues.apache.org/jira/browse/TIKA-1591 Project: Tika Issue Type: Bug Reporter: Ben McCann Assignee: Konstantin Gribov Fix For: 1.8 Tika uses:
    <dependency>
      <groupId>org.bouncycastle</groupId>
      <artifactId>bcmail-jdk15</artifactId>
      <version>1.45</version>
    </dependency>
It's not recommended to use bcmail-jdk15, which is the old artifact name. Instead bcmail-jdk15on should be used (latest version is 1.52) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TIKA-1517) MIME type selection with probability
[ https://issues.apache.org/jira/browse/TIKA-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391476#comment-14391476 ] Luke sh edited comment on TIKA-1517 at 4/2/15 1:30 AM: --- After some research, it looks like the algorithm design with probabilistic mime type selection seems to cause some confusion vs Naive Bayesian, the idea is borrowed from Naive Bayesian, but it turns out giving up on some properties of Naive Bayesian seems to cause some confusions and therefore less intuitive although some computations are skimped. One of the problems i have been considering is that when magic test fails to determine a type, it returns byte-stream, in the original design, the original design approaches this by taking byte-stream as a decision, so e.g. when the extension test returns a correct type (e.g. GRB) but the magic test returns the byte-stream meaning it fails to detect the type, the question is whether we should take byte-stream as part of the decisions. After thinking about this this week for some time, i decide to ignore this byte-stream predicted by a test when taking the vote for which type should be used. e.g. magic test : Byte-Stream (failed to detect the type) extension test: GRB meta hint: none(Byte-Stream) The original design of the system in this case is expected to return Byte-Stream as the final decision for the type detection when magic test trust values(i.e. the presumed conditional probabilities) are set to high. secondly, after thinking for a while, i tend to think giving up on the prior might not be a good idea even though it simplifies a bit the computation, but that causes a bit of confusion and less intuitive. The intuition is that we probably can treat a detected type as a cause whose prior is 50% percent of correctness, i.e. 50% of chance that the detected type is correct, this seems to be more intuitive than completely ignoring the prior. 
After thinking about this problem for a while, it seems there are still some space where the algorithm can improved and optimized. Original design is intertwined with the consideration on computations which seems to cause some confusion, but actually the causal reasoning and intuition might be a bit more important, i will also be optimizing and correcting some of the factors which seems to be less appropriate in the original design. I am working on the improvement and conducting the research with pros and cons; if any thoughts, please kindly let me know. was (Author: lukeliush): After some research, it looks like the algorithm design with probabilistic mime type selection seems to cause some confusion vs Naive Bayesian, the idea is borrowed from Naive Bayesian, but it turns out giving up on some properties of Naive Bayesian seems to cause some confusions and therefore less intuitive although some computations are skimped. One of the problems i have been considering is that when magic test fails to determine a type, it returns byte-stream, in the original design, the original design approaches this by taking byte-stream as a decision, so e.g. when the extension test returns a correct type (e.g. GRB) but the magic test returns the byte-stream meaning it fails to detect the type, the question is whether we should take byte-stream as part of the decisions. After thinking about this this week for some time, i decide to ignore this byte-stream predicted by a test when taking the vote for which type should be used. e.g. magic test : Byte-Stream (failed to detect the type) extension test: GRB meta hint: none(Byte-Stream) The original design of the system in this case is expected to return Byte-Stream as the final decision for the type detection when magic test trust values(i.e. the presumed conditional probabilities) are set to high. 
secondly, after thinking for a while, i tend to think giving up on the prior might not be a good idea even though it simplifies a bit the computation, but that causes a bit of confusion and less intuitive. The intuition is that we probably can treat a detected type as a cause whose prior is 50% percent of correctness, i.e. 50% of chance that the detected type is correct, this seems to be more intuitive than completely ignoring the prior. After thinking about this problem for a while, it seems there are still some space where the algorithm can improved and optimized. Original design is intertwined with the consideration on computations which seems to cause some confusion, but actually the causal reasoning and intuition might be a bit more important, i will also be optimizing and correcting some of the factors which seems to be less appropriate in the original design. I am working on the improvement; if any thoughts, please kindly let me know. MIME type selection with probability Key:
[jira] [Commented] (TIKA-1330) Add robust tika-batch code
[ https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392027#comment-14392027 ] Hudson commented on TIKA-1330: -- SUCCESS: Integrated in tika-trunk-jdk1.7 #599 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/599/]) TIKA-1330 fix logging in TikaCLI to avoid adding multiple appenders (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670804) * /tika/trunk/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java * /tika/trunk/tika-app/src/main/resources/log4j.properties Add robust tika-batch code -- Key: TIKA-1330 URL: https://issues.apache.org/jira/browse/TIKA-1330 Project: Tika Issue Type: Sub-task Components: cli, general, server Reporter: Tim Allison Assignee: Tim Allison Attachments: TIKA-1330v1-patch.zip In my current design plan, I see creating a separate component tika-batch that includes a small bit of configurable code to run Tika against a large batch of documents. This code should be robust against OOM and hangs, and it should have fairly robust logging. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1323) Improve exception reporting in JAX-RS server
[ https://issues.apache.org/jira/browse/TIKA-1323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392026#comment-14392026 ] Hudson commented on TIKA-1323: -- SUCCESS: Integrated in tika-trunk-jdk1.7 #599 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/599/]) TIKA-1323: flush writer when printing stack trace (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670807) * /tika/trunk/tika-server/src/main/java/org/apache/tika/server/TikaServerParseExceptionMapper.java Improve exception reporting in JAX-RS server Key: TIKA-1323 URL: https://issues.apache.org/jira/browse/TIKA-1323 Project: Tika Issue Type: Improvement Components: server Reporter: Tim Allison Priority: Minor I'd like to use tika-server for TIKA-1302. As part of that, I'd like to record exception stacktraces per document. I see two options: transmit the info back to the client (assuming a doc didn't bring the server down :) ) along with the current error code or log the document id and stacktrace via the server. Given my current design thoughts, I'd prefer the first option. Any objections or recommendations? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
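For readers wondering why "flush writer when printing stack trace" matters: when a stack trace is printed through a buffered writer, the characters may still be sitting in the buffer when the underlying bytes are read. The following is a minimal sketch of that general pattern only; the class name StackTraces is hypothetical, and the actual change in TikaServerParseExceptionMapper is not shown in this message.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;

final class StackTraces {
    /** Renders a throwable's stack trace to a String via a buffered writer chain. */
    static String toString(Throwable t) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        PrintWriter writer = new PrintWriter(new OutputStreamWriter(bytes, StandardCharsets.UTF_8));
        t.printStackTrace(writer);
        // Without this flush, buffered characters might not have reached 'bytes' yet.
        writer.flush();
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(toString(new IllegalStateException("boom")));
    }
}
```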
RE: [jira] [Commented] (TIKA-1330) Add robust tika-batch code
The duplication/triplication of INFO - about to start driver INFO - about to start driver happened because main() adds a new appender with .configure(), so subsequent calls to main() in the tests were adding more appenders. I just fixed that in r1670804. What I can't figure out is why you're seeing anything. I've redirected both stdout and stderr in setup() (annotated @Before) to ByteArrayOutputStreams. If setup() weren't being called, you'd get NPEs for each of the four tests, so setup() must be getting called. -Original Message- From: Tyler Palsulich [mailto:tpalsul...@gmail.com] Sent: Wednesday, April 01, 2015 7:39 PM To: dev@tika.apache.org Subject: Re: [jira] [Commented] (TIKA-1330) Add robust tika-batch code All tests are passing. Only issue I see is excessive logging. The Hudson failure does just look like a hiccup. Tyler On Wed, Apr 1, 2015 at 2:55 PM, Allison, Timothy B. talli...@mitre.org wrote: This looks like a Hudson hiccup. Tyler is seeing excessive logging: Running org.apache.tika.cli.TikaCLIBatchIntegrationTest INFO - about to start driver INFO - about to start driver Anyone else having problems building from a fresh trunk? 
-Original Message- From: Hudson (JIRA) [mailto:j...@apache.org] Sent: Wednesday, April 01, 2015 5:36 PM To: dev@tika.apache.org Subject: [jira] [Commented] (TIKA-1330) Add robust tika-batch code [ https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391512#comment-14391512 ] Hudson commented on TIKA-1330: -- ABORTED: Integrated in tika-trunk-jdk1.7 #596 (See [ https://builds.apache.org/job/tika-trunk-jdk1.7/596/]) TIKA-1330 flush stacktrace writers (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670756) * /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/FileResourceConsumer.java * /tika/trunk/tika-batch/src/main/java/org/apache/tika/util/TikaExceptionFilter.java TIKA-1330 clean up logging in tika-batch ant tika-app integration of tika-batch, take 2 (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670751) * /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/FileResourceConsumer.java Add robust tika-batch code -- Key: TIKA-1330 URL: https://issues.apache.org/jira/browse/TIKA-1330 Project: Tika Issue Type: Sub-task Components: cli, general, server Reporter: Tim Allison Assignee: Tim Allison Attachments: TIKA-1330v1-patch.zip In my current design plan, I see creating a separate component tika-batch that includes a small bit of configurable code to run Tika against a large batch of documents. This code should be robust against OOM and hangs, and it should have fairly robust logging. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
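The appender duplication described at the top of this message (each call to main() configuring logging again, so each message is printed once per accumulated appender) is a common pitfall. One way to guard against it is sketched below. This is an assumption-laden illustration, not the change made in r1670804: it uses java.util.logging as a stand-in for log4j, and the names LoggingSetup, configureOnce and the logger name example.cli are invented for the example.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.logging.ConsoleHandler;
import java.util.logging.Logger;

final class LoggingSetup {
    private static final AtomicBoolean CONFIGURED = new AtomicBoolean(false);
    // Held in a static field so the logger (and its handlers) is not garbage collected.
    static final Logger LOG = Logger.getLogger("example.cli");

    /** Attaches a handler on the first call only; later calls are no-ops. */
    static void configureOnce() {
        if (CONFIGURED.compareAndSet(false, true)) {
            LOG.setUseParentHandlers(false);
            LOG.addHandler(new ConsoleHandler());
        }
    }

    public static void main(String[] args) {
        configureOnce();
        configureOnce(); // e.g. a second test calling main() again
        LOG.info("about to start driver"); // emitted by a single handler
    }
}
```

Without the guard, every extra call would add another handler and the same line would show up two or three times, which matches the symptom reported in the test output.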
[jira] [Commented] (TIKA-1590) A particular PDF seems to trigger an infinite loop when being converted to HTML
[ https://issues.apache.org/jira/browse/TIKA-1590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14392196#comment-14392196 ] Matt Sheppard commented on TIKA-1590: - Great, thanks - Looking forward to 1.8! A particular PDF seems to trigger an infinite loop when being converted to HTML --- Key: TIKA-1590 URL: https://issues.apache.org/jira/browse/TIKA-1590 Project: Tika Issue Type: Bug Affects Versions: 1.6, 1.7 Reporter: Matt Sheppard Fix For: 1.8 Attachments: National_Audit_tool_CTH_Audit_Report_PDF,_292_KB.pdf, jstack.txt The PDF at http://www.comcare.gov.au/__data/assets/pdf_file/0019/117244/National_Audit_tool_CTH_Audit_Report_PDF,_292_KB.pdf (which I'll also attach) appears to trigger an infinite loop (or at least is exceedingly slow) when being filtered by Tika. {noformat} java -jar tika-app-1.7.jar National_Audit_tool_CTH_Audit_Report_PDF,_292_KB.pdf ?xml version=1.0 encoding=UTF-8?html xmlns=http://www.w3.org/1999/xhtml; head meta name=date content=2015-02-05T04:48:30Z/ meta name=pdf:PDFVersion content=1.6/ meta name=xmp:CreatorTool content=Adobe InDesign CC 2014 (Macintosh)/ meta name=dc:description content=Licensee Improvement/ meta name=Keywords content=Licensee, Improvement, Program, NAT, CTH, Report#13;#10;/ meta name=subject content=Licensee Improvement/ meta name=dc:creator content=Comcare/ meta name=description content=Licensee Improvement/ meta name=dcterms:created content=2014-10-07T02:46:10Z/ meta name=Last-Modified content=2015-02-05T04:48:30Z/ meta name=dcterms:modified content=2015-02-05T04:48:30Z/ meta name=dc:format content=application/pdf; version=1.6/ meta name=Last-Save-Date content=2015-02-05T04:48:30Z/ meta name=meta:save-date content=2015-02-05T04:48:30Z/ meta name=pdf:encrypted content=false/ meta name=dc:title content=Licensee Improvement Program NAT (CTH) audit report/ meta name=modified content=2015-02-05T04:48:30Z/ meta name=cp:subject content=Licensee Improvement/ meta name=Content-Length 
content=299338/ meta name=Content-Type content=application/pdf/ meta name=X-Parsed-By content=org.apache.tika.parser.DefaultParser/ meta name=X-Parsed-By content=org.apache.tika.parser.pdf.PDFParser/ meta name=creator content=Comcare/ meta name=meta:author content=Comcare/ meta name=dc:subject content=Licensee, Improvement, Program, NAT, CTH, Report#13;#10;/ meta name=trapped content=False/ meta name=meta:creation-date content=2014-10-07T02:46:10Z/ meta name=created content=Tue Oct 07 13:46:10 AEDT 2014/ meta name=xmpTPg:NPages content=72/ meta name=Creation-Date content=2014-10-07T02:46:10Z/ meta name=resourceName content=National_Audit_tool_CTH_Audit_Report_PDF,_292_KB.pdf/ meta name=meta:keyword content=Licensee, Improvement, Program, NAT, CTH, Report#13;#10;/ meta name=Author content=Comcare/ meta name=producer content=Adobe PDF Library 11.0/ titleLicensee Improvement Program NAT (CTH) audit report/title /head bodydiv class=pagep/ pLICENSEE IMPROVEMENT PROGRAM [snip] /p pFinding: /p pEvidence: /p pComment: /p pObservation: /p pNon-conformance: /p p [just appears to hang forever at this point] {noformat} The relevant thread's stack is something like... 
{noformat} main #1 prio=5 os_prio=31 tid=0x7fbd6900b000 nid=0xf07 runnable [0x00010fc18000] java.lang.Thread.State: RUNNABLE at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:184) at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190) at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190) at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190) at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190) at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190) at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190) at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190) at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190) at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190) at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190) at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190) at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190) at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:179) at
[jira] [Commented] (TIKA-1330) Add robust tika-batch code
[ https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391255#comment-14391255 ] Hudson commented on TIKA-1330: -- SUCCESS: Integrated in tika-trunk-jdk1.7 #595 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/595/]) TIKA-1330 clean up logging in tika-batch ant tika-app integration of tika-batch (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670749) * /tika/trunk/tika-app/src/main/java/org/apache/tika/cli/BatchCommandLineBuilder.java * /tika/trunk/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java * /tika/trunk/tika-app/src/main/resources/log4j_batch_process.properties * /tika/trunk/tika-app/src/test/java/org/apache/tika/cli/TikaCLIBatchIntegrationTest.java * /tika/trunk/tika-app/src/test/java/org/apache/tika/cli/TikaCLITest.java * /tika/trunk/tika-app/src/test/resources/log4j_batch_process_test.properties * /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/BatchProcess.java * /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/BatchProcessDriverCLI.java * /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/FileResourceConsumer.java * /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/ParallelFileProcessingResult.java * /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/fs/AbstractFSConsumer.java * /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/fs/BasicTikaFSConsumer.java * /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/fs/FSBatchProcessCLI.java * /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/fs/RecursiveParserWrapperFSConsumer.java * /tika/trunk/tika-batch/src/test/resources/log4j.properties * /tika/trunk/tika-batch/src/test/resources/log4j_process.properties Add robust tika-batch code -- Key: TIKA-1330 URL: https://issues.apache.org/jira/browse/TIKA-1330 Project: Tika Issue Type: Sub-task Components: cli, general, server Reporter: Tim Allison Assignee: Tim Allison Attachments: 
TIKA-1330v1-patch.zip In my current design plan, I see creating a separate component tika-batch that includes a small bit of configurable code to run Tika against a large batch of documents. This code should be robust against OOM and hangs, and it should have fairly robust logging. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1586) Enable CORS on Tika Server
[ https://issues.apache.org/jira/browse/TIKA-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391050#comment-14391050 ] Hudson commented on TIKA-1586: -- SUCCESS: Integrated in tika-trunk-jdk1.7 #594 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/594/]) TIKA-1586. Change CORS short option to -C. (tpalsulich: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670683) * /tika/trunk/tika-server/src/main/java/org/apache/tika/server/TikaServerCli.java Enable CORS on Tika Server -- Key: TIKA-1586 URL: https://issues.apache.org/jira/browse/TIKA-1586 Project: Tika Issue Type: New Feature Components: server Reporter: Tyler Palsulich Assignee: Tyler Palsulich Tika Server should allow configuration of CORS requests (for uses like TIKA-1585). See [this example|http://cxf.apache.org/docs/jax-rs-cors.html] from CXF for how to add it. The only change from that site is that we will need to add a {{CrossOriginResourceSharingFilter}} as a provider. Ideally, this is configurable (limit which resources have CORS, and which origins are allowed). But, I'm not thinking of any general methods of how to do that... -- This message was sent by Atlassian JIRA (v6.3.4#6332)
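On the "which origins are allowed" part of the issue above, the decision a CORS filter has to make per request can be shown in isolation. The sketch below is not the CXF CrossOriginResourceSharingFilter and does not use any CXF or Tika API; AllowOrigin and headerFor are hypothetical names, and the code only illustrates choosing the Access-Control-Allow-Origin value from a configured set.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class AllowOrigin {
    /**
     * Returns the value to send in Access-Control-Allow-Origin,
     * or null when no such header should be sent.
     */
    static String headerFor(String requestOrigin, Set<String> allowed) {
        if (requestOrigin == null) {
            return null; // no Origin header: not a cross-origin request
        }
        if (allowed.contains("*")) {
            return "*"; // wildcard configured: any origin may read the response
        }
        return allowed.contains(requestOrigin) ? requestOrigin : null;
    }

    public static void main(String[] args) {
        Set<String> allowed = new HashSet<>(Arrays.asList("http://tpalsulich.github.io"));
        System.out.println(headerFor("http://tpalsulich.github.io", allowed));
        System.out.println(headerFor("http://example.org", allowed));
    }
}
```

Echoing back only origins from an explicit list is the stricter alternative to '*', which is the concern raised earlier in the thread about not allowing every origin by default.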
RE: [jira] [Commented] (TIKA-1330) Add robust tika-batch code
This looks like a Hudson hiccup. Tyler is seeing excessive logging: Running org.apache.tika.cli.TikaCLIBatchIntegrationTest INFO - about to start driver INFO - about to start driver Anyone else having problems building from a fresh trunk? -Original Message- From: Hudson (JIRA) [mailto:j...@apache.org] Sent: Wednesday, April 01, 2015 5:36 PM To: dev@tika.apache.org Subject: [jira] [Commented] (TIKA-1330) Add robust tika-batch code [ https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391512#comment-14391512 ] Hudson commented on TIKA-1330: -- ABORTED: Integrated in tika-trunk-jdk1.7 #596 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/596/]) TIKA-1330 flush stacktrace writers (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670756) * /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/FileResourceConsumer.java * /tika/trunk/tika-batch/src/main/java/org/apache/tika/util/TikaExceptionFilter.java TIKA-1330 clean up logging in tika-batch ant tika-app integration of tika-batch, take 2 (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670751) * /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/FileResourceConsumer.java Add robust tika-batch code -- Key: TIKA-1330 URL: https://issues.apache.org/jira/browse/TIKA-1330 Project: Tika Issue Type: Sub-task Components: cli, general, server Reporter: Tim Allison Assignee: Tim Allison Attachments: TIKA-1330v1-patch.zip In my current design plan, I see creating a separate component tika-batch that includes a small bit of configurable code to run Tika against a large batch of documents. This code should be robust against OOM and hangs, and it should have fairly robust logging. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1517) MIME type selection with probability
[ https://issues.apache.org/jira/browse/TIKA-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391476#comment-14391476 ] Luke sh commented on TIKA-1517: --- After some research, it looks like the algorithm design for probabilistic MIME type selection causes some confusion when compared with Naive Bayesian. The idea is borrowed from Naive Bayesian, but giving up some of its properties makes the design less intuitive, even though some computation is saved. One of the problems I have been considering is that when the magic test fails to determine a type, it returns byte-stream, and the original design takes that byte-stream as a decision. So, e.g., when the extension test returns a correct type (e.g. GRB) but the magic test returns byte-stream, meaning it failed to detect the type, the question is whether we should count byte-stream as one of the decisions. After thinking about this for some time this week, I have decided to ignore a byte-stream result from a test when voting on which type should be used. E.g. magic test: Byte-Stream; extension test: GRB; meta hint: none (Byte-Stream). The original design is expected to return Byte-Stream as the final decision for the type detection when the magic test trust values (i.e. the presumed conditional probabilities) are set high. Secondly, I tend to think giving up the prior might not be a good idea: it simplifies the computation a bit, but it causes some confusion and is less intuitive. The intuition is that we can probably treat a detected type as a cause whose prior probability of being correct is 50%, i.e. a 50% chance that the detected type is correct; this seems more intuitive than ignoring the prior completely. After thinking about this problem for a while, it seems there is still some room for correction and optimization. 
The original design is intertwined with computational considerations, which seems to cause some confusion, but the causal reasoning and intuition are probably more important. I will also be optimizing and correcting some of the factors that seem less appropriate in the original design. I am working on the improvement; if you have any thoughts, please kindly let me know. MIME type selection with probability Key: TIKA-1517 URL: https://issues.apache.org/jira/browse/TIKA-1517 Project: Tika Issue Type: Improvement Components: mime Affects Versions: 0.1-incubating, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.10, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6 Reporter: Luke sh Attachments: BaysianTest.java Improvement and intuition The original implementation of MIME type selection/detection is not very flexible by design, as it relies heavily on the outcome of magic-bytes MIME type identification. Thus, e.g., if magic-bytes is applicable to a file, Tika will follow the file type detected by magic-bytes. It may be better to provide more control over which method is used. The proposed approach borrows from Bayes' theorem: users are able to assign weights, expressed as probabilities, to each of the MIME type identification methods available in Tika, so they control which method is preferred. There are currently 3 methods for identifying the MIME type in Tika (i.e. Magic-Bytes, File extension and Metadata content-type hint). By putting weights on the methods, users are able to choose which one they trust most, although the magic-bytes method is often the most trustworthy. The benefit is that in some situations file type identification must be strict: some users might want all of the MIME type identification methods to agree on the same file type before they start processing those files, because incorrect file type identification is less tolerable for them. 
The current implementation seems less flexible for this purpose and relies heavily on the magic-bytes identification method (although magic-bytes is the most reliable of the three). Proposed design: The idea is to attach probabilities as weights to each MIME type identification method currently implemented in Tika (the magic-bytes approach, file extension match and metadata content-type hint). For example, as a user, I would probably like to assign a preference to each method based on how much I trust it, and to order the results if they don't coincide. Bayes' rule seems appropriate here to match that intuition. The following is what is needed for a Bayes' rule implementation. Prior probability
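(The issue description is cut off here in the archive.) To make the weighting idea concrete, here is a small sketch of Bayes-style voting over the three detectors. It is not the attached BaysianTest.java and not Tika's detector API; the class MimeVote, the trust values and the use of 1 - trust as the likelihood of a detector naming a wrong type are all simplifying assumptions made for illustration. It also applies two points from the comment above: an octet-stream (byte-stream) result is treated as an abstention, and each candidate starts from a prior such as 0.5.

```java
import java.util.Map;

final class MimeVote {
    static final String UNKNOWN = "application/octet-stream";

    /**
     * votes: detector name -> detected type; trust: detector name -> value in (0,1),
     * with an entry for every detector in votes. Returns the candidate with the
     * highest posterior, or UNKNOWN when every detector abstained.
     */
    static String select(Map<String, String> votes, Map<String, Double> trust, double prior) {
        String best = UNKNOWN;
        double bestPosterior = -1.0;
        for (String candidate : votes.values()) {
            if (UNKNOWN.equals(candidate)) {
                continue; // an abstention is never a candidate
            }
            double p = posterior(candidate, votes, trust, prior);
            if (p > bestPosterior) {
                bestPosterior = p;
                best = candidate;
            }
        }
        return best;
    }

    /** P(candidate is correct | votes), treating detectors as independent. */
    static double posterior(String candidate, Map<String, String> votes,
                            Map<String, Double> trust, double prior) {
        double ifCorrect = prior;
        double ifWrong = 1.0 - prior;
        for (Map.Entry<String, String> vote : votes.entrySet()) {
            if (UNKNOWN.equals(vote.getValue())) {
                continue; // abstentions carry no evidence either way
            }
            double t = trust.get(vote.getKey());
            boolean agrees = candidate.equals(vote.getValue());
            ifCorrect *= agrees ? t : 1.0 - t;
            ifWrong *= agrees ? 1.0 - t : t;
        }
        return ifCorrect / (ifCorrect + ifWrong);
    }
}
```

With the GRB case from the comment (magic: octet-stream, extension: a GRIB type, metadata hint: octet-stream), only the extension test contributes evidence, so the extension's type is selected instead of octet-stream, and its posterior with a 0.5 prior is simply the extension test's trust value.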
[jira] [Commented] (TIKA-1330) Add robust tika-batch code
[ https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391512#comment-14391512 ] Hudson commented on TIKA-1330: -- ABORTED: Integrated in tika-trunk-jdk1.7 #596 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/596/]) TIKA-1330 flush stacktrace writers (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670756) * /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/FileResourceConsumer.java * /tika/trunk/tika-batch/src/main/java/org/apache/tika/util/TikaExceptionFilter.java TIKA-1330 clean up logging in tika-batch ant tika-app integration of tika-batch, take 2 (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670751) * /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/FileResourceConsumer.java Add robust tika-batch code -- Key: TIKA-1330 URL: https://issues.apache.org/jira/browse/TIKA-1330 Project: Tika Issue Type: Sub-task Components: cli, general, server Reporter: Tim Allison Assignee: Tim Allison Attachments: TIKA-1330v1-patch.zip In my current design plan, I see creating a separate component tika-batch that includes a small bit of configurable code to run Tika against a large batch of documents. This code should be robust against OOM and hangs, and it should have fairly robust logging. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TIKA-1591) Tika Parsers uses wrong version of bouncycastle
Ben McCann created TIKA-1591: Summary: Tika Parsers uses wrong version of bouncycastle Key: TIKA-1591 URL: https://issues.apache.org/jira/browse/TIKA-1591 Project: Tika Issue Type: Bug Reporter: Ben McCann Tika uses: <dependency> <groupId>org.bouncycastle</groupId> <artifactId>bcmail-jdk15</artifactId> <version>1.45</version> </dependency> It's not recommended to use bcmail-jdk15, which is the old artifact name. Instead bcmail-jdk15on should be used (latest version is 1.52) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TIKA-1517) MIME type selection with probability
[ https://issues.apache.org/jira/browse/TIKA-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391476#comment-14391476 ] Luke sh edited comment on TIKA-1517 at 4/1/15 10:38 PM: After some research, it looks like the algorithm design with probabilistic mime type selection seems to cause some confusion vs Naive Bayesian, the idea is borrowed from Naive Bayesian, but it turns out giving up on some properties of Naive Bayesian seems to cause some confusions and therefore less intuitive although some computations are skimped. One of the problems i have been considering is that when magic test fails to determine a type, it returns byte-stream, in the original design, the original design approaches this by taking byte-stream as a decision, so e.g. when the extension test returns a correct type (e.g. GRB) but the magic test returns the byte-stream meaning it fails to detect the type, the question is whether we should take byte-stream as part of the decisions. After thinking about this this week for some time, i decide to ignore this byte-stream predicted by a test when taking vote for decision on which type should be used. e.g. magic test : Byte-Stream extension test: GRB meta hint: none(Byte-Stream) The original design of the system is expected to return Byte-Stream as the final decision for the type detection when magic test trust values(i.e. the presumed conditional probabilities) are set to high. secondly, after thinking for a while, i tend to think giving up on the prior might not be a good idea even though it simplifies a bit the computation, but that causes a bit of confusion and less intuitive. The intuition is that we probably can treat a detected type as a cause whose prior is 50% percent of correctness, i.e. 50% of chance that the detected type is correct, this seems to be more intuitive than completely ignoring the prior. 
After thinking about this problem for a while, it seems there are still some space to be corrected and optimized. Original design is intertwined with the consideration on computations which seems to cause some confusion, but actually the causal reasoning and intuition might be a bit more important, i will also be optimizing and correcting some of the factors which seems to be less appropriate in the original design. I am working on the improvement; if any thoughts, please kindly let me know. was (Author: lukeliush): After some research, it looks like the algorithm design with probabilistic mime type selection seems to cause some confusion vs Naive Bayesian, the idea is borrowed from Naive Bayesian, but it turns out giving up on some properties of Naive Bayesian seems to cause some confusions and therefore less intuitive although some computations are skimped. One of the problems i have been considering is that when magic test fails to determine a type, it returns byte-stream, in the original design, the original design approaches this by taking byte-stream as a decision, so e.g. when the extension test returns a correct type (e.g. GRB) but the magic test returns the byte-stream meaning it fails to detect the type, the question is should we take byte-stream as part of the decisions. After thinking about this this week for some time, i decide to ignore this byte-stream predicted by a test when taking vote for decision on which type should be used. e.g. magic test : Byte-Stream extension test: GRB meta hint: none(Byte-Stream) The original design of the system is expected to return Byte-Stream as the final decision for the type detection when magic test trust values(i.e. the presumed conditional probabilities) are set to high. secondly, after thinking for a while, i tend to think giving up on the prior might not be a good idea even though it simplifies a bit the computation, but that causes a bit of confusion and less intuitive. 
The intuition is that we probably can treat a detected type as a cause whose prior is 50% percent of correctness, i.e. 50% of chance that the detected type is correct, this seems to be more intuitive than completely ignoring the prior. After thinking about this problem for a while, it seems there are still some space to be corrected and optimized. Original design is intertwined with the consideration on computations which seems to cause some confusion, but actually the causal reasoning and intuition might be a bit more important, i will also be optimizing and correcting some of the factors which seems to be less appropriate in the original design. I am working on the improvement; if any thoughts, please kindly let me know. MIME type selection with probability Key: TIKA-1517 URL: https://issues.apache.org/jira/browse/TIKA-1517 Project: Tika Issue Type: Improvement Components:
[jira] [Comment Edited] (TIKA-1517) MIME type selection with probability
[ https://issues.apache.org/jira/browse/TIKA-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391476#comment-14391476 ] Luke sh edited comment on TIKA-1517 at 4/1/15 10:37 PM: After some research, it looks like the algorithm design with probabilistic mime type selection seems to cause some confusion vs Naive Bayesian, the idea is borrowed from Naive Bayesian, but it turns out giving up on some properties of Naive Bayesian seems to cause some confusions and therefore less intuitive although some computations are skimped. One of the problems i have been considering is that when magic test fails to determine a type, it returns byte-stream, in the original design, the original design approaches this by taking byte-stream as a decision, so e.g. when the extension test returns a correct type (e.g. GRB) but the magic test returns the byte-stream meaning it fails to detect the type, the question is should we take byte-stream as part of the decisions. After thinking about this this week for some time, i decide to ignore this byte-stream predicted by a test when taking vote for decision on which type should be used. e.g. magic test : Byte-Stream extension test: GRB meta hint: none(Byte-Stream) The original design of the system is expected to return Byte-Stream as the final decision for the type detection when magic test trust values(i.e. the presumed conditional probabilities) are set to high. secondly, after thinking for a while, i tend to think giving up on the prior might not be a good idea even though it simplifies a bit the computation, but that causes a bit of confusion and less intuitive. The intuition is that we probably can treat a detected type as a cause whose prior is 50% percent of correctness, i.e. 50% of chance that the detected type is correct, this seems to be more intuitive than completely ignoring the prior. 
After thinking about this problem for a while, it seems there are still some space to be corrected and optimized. Original design is intertwined with the consideration on computations which seems to cause some confusion, but actually the causal reasoning and intuition might be a bit more important, i will also be optimizing and correcting some of the factors which seems to be less appropriate in the original design. I am working on the improvement; if any thoughts, please kindly let me know. was (Author: lukeliush): After some research, it looks like the algorithm design with probabilistic mime type selection seems to be cause some confusion vs Naive Bayesian, the idea is borrowed from Naive Bayesian, but it turns out giving up on some properties of Naive Bayesian seems to cause some confusions and therefore less intuitive although some computations are skimped. One of the problems i have been considering is that when magic test fails to determine a type, it returns byte-stream, in the original design, the original design approaches this by taking byte-stream as a decision, so e.g. when the extension test returns a correct type (e.g. GRB) but the magic test returns the byte-stream meaning it fails to detect the type, the question is should we take byte-stream as part of the decisions. After thinking about this this week for some time, i decide to ignore this byte-stream predicted by a test when taking vote for decision on which type should be used. e.g. magic test : Byte-Stream extension test: GRB meta hint: none(Byte-Stream) The original design of the system is expected to return Byte-Stream as the final decision for the type detection when magic test trust values(i.e. the presumed conditional probabilities) are set to high. secondly, after thinking for a while, i tend to think giving up on the prior might not be a good idea even though it simplifies a bit the computation, but that causes a bit of confusion and less intuitive. 
The intuition is that we probably can treat a detected type as a cause whose prior is 50% percent of correctness, i.e. 50% of chance that the detected type is correct, this seems to be more intuitive than completely ignoring the prior. After thinking about this problem for a while, it seems there are still some space to be corrected and optimized. Original design is intertwined with the consideration on computations which seems to cause some confusion, but actually the causal reasoning and intuition might be a bit more important, i will also be optimizing and correcting some of the factors which seems to be less appropriate in the original design. I am working on the improvement; if any thoughts, please kindly let me know. MIME type selection with probability Key: TIKA-1517 URL: https://issues.apache.org/jira/browse/TIKA-1517 Project: Tika Issue Type: Improvement Components: mime
[jira] [Comment Edited] (TIKA-1517) MIME type selection with probability
[ https://issues.apache.org/jira/browse/TIKA-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391476#comment-14391476 ] Luke sh edited comment on TIKA-1517 at 4/1/15 10:39 PM: After some research, it looks like the algorithm design with probabilistic mime type selection seems to cause some confusion vs Naive Bayesian, the idea is borrowed from Naive Bayesian, but it turns out giving up on some properties of Naive Bayesian seems to cause some confusions and therefore less intuitive although some computations are skimped. One of the problems i have been considering is that when magic test fails to determine a type, it returns byte-stream, in the original design, the original design approaches this by taking byte-stream as a decision, so e.g. when the extension test returns a correct type (e.g. GRB) but the magic test returns the byte-stream meaning it fails to detect the type, the question is whether we should take byte-stream as part of the decisions. After thinking about this this week for some time, i decide to ignore this byte-stream predicted by a test when taking the vote for which type should be used. e.g. magic test : Byte-Stream extension test: GRB meta hint: none(Byte-Stream) The original design of the system is expected to return Byte-Stream as the final decision for the type detection when magic test trust values(i.e. the presumed conditional probabilities) are set to high. secondly, after thinking for a while, i tend to think giving up on the prior might not be a good idea even though it simplifies a bit the computation, but that causes a bit of confusion and less intuitive. The intuition is that we probably can treat a detected type as a cause whose prior is 50% percent of correctness, i.e. 50% of chance that the detected type is correct, this seems to be more intuitive than completely ignoring the prior. 
After thinking about this problem for a while, it seems there are still some space to be corrected and optimized. Original design is intertwined with the consideration on computations which seems to cause some confusion, but actually the causal reasoning and intuition might be a bit more important, i will also be optimizing and correcting some of the factors which seems to be less appropriate in the original design. I am working on the improvement; if any thoughts, please kindly let me know. was (Author: lukeliush): After some research, it looks like the algorithm design with probabilistic mime type selection seems to cause some confusion vs Naive Bayesian, the idea is borrowed from Naive Bayesian, but it turns out giving up on some properties of Naive Bayesian seems to cause some confusions and therefore less intuitive although some computations are skimped. One of the problems i have been considering is that when magic test fails to determine a type, it returns byte-stream, in the original design, the original design approaches this by taking byte-stream as a decision, so e.g. when the extension test returns a correct type (e.g. GRB) but the magic test returns the byte-stream meaning it fails to detect the type, the question is whether we should take byte-stream as part of the decisions. After thinking about this this week for some time, i decide to ignore this byte-stream predicted by a test when taking vote for decision on which type should be used. e.g. magic test : Byte-Stream extension test: GRB meta hint: none(Byte-Stream) The original design of the system is expected to return Byte-Stream as the final decision for the type detection when magic test trust values(i.e. the presumed conditional probabilities) are set to high. secondly, after thinking for a while, i tend to think giving up on the prior might not be a good idea even though it simplifies a bit the computation, but that causes a bit of confusion and less intuitive. 
The intuition is that we probably can treat a detected type as a cause whose prior is 50% percent of correctness, i.e. 50% of chance that the detected type is correct, this seems to be more intuitive than completely ignoring the prior. After thinking about this problem for a while, it seems there are still some space to be corrected and optimized. Original design is intertwined with the consideration on computations which seems to cause some confusion, but actually the causal reasoning and intuition might be a bit more important, i will also be optimizing and correcting some of the factors which seems to be less appropriate in the original design. I am working on the improvement; if any thoughts, please kindly let me know. MIME type selection with probability Key: TIKA-1517 URL: https://issues.apache.org/jira/browse/TIKA-1517 Project: Tika Issue Type: Improvement Components:
[jira] [Comment Edited] (TIKA-1517) MIME type selection with probability
[ https://issues.apache.org/jira/browse/TIKA-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391476#comment-14391476 ] Luke sh edited comment on TIKA-1517 at 4/1/15 10:39 PM: After some research, it looks like the algorithm design with probabilistic mime type selection seems to cause some confusion vs Naive Bayesian, the idea is borrowed from Naive Bayesian, but it turns out giving up on some properties of Naive Bayesian seems to cause some confusions and therefore less intuitive although some computations are skimped. One of the problems i have been considering is that when magic test fails to determine a type, it returns byte-stream, in the original design, the original design approaches this by taking byte-stream as a decision, so e.g. when the extension test returns a correct type (e.g. GRB) but the magic test returns the byte-stream meaning it fails to detect the type, the question is whether we should take byte-stream as part of the decisions. After thinking about this this week for some time, i decide to ignore this byte-stream predicted by a test when taking the vote for which type should be used. e.g. magic test : Byte-Stream (failed to detect the type) extension test: GRB meta hint: none(Byte-Stream) The original design of the system is expected to return Byte-Stream as the final decision for the type detection when magic test trust values(i.e. the presumed conditional probabilities) are set to high. secondly, after thinking for a while, i tend to think giving up on the prior might not be a good idea even though it simplifies a bit the computation, but that causes a bit of confusion and less intuitive. The intuition is that we probably can treat a detected type as a cause whose prior is 50% percent of correctness, i.e. 50% of chance that the detected type is correct, this seems to be more intuitive than completely ignoring the prior. 
After thinking about this problem for a while, it seems there are still some space to be corrected and optimized. Original design is intertwined with the consideration on computations which seems to cause some confusion, but actually the causal reasoning and intuition might be a bit more important, i will also be optimizing and correcting some of the factors which seems to be less appropriate in the original design. I am working on the improvement; if any thoughts, please kindly let me know. was (Author: lukeliush): After some research, it looks like the algorithm design with probabilistic mime type selection seems to cause some confusion vs Naive Bayesian, the idea is borrowed from Naive Bayesian, but it turns out giving up on some properties of Naive Bayesian seems to cause some confusions and therefore less intuitive although some computations are skimped. One of the problems i have been considering is that when magic test fails to determine a type, it returns byte-stream, in the original design, the original design approaches this by taking byte-stream as a decision, so e.g. when the extension test returns a correct type (e.g. GRB) but the magic test returns the byte-stream meaning it fails to detect the type, the question is whether we should take byte-stream as part of the decisions. After thinking about this this week for some time, i decide to ignore this byte-stream predicted by a test when taking the vote for which type should be used. e.g. magic test : Byte-Stream extension test: GRB meta hint: none(Byte-Stream) The original design of the system is expected to return Byte-Stream as the final decision for the type detection when magic test trust values(i.e. the presumed conditional probabilities) are set to high. secondly, after thinking for a while, i tend to think giving up on the prior might not be a good idea even though it simplifies a bit the computation, but that causes a bit of confusion and less intuitive. 
The intuition is that we probably can treat a detected type as a cause whose prior is 50% percent of correctness, i.e. 50% of chance that the detected type is correct, this seems to be more intuitive than completely ignoring the prior. After thinking about this problem for a while, it seems there are still some space to be corrected and optimized. Original design is intertwined with the consideration on computations which seems to cause some confusion, but actually the causal reasoning and intuition might be a bit more important, i will also be optimizing and correcting some of the factors which seems to be less appropriate in the original design. I am working on the improvement; if any thoughts, please kindly let me know. MIME type selection with probability Key: TIKA-1517 URL: https://issues.apache.org/jira/browse/TIKA-1517 Project: Tika Issue Type: Improvement
[jira] [Comment Edited] (TIKA-1517) MIME type selection with probability
[ https://issues.apache.org/jira/browse/TIKA-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391476#comment-14391476 ] Luke sh edited comment on TIKA-1517 at 4/1/15 10:41 PM: After some research, it looks like the design of the probabilistic MIME type selection algorithm causes some confusion when compared with Naive Bayes. The idea is borrowed from Naive Bayes, but giving up some of its properties makes the design confusing and less intuitive, even though it saves some computation. One problem I have been considering is that when the magic test fails to determine a type, it returns byte-stream. The original design treats byte-stream as a decision, so when, for example, the extension test returns a correct type (e.g. GRB) but the magic test returns byte-stream, meaning it failed to detect the type, the question is whether we should count byte-stream as one of the decisions. After thinking about this for some time this week, I have decided to ignore a byte-stream result predicted by a test when taking the vote on which type should be used. e.g. magic test: Byte-Stream (failed to detect the type) extension test: GRB meta hint: none (Byte-Stream) In this case the original design is expected to return Byte-Stream as the final decision for type detection when the magic test trust values (i.e. the presumed conditional probabilities) are set high. Secondly, after thinking for a while, I tend to think that giving up the prior might not be a good idea: it simplifies the computation a bit, but it causes some confusion and is less intuitive. The intuition is that we can probably treat a detected type as a cause whose prior probability of being correct is 50%, i.e. there is a 50% chance that the detected type is correct; this seems more intuitive than ignoring the prior completely.
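The two changes described in this comment (skipping byte-stream results in the vote, and using a 50% prior) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the class VoteSketch, its Vote type, the MIME string used for GRB, the trust values, and the symmetric treatment of disagreement (a test that disagrees with a candidate counts as 1 - trust) are hypothetical and are not taken from Tika or from the attached BaysianTest.java. Trust values are assumed to lie strictly between 0 and 1.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch, not Tika API: vote over detector results, ignoring byte-stream. */
public final class VoteSketch {
    public static final String OCTET_STREAM = "application/octet-stream";

    /** One detector's guess and the trust placed in it, i.e. an assumed P(test = t | t). */
    public static final class Vote {
        public final String type;
        public final double trust;

        public Vote(String type, double trust) {
            this.type = type;
            this.trust = trust;
        }
    }

    /** Posterior for every candidate type; byte-stream votes are treated as abstentions. */
    public static Map<String, Double> score(Vote[] votes, double prior) {
        Map<String, Double> posteriors = new LinkedHashMap<>();
        for (Vote candidate : votes) {
            if (OCTET_STREAM.equals(candidate.type) || posteriors.containsKey(candidate.type)) {
                continue; // byte-stream is never a candidate; score each type only once
            }
            double ifCorrect = prior;      // running P(candidate) * P(votes | candidate)
            double ifWrong = 1.0 - prior;  // running P(not candidate) * P(votes | not candidate)
            for (Vote v : votes) {
                if (OCTET_STREAM.equals(v.type)) {
                    continue; // the test failed to detect a type, so it contributes nothing
                }
                boolean agrees = v.type.equals(candidate.type);
                // Assumption: a disagreeing (or falsely agreeing) test counts as 1 - trust.
                ifCorrect *= agrees ? v.trust : 1.0 - v.trust;
                ifWrong *= agrees ? 1.0 - v.trust : v.trust;
            }
            posteriors.put(candidate.type, ifCorrect / (ifCorrect + ifWrong));
        }
        return posteriors;
    }

    /** Highest-scoring type, or byte-stream when every test abstained. */
    public static String best(Vote[] votes, double prior) {
        String bestType = OCTET_STREAM;
        double bestScore = -1.0;
        for (Map.Entry<String, Double> e : score(votes, prior).entrySet()) {
            if (e.getValue() > bestScore) {
                bestScore = e.getValue();
                bestType = e.getKey();
            }
        }
        return bestType;
    }

    public static void main(String[] args) {
        Vote[] votes = {
            new Vote(OCTET_STREAM, 0.75),         // magic test: failed to detect the type
            new Vote("application/x-grib", 0.65), // extension test: GRB
            new Vote(OCTET_STREAM, 0.70)          // metadata hint: none
        };
        System.out.println(best(votes, 0.5));
    }
}
```

With the example from the comment, only the extension test casts a real vote, so the GRB type is selected instead of byte-stream, whatever trust value is given to the magic test.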
After thinking about this problem for a while, it seems there is still some room where the algorithm can be improved and optimized. The original design is intertwined with computational considerations, which seems to cause some confusion, but the causal reasoning and intuition are arguably more important. I will also be optimizing and correcting some of the factors that seem less appropriate in the original design. I am working on the improvement; if you have any thoughts, please kindly let me know. was (Author: lukeliush): After some research, it looks like the algorithm design with probabilistic mime type selection seems to cause some confusion vs Naive Bayesian, the idea is borrowed from Naive Bayesian, but it turns out giving up on some properties of Naive Bayesian seems to cause some confusions and therefore less intuitive although some computations are skimped. One of the problems i have been considering is that when magic test fails to determine a type, it returns byte-stream, in the original design, the original design approaches this by taking byte-stream as a decision, so e.g. when the extension test returns a correct type (e.g. GRB) but the magic test returns the byte-stream meaning it fails to detect the type, the question is whether we should take byte-stream as part of the decisions. After thinking about this this week for some time, i decide to ignore this byte-stream predicted by a test when taking the vote for which type should be used. e.g. magic test : Byte-Stream (failed to detect the type) extension test: GRB meta hint: none(Byte-Stream) The original design of the system in this case is expected to return Byte-Stream as the final decision for the type detection when magic test trust values(i.e. the presumed conditional probabilities) are set to high. 
secondly, after thinking for a while, i tend to think giving up on the prior might not be a good idea even though it simplifies a bit the computation, but that causes a bit of confusion and less intuitive. The intuition is that we probably can treat a detected type as a cause whose prior is 50% percent of correctness, i.e. 50% of chance that the detected type is correct, this seems to be more intuitive than completely ignoring the prior. After thinking about this problem for a while, it seems there are still some space to be corrected and optimized. Original design is intertwined with the consideration on computations which seems to cause some confusion, but actually the causal reasoning and intuition might be a bit more important, i will also be optimizing and correcting some of the factors which seems to be less appropriate in the original design. I am working on the improvement; if any thoughts, please kindly let me know. MIME type selection with probability Key: TIKA-1517 URL:
[jira] [Comment Edited] (TIKA-1517) MIME type selection with probability
[ https://issues.apache.org/jira/browse/TIKA-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391476#comment-14391476 ] Luke sh edited comment on TIKA-1517 at 4/1/15 10:39 PM: After some research, it looks like the algorithm design with probabilistic mime type selection seems to cause some confusion vs Naive Bayesian, the idea is borrowed from Naive Bayesian, but it turns out giving up on some properties of Naive Bayesian seems to cause some confusions and therefore less intuitive although some computations are skimped. One of the problems i have been considering is that when magic test fails to determine a type, it returns byte-stream, in the original design, the original design approaches this by taking byte-stream as a decision, so e.g. when the extension test returns a correct type (e.g. GRB) but the magic test returns the byte-stream meaning it fails to detect the type, the question is whether we should take byte-stream as part of the decisions. After thinking about this this week for some time, i decide to ignore this byte-stream predicted by a test when taking the vote for which type should be used. e.g. magic test : Byte-Stream (failed to detect the type) extension test: GRB meta hint: none(Byte-Stream) The original design of the system in this case is expected to return Byte-Stream as the final decision for the type detection when magic test trust values(i.e. the presumed conditional probabilities) are set to high. secondly, after thinking for a while, i tend to think giving up on the prior might not be a good idea even though it simplifies a bit the computation, but that causes a bit of confusion and less intuitive. The intuition is that we probably can treat a detected type as a cause whose prior is 50% percent of correctness, i.e. 50% of chance that the detected type is correct, this seems to be more intuitive than completely ignoring the prior. 
After thinking about this problem for a while, it seems there are still some space to be corrected and optimized. Original design is intertwined with the consideration on computations which seems to cause some confusion, but actually the causal reasoning and intuition might be a bit more important, i will also be optimizing and correcting some of the factors which seems to be less appropriate in the original design. I am working on the improvement; if any thoughts, please kindly let me know. was (Author: lukeliush): After some research, it looks like the algorithm design with probabilistic mime type selection seems to cause some confusion vs Naive Bayesian, the idea is borrowed from Naive Bayesian, but it turns out giving up on some properties of Naive Bayesian seems to cause some confusions and therefore less intuitive although some computations are skimped. One of the problems i have been considering is that when magic test fails to determine a type, it returns byte-stream, in the original design, the original design approaches this by taking byte-stream as a decision, so e.g. when the extension test returns a correct type (e.g. GRB) but the magic test returns the byte-stream meaning it fails to detect the type, the question is whether we should take byte-stream as part of the decisions. After thinking about this this week for some time, i decide to ignore this byte-stream predicted by a test when taking the vote for which type should be used. e.g. magic test : Byte-Stream (failed to detect the type) extension test: GRB meta hint: none(Byte-Stream) The original design of the system is expected to return Byte-Stream as the final decision for the type detection when magic test trust values(i.e. the presumed conditional probabilities) are set to high. secondly, after thinking for a while, i tend to think giving up on the prior might not be a good idea even though it simplifies a bit the computation, but that causes a bit of confusion and less intuitive. 
The intuition is that we probably can treat a detected type as a cause whose prior is 50% percent of correctness, i.e. 50% of chance that the detected type is correct, this seems to be more intuitive than completely ignoring the prior. After thinking about this problem for a while, it seems there are still some space to be corrected and optimized. Original design is intertwined with the consideration on computations which seems to cause some confusion, but actually the causal reasoning and intuition might be a bit more important, i will also be optimizing and correcting some of the factors which seems to be less appropriate in the original design. I am working on the improvement; if any thoughts, please kindly let me know. MIME type selection with probability Key: TIKA-1517 URL: https://issues.apache.org/jira/browse/TIKA-1517 Project:
[jira] [Updated] (TIKA-1517) MIME type selection with probability
[ https://issues.apache.org/jira/browse/TIKA-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luke sh updated TIKA-1517: -- Priority: Trivial (was: Major) MIME type selection with probability Key: TIKA-1517 URL: https://issues.apache.org/jira/browse/TIKA-1517 Project: Tika Issue Type: Improvement Components: mime Affects Versions: 0.1-incubating, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.10, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6 Reporter: Luke sh Priority: Trivial Attachments: BaysianTest.java Improvement and intuition The original implementation of MIME type selection/detection is somewhat inflexible by design, as it relies heavily on the outcome of magic-bytes MIME type identification; for example, if magic-bytes is applicable to a file, Tika follows the file type detected by magic-bytes. It may be better to provide more control over the choice of method. The proposed approach borrows from Bayes' theorem: users can assign a weight, expressed as a probability, to each of the MIME type identification methods available in Tika, and so control which method is preferred. There are currently three such methods in Tika (magic-bytes, file extension, and metadata content-type hint). By putting weights on the methods, users can choose which one they trust most, although the magic-bytes method is often trustworthy. The benefit is that in some situations file type identification must be strict: some users might want all of the MIME type identification methods to agree on the same file type before they start processing a file, because incorrect file type identification is less tolerable there. 
The current implementation seems less flexible for this purpose and relies heavily on the magic-bytes file identification method (although magic-bytes is the most reliable of the three). Proposed design: The idea of selection is to incorporate probabilities as weights on each MIME type identification method currently implemented in Tika (the magic-bytes approach, file extension match, and metadata content-type hint). For example, as a user, I would probably like to assign a preference to each method based on how much I trust it, and to order the results if they don't coincide. Bayes' rule seems appropriate here to match that intuition. The following is what is needed for a Bayes' rule implementation. Prior probability P(file_type), e.g. P(pdf): in theory this is computed from samples, and it depends on the domain or use case. Intuitively we care more about the order of the weights or probabilities of the results than about the actual numbers, and the prior also depends on the samples for a particular use case or domain. For example, if we happen to crawl a website that contains mostly PDF files, we can collect some samples and compute the prior; if the samples show that 90% of the documents are PDF, our prior is P(pdf) = 0.9. Here, however, we propose to make the prior a configurable parameter for users, and by default we leave the prior as not applicable. 
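As a rough illustration of how a sample-based prior such as P(pdf) = 0.9 would enter Bayes' rule for a single test, here is a minimal sketch. The class and method names are hypothetical, and the hit rate of 0.75 and false-positive rate of 0.25 are illustrative values chosen for this example only; none of this is Tika API.

```java
/** Hypothetical sketch, not Tika API: Bayes' rule for one detector and one candidate type. */
public final class PriorSketch {

    /**
     * P(type | test says type), where
     * prior         = P(type),
     * truePositive  = P(test says type | type),
     * falsePositive = P(test says type | not type).
     */
    public static double posterior(double prior, double truePositive, double falsePositive) {
        double evidence = truePositive * prior + falsePositive * (1.0 - prior);
        return truePositive * prior / evidence;
    }

    public static void main(String[] args) {
        // Crawl dominated by PDFs: prior taken from samples, P(pdf) = 0.9.
        System.out.println(posterior(0.9, 0.75, 0.25));
        // Same test result on a corpus where PDFs are rare, P(pdf) = 0.1.
        System.out.println(posterior(0.1, 0.75, 0.25));
    }
}
```

The same test result is much more convincing when the prior is high than when it is low, which is why a prior that differs per type or per corpus can change the ordering of candidate types, while a prior that is identical for every type cannot.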
Alternatively, we can define the prior for each file type to be 1/[number of supported file types in Tika]; I think the number would be approximately 1/1157, and using it seems fairer. The reason for avoiding it is that this prior is the same for every type. In the end we care more about the order of the results, and if the prior is fixed, the order is fixed as well: bringing 1/1157 into the Bayesian equation cannot affect the order, and it only burdens our implementation with extra computation. Thus we leave it as not applicable, which means we assign 1 to it, as if it did not exist. Note again that we care more about the order than about the actual numbers, that this parameter is configurable, and that we believe it provides a lot of flexibility in some use cases. Conditional probability of a positive test given a file type, P(test | file_type), e.g. P(test1 = pdf | pdf): this probability also depends on a collection of samples and on the domain or use case, so we leave it configurable. Based on our intuition, test1 (i.e. the magic-bytes method) is the most trustworthy, so the default value of P(test1 = a_file_type | a_file_type) is 0.75; that is, given a file whose type is a_file_type, the probability that test1 predicts a_file_type is 0.75. This really is just our intuition, as we trust test1 most. Next, we propose to use 0.7 for test3 and 0.65 for test2; (note again, test1 = magic-bytes, test2