[GitHub] tika pull request: fix for TIKA-1589 contributed by mdaniline
GitHub user mdaniline opened a pull request:

    https://github.com/apache/tika/pull/38

    fix for TIKA-1589 contributed by mdaniline

    https://issues.apache.org/jira/browse/TIKA-1589

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mdaniline/tika TIKA-1589

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/tika/pull/38.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #38

commit fb29412710ea058f89d3c6df5078587768dcac74
Author: Max Daniline <maxim.danil...@softwire.com>
Date:   2015-03-31T12:49:43Z

    fix for TIKA-1589 contributed by mdaniline

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---
[jira] [Commented] (TIKA-1589) Mp3 parser does not add duration to metadata if there are no ID3 tags
[ https://issues.apache.org/jira/browse/TIKA-1589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388478#comment-14388478 ]

ASF GitHub Bot commented on TIKA-1589:
--------------------------------------

GitHub user mdaniline opened a pull request:

    https://github.com/apache/tika/pull/38

    fix for TIKA-1589 contributed by mdaniline

Mp3 parser does not add duration to metadata if there are no ID3 tags
---------------------------------------------------------------------

                Key: TIKA-1589
                URL: https://issues.apache.org/jira/browse/TIKA-1589
            Project: Tika
         Issue Type: Bug
           Reporter: Max Daniline

Steps to reproduce:
* Have a file without any ID3 tags (v1 or v2)
* Parse the file
* Attempt to retrieve the duration by calling 'metadata.get(XMPDM.DURATION)'.

Expected result: The duration should be set even for a file without ID3 tags, since it is independent information.

Actual result: The duration is null

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1589) Mp3 parser does not add duration to metadata if there are no ID3 tags
[ https://issues.apache.org/jira/browse/TIKA-1589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388496#comment-14388496 ]

Max Daniline commented on TIKA-1589:
------------------------------------

I've raised a PR to fix this: https://github.com/apache/tika/pull/38
[jira] [Issue Comment Deleted] (TIKA-1589) Mp3 parser does not add duration to metadata if there are no ID3 tags
[ https://issues.apache.org/jira/browse/TIKA-1589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Max Daniline updated TIKA-1589:
------------------------------

Comment: was deleted

(was: I've raised a PR to fix this: https://github.com/apache/tika/pull/38)
[jira] [Created] (TIKA-1589) Mp3 parser does not add duration to metadata if there are no ID3 tags
Max Daniline created TIKA-1589:
----------------------------------

            Summary: Mp3 parser does not add duration to metadata if there are no ID3 tags
                Key: TIKA-1589
                URL: https://issues.apache.org/jira/browse/TIKA-1589
            Project: Tika
         Issue Type: Bug
           Reporter: Max Daniline

Steps to reproduce:
* Have a file without any ID3 tags (v1 or v2)
* Parse the file
* Attempt to retrieve the duration by calling 'metadata.get(XMPDM.DURATION)'.

Expected result: The duration should be set even for a file without ID3 tags, since it is independent information.

Actual result: The duration is null
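
As a rough illustration of why the duration is independent of ID3 tags: for MPEG-1 Layer III audio each frame carries 1152 samples, so the playing time follows from the number of audio frames and the sample rate alone. The sketch below is not Tika's Mp3Parser code; the class and method names are hypothetical, and it assumes the frame count and sample rate have already been read from the frame headers.

```java
/** Hypothetical helper (not part of Tika): duration from MPEG audio frame data alone. */
final class Mp3DurationSketch {

    /** Every MPEG-1 Layer III frame carries 1152 PCM samples, whatever its bitrate. */
    static final int SAMPLES_PER_FRAME = 1152;

    /**
     * Returns the playing time in milliseconds (rounded down) for the given
     * number of audio frames at the given sample rate. No ID3 data is involved.
     */
    static long durationMillis(long frameCount, int sampleRateHz) {
        if (frameCount < 0 || sampleRateHz <= 0) {
            throw new IllegalArgumentException("frameCount must be >= 0 and sampleRateHz > 0");
        }
        return frameCount * SAMPLES_PER_FRAME * 1000L / sampleRateHz;
    }
}
```

For example, 1225 frames at 44100 Hz hold 1,411,200 samples, which is exactly 32 seconds.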
[jira] [Commented] (TIKA-1589) Mp3 parser does not add duration to metadata if there are no ID3 tags
[ https://issues.apache.org/jira/browse/TIKA-1589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388467#comment-14388467 ]

Nick Burch commented on TIKA-1589:
----------------------------------

Any chance you could create a small mp3 file (probably silent, ideally something like 10-50kb in size) which shows the problem, for which we know the duration? We can then use that for a unit test, to ensure that when we fix it it all stays fixed!
[jira] [Commented] (TIKA-1577) NetCDF Data Extraction
[ https://issues.apache.org/jira/browse/TIKA-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14389901#comment-14389901 ]

Rishi Verma commented on TIKA-1577:
-----------------------------------

Hi Annie, Chris,

That architecture looks good, although I don't know if we'd be able to leverage any code from NCDumpW to help develop TikaParser or ScientificContentHandler.

We might want to give some thought to a CSV type output as well. I think that would have broad applicability for client applications.

NetCDF Data Extraction
----------------------

                Key: TIKA-1577
                URL: https://issues.apache.org/jira/browse/TIKA-1577
            Project: Tika
         Issue Type: Improvement
         Components: handler, parser
   Affects Versions: 1.7
           Reporter: Ann Burgess
           Assignee: Ann Burgess
             Labels: features, handler
            Fix For: 1.8
  Original Estimate: 504h
 Remaining Estimate: 504h

A netCDF classic or 64-bit offset dataset is stored as a single file comprising two parts:
- a header, containing all the information about dimensions, attributes, and variables except for the variable data;
- a data part, comprising fixed-size data, containing the data for variables that don't have an unlimited dimension; and variable-size data, containing the data for variables that have an unlimited dimension.

The NetCDFparser currently extracts the header part.
-- text extracts file Dimensions and Variables
-- metadata extracts Global Attributes

We want the option to extract the data part of NetCDF files.

Let's use the NetCDF test file for our dev testing:
tika/tika-parsers/src/test/resources/test-documents/sresa1b_ncar_ccsm3_0_run1_21.nc
[jira] [Updated] (TIKA-1590) A particular PDF seems to trigger an infinite loop when being converted to HTML
[ https://issues.apache.org/jira/browse/TIKA-1590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt Sheppard updated TIKA-1590:
-------------------------------

Attachment: National_Audit_tool_CTH_Audit_Report_PDF,_292_KB.pdf
            jstack.txt

Attached jstack output and the PDF in case the source is changed before this can be solved.
[jira] [Created] (TIKA-1590) A particular PDF seems to trigger an infinite loop when being converted to HTML
Matt Sheppard created TIKA-1590:
-----------------------------------

            Summary: A particular PDF seems to trigger an infinite loop when being converted to HTML
                Key: TIKA-1590
                URL: https://issues.apache.org/jira/browse/TIKA-1590
            Project: Tika
         Issue Type: Bug
   Affects Versions: 1.7, 1.6
           Reporter: Matt Sheppard

The PDF at http://www.comcare.gov.au/__data/assets/pdf_file/0019/117244/National_Audit_tool_CTH_Audit_Report_PDF,_292_KB.pdf (which I'll also attach) appears to trigger an infinite loop (or at least is exceedingly slow) when being filtered by Tika.

{noformat}
java -jar tika-app-1.7.jar National_Audit_tool_CTH_Audit_Report_PDF,_292_KB.pdf
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="date" content="2015-02-05T04:48:30Z"/>
<meta name="pdf:PDFVersion" content="1.6"/>
<meta name="xmp:CreatorTool" content="Adobe InDesign CC 2014 (Macintosh)"/>
<meta name="dc:description" content="Licensee Improvement"/>
<meta name="Keywords" content="Licensee, Improvement, Program, NAT, CTH, Report&#13;&#10;"/>
<meta name="subject" content="Licensee Improvement"/>
<meta name="dc:creator" content="Comcare"/>
<meta name="description" content="Licensee Improvement"/>
<meta name="dcterms:created" content="2014-10-07T02:46:10Z"/>
<meta name="Last-Modified" content="2015-02-05T04:48:30Z"/>
<meta name="dcterms:modified" content="2015-02-05T04:48:30Z"/>
<meta name="dc:format" content="application/pdf; version=1.6"/>
<meta name="Last-Save-Date" content="2015-02-05T04:48:30Z"/>
<meta name="meta:save-date" content="2015-02-05T04:48:30Z"/>
<meta name="pdf:encrypted" content="false"/>
<meta name="dc:title" content="Licensee Improvement Program NAT (CTH) audit report"/>
<meta name="modified" content="2015-02-05T04:48:30Z"/>
<meta name="cp:subject" content="Licensee Improvement"/>
<meta name="Content-Length" content="299338"/>
<meta name="Content-Type" content="application/pdf"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.pdf.PDFParser"/>
<meta name="creator" content="Comcare"/>
<meta name="meta:author" content="Comcare"/>
<meta name="dc:subject" content="Licensee, Improvement, Program, NAT, CTH, Report&#13;&#10;"/>
<meta name="trapped" content="False"/>
<meta name="meta:creation-date" content="2014-10-07T02:46:10Z"/>
<meta name="created" content="Tue Oct 07 13:46:10 AEDT 2014"/>
<meta name="xmpTPg:NPages" content="72"/>
<meta name="Creation-Date" content="2014-10-07T02:46:10Z"/>
<meta name="resourceName" content="National_Audit_tool_CTH_Audit_Report_PDF,_292_KB.pdf"/>
<meta name="meta:keyword" content="Licensee, Improvement, Program, NAT, CTH, Report&#13;&#10;"/>
<meta name="Author" content="Comcare"/>
<meta name="producer" content="Adobe PDF Library 11.0"/>
<title>Licensee Improvement Program NAT (CTH) audit report</title>
</head>
<body><div class="page"><p/>
<p>LICENSEE IMPROVEMENT PROGRAM
[snip]
</p>
<p>Finding: </p>
<p>Evidence: </p>
<p>Comment: </p>
<p>Observation: </p>
<p>Non-conformance: </p>
<p>
[just appears to hang forever at this point]
{noformat}

The relevant thread's stack is something like...

{noformat}
"main" #1 prio=5 os_prio=31 tid=0x7fbd6900b000 nid=0xf07 runnable [0x00010fc18000]
   java.lang.Thread.State: RUNNABLE
        at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:184)
        at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
        at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
        at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
        at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
        at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
        at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
        at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
        at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
        at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
        at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
        at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
        at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
        at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:179)
        at org.apache.pdfbox.pdmodel.interactive.form.PDFieldFactory.isButton(PDFieldFactory.java:157)
        at org.apache.pdfbox.pdmodel.interactive.form.PDFieldFactory.createField(PDFieldFactory.java:68)
        at org.apache.pdfbox.pdmodel.interactive.form.PDField.getKids(PDField.java:550)
        at org.apache.pdfbox.pdmodel.interactive.form.PDFieldFactory.isButton(PDFieldFactory.java:159)
        at org.apache.pdfbox.pdmodel.interactive.form.PDFieldFactory.isButton(PDFieldFactory.java:178)
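
The trace above shows findFieldType calling itself repeatedly. The report does not establish the cause, but if it were a field hierarchy whose parent references form a cycle (an assumption), a lookup that remembers which nodes it has already visited would terminate instead of hanging. The sketch below only illustrates that guard; FieldNode and FieldTypeLookup are hypothetical names, not PDFBox classes, and this is not the PDFBox implementation or its eventual fix.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

/** Hypothetical stand-in for a form field; not a PDFBox class. */
final class FieldNode {
    final String fieldType;   // null means "inherit the type from the parent"
    FieldNode parent;         // may be null, or (in a malformed file) part of a cycle

    FieldNode(String fieldType) {
        this.fieldType = fieldType;
    }
}

final class FieldTypeLookup {

    /**
     * Walks up the parent chain until a node declares a field type.
     * Returns null if no ancestor declares one or if a cycle is detected.
     */
    static String findFieldType(FieldNode start) {
        // Identity-based set: two distinct nodes never count as the same one.
        Set<FieldNode> visited =
                Collections.newSetFromMap(new IdentityHashMap<FieldNode, Boolean>());
        FieldNode node = start;
        // visited.add returns false once a node is seen twice, which ends the loop.
        while (node != null && visited.add(node)) {
            if (node.fieldType != null) {
                return node.fieldType;
            }
            node = node.parent;
        }
        return null;
    }
}
```

Using a loop with a visited set rather than recursion also avoids deep call stacks on long but legitimate parent chains.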
[jira] [Commented] (TIKA-1589) Mp3 parser does not add duration to metadata if there are no ID3 tags
[ https://issues.apache.org/jira/browse/TIKA-1589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388568#comment-14388568 ]

Hudson commented on TIKA-1589:
------------------------------

SUCCESS: Integrated in tika-trunk-jdk1.7 #591 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/591/])
TIKA-1589 - Patch from Max Daniline to extract MP3 duration from files with no ID3 tags. This closes #38 from github (nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670330)
* /tika/trunk/CHANGES.txt
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/mp3/Mp3Parser.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/mp3/Mp3ParserTest.java
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testMP3noid3.mp3

Mp3 parser does not add duration to metadata if there are no ID3 tags
---------------------------------------------------------------------

                Key: TIKA-1589
                URL: https://issues.apache.org/jira/browse/TIKA-1589
            Project: Tika
         Issue Type: Bug
           Reporter: Max Daniline
            Fix For: 1.8
Re: [DISCUSS] Tika 1.8 or 1.7.1
Also I can run the RC on a subset of ImageCat [1] to test the new RC too when it’s ready.

Cheers,
Chris

[1] https://github.com/chrismattmann/imagecat/

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++

-----Original Message-----
From: Tyler Palsulich <tpalsul...@gmail.com>
Reply-To: "dev@tika.apache.org" <dev@tika.apache.org>
Date: Monday, March 30, 2015 at 3:22 PM
To: "dev@tika.apache.org" <dev@tika.apache.org>
Subject: Re: [DISCUSS] Tika 1.8 or 1.7.1

I just remembered TIKA-1509 and TIKA-1558 -- testing now for blacklist functionality through TIKA-1509. If that works, I'll back out TIKA-1558.

Tim, I think you should run govdocs from the RC, in case something changes between your run and the cut.

Tyler

On Mon, Mar 30, 2015 at 10:17 AM, Allison, Timothy B. <talli...@mitre.org> wrote:

All,
I've made the changes that I had hoped to. Grib pdf exclusion remains for any takers.
Let me know when I should initiate the run against govdocs1 to see if there are any surprises on that corpus with Tika 1.8.

Best,
Tim

-----Original Message-----
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Monday, March 30, 2015 7:03 AM
To: dev@tika.apache.org
Subject: RE: [DISCUSS] Tika 1.8 or 1.7.1

Unless there are objections, I'd like these to be resolved before 1.8:

TIKA-1584 -- I'll fix
TIKA-1575 -- Resolved by Konstantin Gribov (thank you!)
TIKA-1512 -- I'll put in a temporary fix so that we don't get IOOBEs, but I'll leave this open and do some more digging to see if we need to open a ticket at the POI level
TIKA-1511 -- I'll remove provided for xerial
TIKA-1549 -- We should thank Toke Eskildsen in CHANGES.txt, no?

I'll have these fixes completed by noon EDT.

Should I run against govdocs1 before or after the RC?

My last build of Tika app (a few days ago) ballooned to ~43MB, and that's before I add ~3MB for xerial. Tika server is now ~48MB.

As of my last build, we are still including ~4MB of pdfs (README.NLDAS1.pdf and README.NLDAS2.pdf) from the GRIB(?) parser in the tika-app and tika-server jars.

Best,
Tim

-----Original Message-----
From: Tyler Palsulich [mailto:tpalsul...@gmail.com]
Sent: Sunday, March 29, 2015 9:13 AM
To: dev@tika.apache.org
Subject: Re: [DISCUSS] Tika 1.8 or 1.7.1

Once TIKA-1584 and TIKA-1575 are resolved, I'll work up an RC (unless something else pops up). Thank you everyone.

Tyler

On Mar 29, 2015 4:43 AM, Hong-Thai Nguyen <thaicha...@gmail.com> wrote:

+1 for 1.8

Hong-Thai

On 28 Mar 2015, at 16:01, Tyler Palsulich <tpalsul...@apache.org> wrote:

Hi Folks,

Now that TIKA-1581 (JHighlight licensing issues) is resolved, we need to release a new version of Tika. I'll volunteer to be the release manager again.

Should we release this as 1.8 or 1.7.1? Does anyone have any last minute issues they'd like to finish and see in Tika 1.X? I'd like to get the example working with CORS (TIKA-1585 and TIKA-1586). Any others?

Have a good weekend,
Tyler
Re: Broken build because of clirr plugin
No worries Konstantin, thank you! Thanks Tim!

Chris Mattmann, Ph.D.

-----Original Message-----
From: Konstantin Gribov <gros...@gmail.com>
Reply-To: "dev@tika.apache.org" <dev@tika.apache.org>
Date: Monday, March 30, 2015 at 8:18 AM
To: "dev@tika.apache.org" <dev@tika.apache.org>
Subject: Re: Broken build because of clirr plugin

I think the simple way would be to keep the old methods (and mark them @Deprecated) to avoid the build failure, and use the new ones internally.

I'll do `mvn verify` before committing this time. Sorry for the inconvenience.

--
Best regards,
Konstantin Gribov

Mon, 30 March 2015 at 18:09, Allison, Timothy B. <talli...@mitre.org>:

How much of an effort would it be to migrate somewhat slowly:

Leave in but deprecate setCommandLine(String ) and String getCommandLine()

Add something like: setCommandLineArr(String[] ) and String[] getCommandLineArr()?

-----Original Message-----
From: Konstantin Gribov [mailto:gros...@gmail.com]
Sent: Monday, March 30, 2015 11:00 AM
To: dev@tika.apache.org
Subject: Broken build because of clirr plugin

Hi, folks.

I've broken the build (by commit r1670105 for TIKA-1587). Should I revert this commit and change it to preserve the old API, or add an exclude to the clirr plugin configuration?

--
Best regards,
Konstantin Gribov
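
The migration discussed in this thread (keep the old String accessors but mark them @Deprecated, add array-based ones, and use the new ones internally) could look roughly like the sketch below. The class name CommandConfigSketch is hypothetical and this is not the code committed for TIKA-1587; the whitespace splitting in the deprecated setter is also an assumption made only for illustration.

```java
/** Hypothetical holder class; only the four accessor names come from the thread. */
class CommandConfigSketch {

    private String[] commandLine = new String[0];

    /** Old API, kept so that binary-compatibility checks such as clirr still pass. */
    @Deprecated
    public void setCommandLine(String commandLine) {
        String trimmed = commandLine.trim();
        // Assumption: arguments are separated by whitespace and need no quoting.
        setCommandLineArr(trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+"));
    }

    /** Old API: joins the arguments back into a single string. */
    @Deprecated
    public String getCommandLine() {
        return String.join(" ", commandLine);
    }

    /** New API: the array form is what the class stores internally. */
    public void setCommandLineArr(String[] commandLine) {
        this.commandLine = commandLine.clone();
    }

    public String[] getCommandLineArr() {
        return commandLine.clone();
    }
}
```

Because the deprecated methods delegate to the new ones, existing callers keep compiling while new code can move to the array form at its own pace.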
Re: [DISCUSS] Tika 1.8 or 1.7.1
+1 to running tika-batch and govdocs. Woot.

Chris Mattmann, Ph.D.
Re: including refactored docs from govdocs1 in test suite
+1 to including the modified docs.

Chris Mattmann, Ph.D.

-----Original Message-----
From: Allison, Timothy B. <talli...@mitre.org>
Reply-To: "dev@tika.apache.org" <dev@tika.apache.org>
Date: Monday, March 30, 2015 at 6:51 AM
To: "dev@tika.apache.org" <dev@tika.apache.org>
Subject: RE: including refactored docs from govdocs1 in test suite

I think this is an open question within Tika. Some parsers prefer one thing over another. And there are different levels of corruption. In the two cases where govdocs1 docs might be useful in tests, the hyperlinks in .doc files do not appear to be standard, but MSWord opens them without a problem. In cases where an application can open and correctly process the content, I think we ought to try to extract content without throwing exceptions.

-----Original Message-----
From: Tyler Palsulich [mailto:tpalsul...@gmail.com]
Sent: Monday, March 30, 2015 9:39 AM
To: dev@tika.apache.org
Subject: RE: including refactored docs from govdocs1 in test suite

Ah. I see. In general, what is the goal with handling corrupted files? Extract as much as possible and fail gracefully?

Tyler

On Mar 30, 2015 9:32 AM, Allison, Timothy B. <talli...@mitre.org> wrote:

Unfortunately, no. MSOffice fixes the document when I do that.

-----Original Message-----
From: Tyler Palsulich [mailto:tpalsul...@gmail.com]
Sent: Monday, March 30, 2015 9:24 AM
To: dev@tika.apache.org
Subject: Re: including refactored docs from govdocs1 in test suite

Can you copy the hyperlink into a new doc and change the URL? I have no idea about including the modified version.

Tyler

On Mar 30, 2015 9:18 AM, Allison, Timothy B. <talli...@mitre.org> wrote:

All,

As part of TIKA-1512, I found that I can delete all of the contents, including the metadata, except for one hyperlink in two documents from govdocs1 and still get the proper behavior -- fail before fix, work after fix.

These documents are in the public domain. Is it ok to include these modified documents in our test suite or should I avoid inclusion?

Happy to avoid inclusion for the sake of a quick release of 1.8 and then we have time to discuss/determine way ahead... unless the answer is obvious.

Best,
Tim
Re: svn commit: r1670135 - /tika/trunk/CHANGES.txt
Thanks Ken! :)

Chris Mattmann, Ph.D.

-----Original Message-----
From: "thaicha...@apache.org" <thaicha...@apache.org>
Reply-To: "dev@tika.apache.org" <dev@tika.apache.org>
Date: Monday, March 30, 2015 at 9:05 AM
To: "comm...@tika.apache.org" <comm...@tika.apache.org>
Subject: svn commit: r1670135 - /tika/trunk/CHANGES.txt

Author: thaichat04
Date: Mon Mar 30 16:05:17 2015
New Revision: 1670135

URL: http://svn.apache.org/r1670135
Log:
TIKA-1581 - Mention @kkrugler thanks in CHANGES.txt

Modified:
    tika/trunk/CHANGES.txt

Modified: tika/trunk/CHANGES.txt
URL: http://svn.apache.org/viewvc/tika/trunk/CHANGES.txt?rev=1670135&r1=1670134&r2=1670135&view=diff
==============================================================================
--- tika/trunk/CHANGES.txt (original)
+++ tika/trunk/CHANGES.txt Mon Mar 30 16:05:17 2015
@@ -8,7 +8,8 @@ Release 1.8 - Current Development
   * Tika server can now enable CORS requests with the command line
     --cors option (TIKA-1586).
 
-  * Update jhighlight dependency to avoid using LGPL license (TIKA-1581)
+  * Update jhighlight dependency to avoid using LGPL license (TIKA-1581).
+    Thank @kkrugler for his great contribution
 
   * Updated HDF and NetCDF parsers to output file version in metadata
     (TIKA-1578 and TIKA-1579).
[GitHub] tika pull request: Refactor TIKA-1558. Remove service loading blac...
GitHub user tpalsulich opened a pull request: https://github.com/apache/tika/pull/39 Refactor TIKA-1558. Remove service loading blacklist * Remove all direct service loading logic regarding a blacklist. * Small changes to CompositeParser logic to make sure subclasses of excluded Parsers are also excluded. * Added new testing in the tika-core module to test regular and subclass exclusion. @Gagravarr, can you look this over? You can merge this pull request into a Git repository by running: $ git pull https://github.com/tpalsulich/tika TIKA-1558 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/39.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #39 commit 7e38e3cdef3f5ae11d45863c67c6216561802a32 Author: Tyler Palsulich tpalsul...@gmail.com Date: 2015-03-31T17:05:19Z Refactor TIKA-1558. Remove service loading blacklist and ensure subclasses are also excluded.
[GitHub] tika pull request: fix for TIKA-1589 contributed by mdaniline
Github user asfgit closed the pull request at: https://github.com/apache/tika/pull/38
[jira] [Commented] (TIKA-1589) Mp3 parser does not add duration to metadata if there are no ID3 tags
[ https://issues.apache.org/jira/browse/TIKA-1589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388505#comment-14388505 ] Nick Burch commented on TIKA-1589: -- Applied with small tweaks in r1670330. (You seem to have slightly different import-formatting rules to everyone else, might be worth double checking that before you next patch) Thanks!
[jira] [Commented] (TIKA-1589) Mp3 parser does not add duration to metadata if there are no ID3 tags
[ https://issues.apache.org/jira/browse/TIKA-1589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388504#comment-14388504 ] ASF GitHub Bot commented on TIKA-1589: -- Github user asfgit closed the pull request at: https://github.com/apache/tika/pull/38
[jira] [Resolved] (TIKA-1589) Mp3 parser does not add duration to metadata if there are no ID3 tags
[ https://issues.apache.org/jira/browse/TIKA-1589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch resolved TIKA-1589. -- Resolution: Fixed Fix Version/s: 1.8
[jira] [Commented] (TIKA-1558) Create a Parser Blacklist
[ https://issues.apache.org/jira/browse/TIKA-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14389656#comment-14389656 ]

Hudson commented on TIKA-1558:
--
SUCCESS: Integrated in tika-trunk-jdk1.7 #592 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/592/])

TIKA-1558. Better error message and fix typo. (tpalsulich: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670490)
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/config/TikaParserConfigTest.java

TIKA-1558. Refactor Parser blacklisting. (tpalsulich: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670487)
* /tika/trunk/CHANGES.txt
* /tika/trunk/tika-core/src/main/java/org/apache/tika/config/ServiceLoader.java
* /tika/trunk/tika-core/src/main/java/org/apache/tika/parser/CompositeParser.java
* /tika/trunk/tika-core/src/test/java/org/apache/tika/parser/BlacklistedParser.java
* /tika/trunk/tika-core/src/test/java/org/apache/tika/parser/BlacklistedParserSubclass.java
* /tika/trunk/tika-core/src/test/java/org/apache/tika/parser/BlacklistedParserTest.java
* /tika/trunk/tika-core/src/test/resources/META-INF
* /tika/trunk/tika-core/src/test/resources/org/apache/tika/parser/blacklist2_file.blacklist2
* /tika/trunk/tika-core/src/test/resources/org/apache/tika/parser/blacklist_file.blacklist
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/config/TikaParserConfigTest.java
* /tika/trunk/tika-parsers/src/test/resources/org/apache/tika/config/TIKA-1558-blacklistsub.xml

Create a Parser Blacklist
-
Key: TIKA-1558
URL: https://issues.apache.org/jira/browse/TIKA-1558
Project: Tika
Issue Type: New Feature
Reporter: Tyler Palsulich
Assignee: Tyler Palsulich
Fix For: 1.8

As talked about in TIKA-1555 and TIKA-1557, it would be nice to be able to disable Parsers without pulling their dependencies out. In some cases (e.g. disable all ExternalParsers), there may not be an easy way to exclude the dependencies via Maven.

-So, an initial design would be to include another file like {{META-INF/services/org.apache.tika.parser.Parser.blacklist}}. We create a new method {{ServiceLoader#loadServiceProviderBlacklist}}. Then, in {{ServiceLoader#loadServiceProviders}}, we remove all elements of the list that are assignable to an element in {{ServiceLoader#loadServiceProviderBlacklist}}.-
[jira] [Updated] (TIKA-1558) Create a Parser Blacklist
[ https://issues.apache.org/jira/browse/TIKA-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich updated TIKA-1558: -- Description: As talked about in TIKA-1555 and TIKA-1557, it would be nice to be able to disable Parsers without pulling their dependencies out. In some cases (e.g. disable all ExternalParsers), there may not be an easy way to exclude the dependencies via Maven. -So, an initial design would be to include another file like {{META-INF/services/org.apache.tika.parser.Parser.blacklist}}. We create a new method {{ServiceLoader#loadServiceProviderBlacklist}}. Then, in {{ServiceLoader#loadServiceProviders}}, we remove all elements of the list that are assignable to an element in {{ServiceLoader#loadServiceProviderBlacklist}}.- was: As talked about in TIKA-1555 and TIKA-1557, it would be nice to be able to disable Parsers without pulling their dependencies out. In some cases (e.g. disable all ExternalParsers), there may not be an easy way to exclude the dependencies via Maven. So, an initial design would be to include another file like {{META-INF/services/org.apache.tika.parser.Parser.blacklist}}. We create a new method {{ServiceLoader#loadServiceProviderBlacklist}}. Then, in {{ServiceLoader#loadServiceProviders}}, we remove all elements of the list that are assignable to an element in {{ServiceLoader#loadServiceProviderBlacklist}}.
[jira] [Comment Edited] (TIKA-1558) Create a Parser Blacklist
[ https://issues.apache.org/jira/browse/TIKA-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1432#comment-1432 ] Tyler Palsulich edited comment on TIKA-1558 at 3/31/15 9:41 PM: -Above strategy added in r1661284. You can now blacklist Parsers by adding names to {{META-INF/services/org.apache.tika.parser.Parser.blacklist}} with the same format as the normal services file. If a class is blacklisted, all of its subclasses are automatically blacklisted.- Edit: Service loading blacklisting disabled in r1670487. Use a custom TikaConfig like [this one|https://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/test/resources/org/apache/tika/config/TIKA-1558-blacklistsub.xml] to disable a Parser. Any subclasses of that Parser will also be excluded. was (Author: tpalsulich): Above strategy added in r1661284. You can now blacklist Parsers by adding names to {{META-INF/services/org.apache.tika.parser.Parser.blacklist}} with the same format as the normal services file. If a class is blacklisted, all of its subclasses are automatically blacklisted.
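The rule discussed in this thread (a Parser is dropped when its class is an excluded class or a subclass of one) can be sketched in plain Java. This is only an illustration under stated assumptions, not Tika's actual ServiceLoader or CompositeParser code: the class ExclusionSketch, the method filter, and the stand-in parser classes are hypothetical names invented here, and the only real API relied on is java.lang.Class#isAssignableFrom.

```java
import java.util.ArrayList;
import java.util.List;

public class ExclusionSketch {
    // Hypothetical stand-ins for parser classes; these are not Tika types.
    static class BaseParser {}
    static class ExcludedParser extends BaseParser {}
    static class ExcludedParserSubclass extends ExcludedParser {}
    static class OtherParser extends BaseParser {}

    /**
     * Returns the candidates whose class is neither an excluded class
     * nor a subclass of an excluded class.
     */
    static List<Object> filter(List<?> candidates, List<Class<?>> excluded) {
        List<Object> kept = new ArrayList<>();
        for (Object candidate : candidates) {
            boolean drop = false;
            for (Class<?> ex : excluded) {
                // True for the excluded class itself and for any subclass of it.
                if (ex.isAssignableFrom(candidate.getClass())) {
                    drop = true;
                    break;
                }
            }
            if (!drop) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<BaseParser> loaded = new ArrayList<>();
        loaded.add(new ExcludedParser());
        loaded.add(new ExcludedParserSubclass());
        loaded.add(new OtherParser());

        List<Class<?>> excluded = new ArrayList<>();
        excluded.add(ExcludedParser.class);

        for (Object parser : filter(loaded, excluded)) {
            System.out.println(parser.getClass().getSimpleName());
        }
    }
}
```

Run as written, main prints only OtherParser, since ExcludedParser and ExcludedParserSubclass are both assignable to the excluded class. Matching on assignability rather than exact class name is what makes the subclass case work without listing every subclass.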
[jira] [Commented] (TIKA-1558) Create a Parser Blacklist
[ https://issues.apache.org/jira/browse/TIKA-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14389464#comment-14389464 ] ASF GitHub Bot commented on TIKA-1558: -- Github user tpalsulich closed the pull request at: https://github.com/apache/tika/pull/39