[GitHub] tika pull request: fix for TIKA-1589 contributed by mdaniline

2015-03-31 Thread mdaniline
GitHub user mdaniline opened a pull request:

https://github.com/apache/tika/pull/38

fix for TIKA-1589 contributed by mdaniline

https://issues.apache.org/jira/browse/TIKA-1589

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mdaniline/tika TIKA-1589

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/38.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #38


commit fb29412710ea058f89d3c6df5078587768dcac74
Author: Max Daniline <maxim.danil...@softwire.com>
Date:   2015-03-31T12:49:43Z

fix for TIKA-1589 contributed by mdaniline




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (TIKA-1589) Mp3 parser does not add duration to metadata if there are no ID3 tags

2015-03-31 Thread ASF GitHub Bot (JIRA)

[ https://issues.apache.org/jira/browse/TIKA-1589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388478#comment-14388478 ]

ASF GitHub Bot commented on TIKA-1589:
--

GitHub user mdaniline opened a pull request:

https://github.com/apache/tika/pull/38

fix for TIKA-1589 contributed by mdaniline

https://issues.apache.org/jira/browse/TIKA-1589

 Mp3 parser does not add duration to metadata if there are no ID3 tags
 -

 Key: TIKA-1589
 URL: https://issues.apache.org/jira/browse/TIKA-1589
 Project: Tika
  Issue Type: Bug
Reporter: Max Daniline

 Steps to reproduce:
 * Have a file without any ID3 tags (v1 or v2)
 * Parse the file
 * Attempt to retrieve the duration by calling 'metadata.get(XMPDM.DURATION)'.
 Expected result:
 The duration should be set even for a file without ID3 tags, since it is 
 independent information.
 Actual result:
 The duration is null



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1589) Mp3 parser does not add duration to metadata if there are no ID3 tags

2015-03-31 Thread Max Daniline (JIRA)

[ https://issues.apache.org/jira/browse/TIKA-1589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388496#comment-14388496 ]

Max Daniline commented on TIKA-1589:


I've raised a PR to fix this: https://github.com/apache/tika/pull/38



[jira] [Issue Comment Deleted] (TIKA-1589) Mp3 parser does not add duration to metadata if there are no ID3 tags

2015-03-31 Thread Max Daniline (JIRA)

[ https://issues.apache.org/jira/browse/TIKA-1589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Max Daniline updated TIKA-1589:
---
Comment: was deleted

(was: I've raised a PR to fix this: https://github.com/apache/tika/pull/38)



[jira] [Created] (TIKA-1589) Mp3 parser does not add duration to metadata if there are no ID3 tags

2015-03-31 Thread Max Daniline (JIRA)
Max Daniline created TIKA-1589:
--

 Summary: Mp3 parser does not add duration to metadata if there are no ID3 tags
 Key: TIKA-1589
 URL: https://issues.apache.org/jira/browse/TIKA-1589
 Project: Tika
  Issue Type: Bug
Reporter: Max Daniline


Steps to reproduce:

* Have a file without any ID3 tags (v1 or v2)
* Parse the file
* Attempt to retrieve the duration by calling 'metadata.get(XMPDM.DURATION)'.

Expected result:
The duration should be set even for a file without ID3 tags, since it is 
independent information.

Actual result:
The duration is null
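
For illustration only: the duration of an MP3 stream can be estimated from its MPEG audio frame headers alone, which is why it does not depend on ID3 tags. The sketch below is not Tika's Mp3Parser code. The class name Mp3DurationSketch is hypothetical, and the code assumes MPEG-1 Layer III frames and an input with no tags; it ignores MPEG-2/2.5 streams, Xing/VBRI headers, and false sync matches inside audio data.

```java
/** Hypothetical sketch, not Tika's Mp3Parser: estimates duration from frame headers alone. */
class Mp3DurationSketch {
    // MPEG-1 Layer III bitrates in kbit/s, indexed by the 4-bit bitrate field
    // (index 0 means "free format" and index 15 is invalid; both are skipped here).
    private static final int[] BITRATES_KBPS =
            {0, 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320};
    // MPEG-1 sample rates in Hz, indexed by the 2-bit sample-rate field (3 is reserved).
    private static final int[] SAMPLE_RATES_HZ = {44100, 48000, 32000};
    // Every MPEG-1 Layer III frame carries 1152 samples.
    private static final int SAMPLES_PER_FRAME = 1152;

    static double durationSeconds(byte[] data) {
        double seconds = 0.0;
        int pos = 0;
        while (pos + 4 <= data.length) {
            int b0 = data[pos] & 0xFF;
            int b1 = data[pos + 1] & 0xFF;
            int b2 = data[pos + 2] & 0xFF;
            // 11 sync bits set, then version bits 11 (MPEG-1) and layer bits 01 (Layer III).
            boolean sync = b0 == 0xFF && (b1 & 0xE0) == 0xE0;
            boolean mpeg1Layer3 = (b1 & 0x1E) == 0x1A;
            int bitrateIndex = b2 >> 4;
            int sampleRateIndex = (b2 >> 2) & 0x03;
            if (!sync || !mpeg1Layer3 || bitrateIndex == 0 || bitrateIndex == 15
                    || sampleRateIndex == 3) {
                pos++; // not a usable frame header, keep scanning byte by byte
                continue;
            }
            int sampleRate = SAMPLE_RATES_HZ[sampleRateIndex];
            int padding = (b2 >> 1) & 0x01;
            // Frame length in bytes for Layer III: 144 * bitrate / sampleRate + padding.
            int frameLength = 144000 * BITRATES_KBPS[bitrateIndex] / sampleRate + padding;
            seconds += (double) SAMPLES_PER_FRAME / sampleRate;
            pos += frameLength; // jump to where the next frame header should start
        }
        return seconds;
    }
}
```

Summing per-frame durations rather than dividing file size by bitrate keeps the estimate usable for variable-bitrate streams, at the cost of walking every frame.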





[jira] [Commented] (TIKA-1589) Mp3 parser does not add duration to metadata if there are no ID3 tags

2015-03-31 Thread Nick Burch (JIRA)

[ https://issues.apache.org/jira/browse/TIKA-1589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388467#comment-14388467 ]

Nick Burch commented on TIKA-1589:
--

Any chance you could create a small mp3 file (probably silent, ideally 
something like 10-50kb in size) which shows the problem, for which we know the 
duration? We can then use that for a unit test, to ensure that when we fix it 
it all stays fixed!



[jira] [Commented] (TIKA-1577) NetCDF Data Extraction

2015-03-31 Thread Rishi Verma (JIRA)

[ https://issues.apache.org/jira/browse/TIKA-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14389901#comment-14389901 ]

Rishi Verma commented on TIKA-1577:
---

Hi Annie, Chris,

That architecture looks good, although I don't know if we'd be able to leverage 
any code from NCDumpW to help develop TikaParser or ScientificContentHandler.

We might want to give some thought to a CSV type output as well. I think that 
would have broad applicability for client applications.
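
One possible shape for such a CSV output is sketched here, with a hypothetical in-memory 2-D array standing in for a variable's data part. No NetCDF library calls are used, and the names CsvSketch, toCsv, lat, lon and tas are illustrative assumptions rather than anything proposed in this thread. Each cell becomes one row: the dimension indices followed by the value.

```java
/** Hypothetical sketch of a CSV-style dump for one two-dimensional variable. */
class CsvSketch {
    // dimNames must hold exactly two dimension names; values[i][j] is the cell at (i, j).
    static String toCsv(String[] dimNames, String varName, double[][] values) {
        StringBuilder sb = new StringBuilder();
        // Header row: one column per dimension, then the variable itself.
        sb.append(dimNames[0]).append(',')
          .append(dimNames[1]).append(',')
          .append(varName).append('\n');
        for (int i = 0; i < values.length; i++) {
            for (int j = 0; j < values[i].length; j++) {
                sb.append(i).append(',').append(j).append(',')
                  .append(values[i][j]).append('\n');
            }
        }
        return sb.toString();
    }
}
```

A long ("tidy") layout like this grows with the number of cells, but it keeps the column count fixed regardless of dimension sizes, which tends to be easier for client applications to consume.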



 NetCDF Data Extraction
 --

 Key: TIKA-1577
 URL: https://issues.apache.org/jira/browse/TIKA-1577
 Project: Tika
  Issue Type: Improvement
  Components: handler, parser
Affects Versions: 1.7
Reporter: Ann Burgess
Assignee: Ann Burgess
  Labels: features, handler
 Fix For: 1.8

   Original Estimate: 504h
  Remaining Estimate: 504h

 A netCDF classic or 64-bit offset dataset is stored as a single file 
 comprising two parts:
  - a header, containing all the information about dimensions, attributes, and 
 variables except for the variable data;
  - a data part, comprising fixed-size data, containing the data for variables 
 that don't have an unlimited dimension; and variable-size data, containing 
 the data for variables that have an unlimited dimension.
 The NetCDFparser currently extracts the header part.  
  -- text extracts file Dimensions and Variables
  -- metadata extracts Global Attributes
 We want the option to extract the data part of NetCDF files.  
 Let's use the NetCDF test file for our dev testing:  
 tika/tika-parsers/src/test/resources/test-documents/sresa1b_ncar_ccsm3_0_run1_21.nc
  





[jira] [Updated] (TIKA-1590) A particular PDF seems to trigger an infinite loop when being converted to HTML

2015-03-31 Thread Matt Sheppard (JIRA)

[ https://issues.apache.org/jira/browse/TIKA-1590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt Sheppard updated TIKA-1590:

Attachment: National_Audit_tool_CTH_Audit_Report_PDF,_292_KB.pdf
jstack.txt

Attached jstack output and the PDF in case the source is changed before this 
can be solved.


[jira] [Created] (TIKA-1590) A particular PDF seems to trigger an infinite loop when being converted to HTML

2015-03-31 Thread Matt Sheppard (JIRA)
Matt Sheppard created TIKA-1590:
---

 Summary: A particular PDF seems to trigger an infinite loop when being converted to HTML
 Key: TIKA-1590
 URL: https://issues.apache.org/jira/browse/TIKA-1590
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.7, 1.6
Reporter: Matt Sheppard


The PDF at 
http://www.comcare.gov.au/__data/assets/pdf_file/0019/117244/National_Audit_tool_CTH_Audit_Report_PDF,_292_KB.pdf
 (which I'll also attach) appears to trigger an infinite loop (or at least is 
exceedingly slow) when being filtered by Tika.

{noformat}
java -jar tika-app-1.7.jar National_Audit_tool_CTH_Audit_Report_PDF,_292_KB.pdf
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="date" content="2015-02-05T04:48:30Z"/>
<meta name="pdf:PDFVersion" content="1.6"/>
<meta name="xmp:CreatorTool" content="Adobe InDesign CC 2014 (Macintosh)"/>
<meta name="dc:description" content="Licensee Improvement"/>
<meta name="Keywords" content="Licensee, Improvement, Program, NAT, CTH, Report&#13;&#10;"/>
<meta name="subject" content="Licensee Improvement"/>
<meta name="dc:creator" content="Comcare"/>
<meta name="description" content="Licensee Improvement"/>
<meta name="dcterms:created" content="2014-10-07T02:46:10Z"/>
<meta name="Last-Modified" content="2015-02-05T04:48:30Z"/>
<meta name="dcterms:modified" content="2015-02-05T04:48:30Z"/>
<meta name="dc:format" content="application/pdf; version=1.6"/>
<meta name="Last-Save-Date" content="2015-02-05T04:48:30Z"/>
<meta name="meta:save-date" content="2015-02-05T04:48:30Z"/>
<meta name="pdf:encrypted" content="false"/>
<meta name="dc:title" content="Licensee Improvement Program NAT (CTH) audit report"/>
<meta name="modified" content="2015-02-05T04:48:30Z"/>
<meta name="cp:subject" content="Licensee Improvement"/>
<meta name="Content-Length" content="299338"/>
<meta name="Content-Type" content="application/pdf"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.pdf.PDFParser"/>
<meta name="creator" content="Comcare"/>
<meta name="meta:author" content="Comcare"/>
<meta name="dc:subject" content="Licensee, Improvement, Program, NAT, CTH, Report&#13;&#10;"/>
<meta name="trapped" content="False"/>
<meta name="meta:creation-date" content="2014-10-07T02:46:10Z"/>
<meta name="created" content="Tue Oct 07 13:46:10 AEDT 2014"/>
<meta name="xmpTPg:NPages" content="72"/>
<meta name="Creation-Date" content="2014-10-07T02:46:10Z"/>
<meta name="resourceName" content="National_Audit_tool_CTH_Audit_Report_PDF,_292_KB.pdf"/>
<meta name="meta:keyword" content="Licensee, Improvement, Program, NAT, CTH, Report&#13;&#10;"/>
<meta name="Author" content="Comcare"/>
<meta name="producer" content="Adobe PDF Library 11.0"/>
<title>Licensee Improvement Program NAT (CTH) audit report</title>
</head>
<body><div class="page"><p/>
<p>LICENSEE
IMPROVEMENT
PROGRAM

[snip]

</p>
<p>Finding:
</p>
<p>Evidence:
</p>
<p>Comment:
</p>
<p>Observation:
</p>
<p>Non-conformance:
</p>
<p>

[just appears to hang forever at this point]
{noformat}


The relevant thread's stack is something like...

{noformat}
"main" #1 prio=5 os_prio=31 tid=0x7fbd6900b000 nid=0xf07 runnable [0x00010fc18000]
   java.lang.Thread.State: RUNNABLE
        at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:184)
        at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
        at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
        at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
        at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
        at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
        at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
        at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
        at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
        at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
        at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
        at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
        at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:190)
        at org.apache.pdfbox.pdmodel.interactive.form.PDField.findFieldType(PDField.java:179)
        at org.apache.pdfbox.pdmodel.interactive.form.PDFieldFactory.isButton(PDFieldFactory.java:157)
        at org.apache.pdfbox.pdmodel.interactive.form.PDFieldFactory.createField(PDFieldFactory.java:68)
        at org.apache.pdfbox.pdmodel.interactive.form.PDField.getKids(PDField.java:550)
        at org.apache.pdfbox.pdmodel.interactive.form.PDFieldFactory.isButton(PDFieldFactory.java:159)
        at org.apache.pdfbox.pdmodel.interactive.form.PDFieldFactory.isButton(PDFieldFactory.java:178)
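
The repeated findFieldType frames show a lookup that keeps climbing from a field to its parent. The report does not establish the cause of the hang; if it were a cyclic parent reference in the form-field tree (an assumption), a visited set would be one common way to make such a walk terminate. The sketch below uses hypothetical stand-in classes, FieldNode and FieldTypeResolver, and is not PDFBox's PDField code.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

/** Hypothetical stand-in for a form field whose type may be inherited from its parent. */
class FieldNode {
    final String name;
    final String fieldType; // null means "inherit from parent"
    FieldNode parent;       // null for a root field

    FieldNode(String name, String fieldType, FieldNode parent) {
        this.name = name;
        this.fieldType = fieldType;
        this.parent = parent;
    }
}

class FieldTypeResolver {
    /** Returns the nearest declared field type, or null if none is found or a cycle is hit. */
    static String findFieldType(FieldNode field) {
        // Identity-based set: two distinct nodes are never confused, even if they look alike.
        Set<FieldNode> seen =
                Collections.newSetFromMap(new IdentityHashMap<FieldNode, Boolean>());
        for (FieldNode cur = field; cur != null; cur = cur.parent) {
            if (!seen.add(cur)) {
                return null; // already visited: the parent chain loops back on itself
            }
            if (cur.fieldType != null) {
                return cur.fieldType;
            }
        }
        return null; // reached a root without finding a type
    }
}
```

Walking iteratively instead of recursively also avoids a StackOverflowError on very deep, but acyclic, field hierarchies.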

[jira] [Commented] (TIKA-1589) Mp3 parser does not add duration to metadata if there are no ID3 tags

2015-03-31 Thread Hudson (JIRA)

[ https://issues.apache.org/jira/browse/TIKA-1589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388568#comment-14388568 ]

Hudson commented on TIKA-1589:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #591 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/591/])
TIKA-1589 - Patch from Max Daniline to extract MP3 duration from files with no 
ID3 tags. This closes #38 from github (nick: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670330)
* /tika/trunk/CHANGES.txt
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/mp3/Mp3Parser.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/mp3/Mp3ParserTest.java
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testMP3noid3.mp3


 Mp3 parser does not add duration to metadata if there are no ID3 tags
 -

 Key: TIKA-1589
 URL: https://issues.apache.org/jira/browse/TIKA-1589
 Project: Tika
  Issue Type: Bug
Reporter: Max Daniline
 Fix For: 1.8




Re: [DISCUSS] Tika 1.8 or 1.7.1

2015-03-31 Thread Mattmann, Chris A (3980)
Also I can run the RC on a subset of ImageCat [1] to test the
new RC too when it’s ready.

Cheers,
Chris

[1] https://github.com/chrismattmann/imagecat/


++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-----Original Message-----
From: Tyler Palsulich <tpalsul...@gmail.com>
Reply-To: "dev@tika.apache.org" <dev@tika.apache.org>
Date: Monday, March 30, 2015 at 3:22 PM
To: "dev@tika.apache.org" <dev@tika.apache.org>
Subject: Re: [DISCUSS] Tika 1.8 or 1.7.1

I just remembered TIKA-1509 and TIKA-1558 -- testing now for blacklist
functionality through TIKA-1509. If that works, I'll back out TIKA-1558.

Tim, I think you should run govdocs from the RC, in case something changes
between your run and the cut.

Tyler

On Mon, Mar 30, 2015 at 10:17 AM, Allison, Timothy B. <talli...@mitre.org> wrote:

 All,

 I've made the changes that I had hoped to.  Grib pdf exclusion remains for
 any takers.

 Let me know when I should initiate the run against govdocs1 to see if
 there are any surprises on that corpus with Tika 1.8.

 Best,

 Tim

 -----Original Message-----
 From: Allison, Timothy B. [mailto:talli...@mitre.org]
 Sent: Monday, March 30, 2015 7:03 AM
 To: dev@tika.apache.org
 Subject: RE: [DISCUSS] Tika 1.8 or 1.7.1

 Unless there are objections, I'd like these to be resolved before 1.8:

 TIKA-1584 -- I'll fix
 TIKA-1575 -- Resolved by Konstantin Gribov (thank you!)
 TIKA-1512 -- I'll put in a temporary fix so that we don't get IOOBEs, but
 I'll leave this open and do some more digging to see if we need to open a
 ticket at the POI level
 TIKA-1511 -- I'll remove provided for xerial

 TIKA-1549 -- We should thank Toke Eskildsen in CHANGES.txt, no?

 I'll have these fixes completed by noon EDT.  Should I run against
 govdocs1 before or after the RC?

 My last build of Tika app (a few days ago) ballooned to ~43MB, and that's
 before I add ~3MB for xerial.  Tika server is now ~48MB.  As of my last
 build, we are still including ~4MB of pdfs (README.NLDAS1.pdf and
 README.NLDAS2.pdf) from the GRIB(?) parser in the tika-app and tika-server
 jars.

 Best,

   Tim

 -----Original Message-----
 From: Tyler Palsulich [mailto:tpalsul...@gmail.com]
 Sent: Sunday, March 29, 2015 9:13 AM
 To: dev@tika.apache.org
 Subject: Re: [DISCUSS] Tika 1.8 or 1.7.1

 Once TIKA-1584 and TIKA-1575 are resolved, I'll work up an RC (unless
 something else pops up).

 Thank you everyone.

 Tyler
 On Mar 29, 2015 4:43 AM, Hong-Thai Nguyen <thaicha...@gmail.com> wrote:

  +1 for 1.8

  Hong-Thai

   On 28 Mar 2015, at 16:01, Tyler Palsulich <tpalsul...@apache.org> wrote:

   Hi Folks,

   Now that TIKA-1581 (JHighlight licensing issues) is resolved, we need to
   release a new version of Tika. I'll volunteer to be the release manager
   again.

   Should we release this as 1.8 or 1.7.1?

   Does anyone have any last minute issues they'd like to finish and see in
   Tika 1.X? I'd like to get the example working with CORS (TIKA-1585 and
   TIKA-1586). Any others?

   Have a good weekend,
   Tyler




Re: Broken build because of clirr plugin

2015-03-31 Thread Mattmann, Chris A (3980)
No worries Konstantin thank you! Thanks Tim!


-----Original Message-----
From: Konstantin Gribov <gros...@gmail.com>
Reply-To: "dev@tika.apache.org" <dev@tika.apache.org>
Date: Monday, March 30, 2015 at 8:18 AM
To: "dev@tika.apache.org" <dev@tika.apache.org>
Subject: Re: Broken build because of clirr plugin

I think, simple way would be to keep old methods (and mark them
@Deprecated) to avoid build failure. And use new ones internally.
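
A minimal sketch of that migration, assuming a hypothetical holder class named ExternalToolConfig (the thread does not name the affected class, and this is not the actual Tika code). The accessor names follow the setCommandLineArr/getCommandLineArr suggestion quoted further down; the whitespace split in the old setter is a simplifying assumption.

```java
/** Hypothetical sketch: keep the old String API for binary compatibility, store an array. */
class ExternalToolConfig {
    private String[] commandLine = new String[0];

    /** Old API, kept so existing callers and the compatibility check keep working. */
    @Deprecated
    public void setCommandLine(String commandLine) {
        String trimmed = commandLine.trim();
        // Naive whitespace split: quoted arguments containing spaces are not handled.
        setCommandLineArr(trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+"));
    }

    /** Old API: joins the stored arguments with single spaces. */
    @Deprecated
    public String getCommandLine() {
        return String.join(" ", commandLine);
    }

    /** New API: stores a defensive copy of the argument array. */
    public void setCommandLineArr(String[] commandLine) {
        this.commandLine = commandLine.clone();
    }

    /** New API: returns a copy so callers cannot mutate internal state. */
    public String[] getCommandLineArr() {
        return commandLine.clone();
    }
}
```

Because the deprecated methods only delegate to the array-based ones, there is a single source of truth, and the old signatures can be dropped in a later release without touching internal callers.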

I'll do `mvn verify` before committing this time. Sorry for the inconvenience.

-- 
Best regards,
Konstantin Gribov

Mon, 30 March 2015 at 18:09, Allison, Timothy B. <talli...@mitre.org>:

 How much of an effort would it be to migrate somewhat slowly:

 Leave in but deprecate setCommandLine(String) and String getCommandLine()

 Add something like: setCommandLineArr(String[]) and String[]
 getCommandLineArr()?

 -----Original Message-----
 From: Konstantin Gribov [mailto:gros...@gmail.com]
 Sent: Monday, March 30, 2015 11:00 AM
 To: dev@tika.apache.org
 Subject: Broken build because of clirr plugin

 Hi, folks.

 I've broken the build (by commit r1670105 for TIKA-1587).
 Should I revert this commit and change it to preserve the old API, or add
 an exclude to the clirr plugin configuration?

 --
 Best regards,
 Konstantin Gribov




Re: [DISCUSS] Tika 1.8 or 1.7.1

2015-03-31 Thread Mattmann, Chris A (3980)
+1 to running tika-batch and govdocs. Woot.





Re: including refactored docs from govdocs1 in test suite

2015-03-31 Thread Mattmann, Chris A (3980)
+1 to including the modified docs.


-----Original Message-----
From: "Allison, Timothy B." <talli...@mitre.org>
Reply-To: "dev@tika.apache.org" <dev@tika.apache.org>
Date: Monday, March 30, 2015 at 6:51 AM
To: "dev@tika.apache.org" <dev@tika.apache.org>
Subject: RE: including refactored docs from govdocs1 in test suite

I think this is an open question within Tika.  Some parsers prefer one
thing over another.  And there are different levels of corruption.

In the two cases where govdocs1 docs might be useful in tests, the
hyperlinks in .doc files do not appear to be standard, but MSWord
opens them without a problem.  In cases where an application can open and
correctly process the content, I think we ought to try to extract content
without throwing exceptions.

-----Original Message-----
From: Tyler Palsulich [mailto:tpalsul...@gmail.com]
Sent: Monday, March 30, 2015 9:39 AM
To: dev@tika.apache.org
Subject: RE: including refactored docs from govdocs1 in test suite

Ah. I see.

In general, what is the goal with handling corrupted files? Extract as much
as possible and fail gracefully?

Tyler

On Mar 30, 2015 9:32 AM, Allison, Timothy B. <talli...@mitre.org> wrote:

 Unfortunately, no.  MSOffice fixes the document when I do that.

 -----Original Message-----
 From: Tyler Palsulich [mailto:tpalsul...@gmail.com]
 Sent: Monday, March 30, 2015 9:24 AM
 To: dev@tika.apache.org
 Subject: Re: including refactored docs from govdocs1 in test suite

 Can you copy the hyperlink into a new doc and change the URL? I have no
 idea about including the modified version.

 Tyler
 On Mar 30, 2015 9:18 AM, Allison, Timothy B. <talli...@mitre.org> wrote:

  All,

  As part of TIKA-1512, I found that I can delete all of the contents,
  including the metadata, except for one hyperlink in two documents from
  govdocs1 and still get the proper behavior -- fail before fix, work after
  fix.

  These documents are in the public domain.

  Is it ok to include these modified documents in our test suite or should
  I avoid inclusion?

  Happy to avoid inclusion for the sake of a quick release of 1.8 and then
  we have time to discuss/determine way ahead... unless the answer is obvious.

  Best,

  Tim

  -Original Message-
  From: Allison, Timothy B. [mailto:talli...@mitre.org]
  Sent: Monday, March 30, 2015 7:03 AM
  To: dev@tika.apache.org
  Subject: RE: [DISCUSS] Tika 1.8 or 1.7.1
 
  Unless there are objections, I'd like these to be resolved before 1.8:
 
  TIKA-1584 -- I'll fix
  TIKA-1575 -- Resolved by Konstantin Gribov (thank you!)
  TIKA-1512 -- I'll put in a temporary fix so that we don't get IOOBEs,
but
  I'll leave this open and do some more digging to see if we need to
open
a
  ticket at the POI level
  TIKA-1511 -- I'll remove provided for xerial
 
  TIKA-1549 -- We should thank Toke Eskildsen in CHANGES.txt, no?
 
  I'll have these fixes completed by noon EDT.  Should I run against
  govdocs1 before or after the RC?
 
  My last build of Tika app (a few days ago) ballooned to ~43MB, and
that's
  before I add ~3MB for xerial.  Tika server is now ~48MB.  As of my
last
  build, we are still including ~4MB of pdfs (README.NLDAS1.pdf and
  README.NLDAS2.pdf) from the GRIB(?) parser in the tika-app and
tika-server
  jars.
 
  Best,
 
Tim
 
 
 
  -Original Message-
  From: Tyler Palsulich [mailto:tpalsul...@gmail.com]
  Sent: Sunday, March 29, 2015 9:13 AM
  To: dev@tika.apache.org
  Subject: Re: [DISCUSS] Tika 1.8 or 1.7.1
 
  Once TIKA-1584 and TIKA-1575 are resolved, I'll work up an RC (unless
  something else pops up).
 
  Thank you everyone.
 
  Tyler
  On Mar 29, 2015 4:43 AM, Hong-Thai Nguyen thaicha...@gmail.com wrote:
 
   +1 for 1.8
  
   Hong-Thai
  
    On 28 Mar 2015, at 16:01, Tyler Palsulich tpalsul...@apache.org wrote:

    Hi Folks,

    Now that TIKA-1581 (JHighlight licensing issues) is resolved, we need
    to release a new version of Tika. I'll volunteer to be the release
    manager again.

    Should we release this as 1.8 or 1.7.1?

    Does anyone have any last minute issues they'd like to finish and see
    in Tika 1.X? I'd like to get the example working with CORS (TIKA-1585
    and TIKA-1586). Any others?

    Have a good weekend,
    Tyler
  
 



Re: svn commit: r1670135 - /tika/trunk/CHANGES.txt

2015-03-31 Thread Mattmann, Chris A (3980)
Thanks Ken! :)

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: thaicha...@apache.org thaicha...@apache.org
Reply-To: dev@tika.apache.org dev@tika.apache.org
Date: Monday, March 30, 2015 at 9:05 AM
To: comm...@tika.apache.org comm...@tika.apache.org
Subject: svn commit: r1670135 - /tika/trunk/CHANGES.txt

Author: thaichat04
Date: Mon Mar 30 16:05:17 2015
New Revision: 1670135

URL: http://svn.apache.org/r1670135
Log:
TIKA-1581 - Mention @kkrugler thanks in CHANGES.txt

Modified:
tika/trunk/CHANGES.txt

Modified: tika/trunk/CHANGES.txt
URL: http://svn.apache.org/viewvc/tika/trunk/CHANGES.txt?rev=1670135&r1=1670134&r2=1670135&view=diff
==

--- tika/trunk/CHANGES.txt (original)
+++ tika/trunk/CHANGES.txt Mon Mar 30 16:05:17 2015
@@ -8,7 +8,8 @@ Release 1.8 - Current Development
   * Tika server can now enable CORS requests with the command line
 --cors option (TIKA-1586).
 
-  * Update jhighlight dependency to avoid using LGPL license (TIKA-1581)
+  * Update jhighlight dependency to avoid using LGPL license (TIKA-1581).
+  Thank @kkrugler for his great contribution
   
   * Updated HDF and NetCDF parsers to output file version in
 metadata (TIKA-1578 and TIKA-1579).





[GitHub] tika pull request: Refactor TIKA-1558. Remove service loading blac...

2015-03-31 Thread tpalsulich
GitHub user tpalsulich opened a pull request:

https://github.com/apache/tika/pull/39

Refactor TIKA-1558. Remove service loading blacklist

* Remove all direct service loading logic regarding a blacklist.
* Small changes to CompositeParser logic to make sure subclasses of 
excluded Parsers are also excluded.
* Added new testing in the tika-core module to test regular and subclass 
exclusion.
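
The subclass exclusion described in the list above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual CompositeParser change: the Parser, TextParser, ExternalParser and OcrParser types below are hypothetical stand-ins, and the filter method is an invented helper. The one real API it relies on is Class.isAssignableFrom, which is true for a class and for every subclass of it.

```java
import java.util.ArrayList;
import java.util.List;

public class ExclusionSketch {
    // Hypothetical stand-ins for parser classes; these are not Tika's real types.
    interface Parser {}
    static class TextParser implements Parser {}
    static class ExternalParser implements Parser {}
    static class OcrParser extends ExternalParser {}

    /** Keeps only the candidates that are neither an excluded class nor a subclass of one. */
    static List<Parser> filter(List<Parser> candidates,
                               List<Class<? extends Parser>> excluded) {
        List<Parser> kept = new ArrayList<>();
        for (Parser candidate : candidates) {
            boolean isExcluded = false;
            for (Class<? extends Parser> excludedClass : excluded) {
                // True when candidate is an instance of excludedClass or of any subclass.
                if (excludedClass.isAssignableFrom(candidate.getClass())) {
                    isExcluded = true;
                    break;
                }
            }
            if (!isExcluded) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Parser> all = new ArrayList<>();
        all.add(new TextParser());
        all.add(new ExternalParser());
        all.add(new OcrParser());

        List<Class<? extends Parser>> excluded = new ArrayList<>();
        excluded.add(ExternalParser.class);

        // Excluding ExternalParser also drops its subclass OcrParser.
        for (Parser p : filter(all, excluded)) {
            System.out.println(p.getClass().getSimpleName()); // prints "TextParser"
        }
    }
}
```

Comparing with isAssignableFrom rather than with class equality is what makes the exclusion carry over to subclasses.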

@Gagravarr, can you look this over?

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/tpalsulich/tika TIKA-1558

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/39.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #39


commit 7e38e3cdef3f5ae11d45863c67c6216561802a32
Author: Tyler Palsulich tpalsul...@gmail.com
Date:   2015-03-31T17:05:19Z

Refactor TIKA-1558. Remove service loading blacklist and ensure subclasses 
are also excluded.






[GitHub] tika pull request: fix for TIKA-1589 contributed by mdaniline

2015-03-31 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/38




[jira] [Commented] (TIKA-1589) Mp3 parser does not add duration to metadata if there are no ID3 tags

2015-03-31 Thread Nick Burch (JIRA)

[ https://issues.apache.org/jira/browse/TIKA-1589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388505#comment-14388505 ]

Nick Burch commented on TIKA-1589:
--

Applied with small tweaks in r1670330.

(You seem to have slightly different import-formatting rules to everyone else, 
might be worth double checking that before you next patch)

Thanks!

 Mp3 parser does not add duration to metadata if there are no ID3 tags
 -

 Key: TIKA-1589
 URL: https://issues.apache.org/jira/browse/TIKA-1589
 Project: Tika
  Issue Type: Bug
Reporter: Max Daniline

 Steps to reproduce:
 * Have a file without any ID3 tags (v1 or v2)
 * Parse the file
 * Attempt to retrieve the duration by calling 'metadata.get(XMPDM.DURATION)'.
 Expected result:
 The duration should be set even for a file without ID3 tags, since it is 
 independent information.
 Actual result:
 The duration is null
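
 To illustrate why the duration is independent of ID3 tags: playback length follows from the audio frames alone. The sketch below is an assumption-laden illustration and is not taken from Tika's Mp3Parser. The class and method names are hypothetical, and it assumes a constant sample rate and MPEG-1 Layer III audio, which carries 1152 samples per frame.

```java
public class Mp3DurationSketch {
    /** Samples per frame for MPEG-1 Layer III audio. */
    static final int MPEG1_LAYER3_SAMPLES_PER_FRAME = 1152;

    /**
     * Hypothetical helper: duration in whole milliseconds (rounded down),
     * computed from the number of audio frames and the sample rate.
     * No tag data is involved.
     */
    static long durationMillis(long frameCount, int samplesPerFrame, int sampleRateHz) {
        if (frameCount < 0 || samplesPerFrame <= 0 || sampleRateHz <= 0) {
            throw new IllegalArgumentException("frame count, samples per frame and sample rate must be valid");
        }
        // total samples / samples per second = seconds; scale to milliseconds first.
        return frameCount * samplesPerFrame * 1000L / sampleRateHz;
    }

    public static void main(String[] args) {
        // 1000 frames of 44.1 kHz MPEG-1 Layer III audio.
        long ms = durationMillis(1000, MPEG1_LAYER3_SAMPLES_PER_FRAME, 44100);
        System.out.println(ms + " ms"); // prints "26122 ms"
    }
}
```

 A file with no ID3v1 or ID3v2 tags still contains these frames, so a duration can be reported for it.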



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1589) Mp3 parser does not add duration to metadata if there are no ID3 tags

2015-03-31 Thread ASF GitHub Bot (JIRA)

[ https://issues.apache.org/jira/browse/TIKA-1589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388504#comment-14388504 ]

ASF GitHub Bot commented on TIKA-1589:
--

Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/38







[jira] [Resolved] (TIKA-1589) Mp3 parser does not add duration to metadata if there are no ID3 tags

2015-03-31 Thread Nick Burch (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Burch resolved TIKA-1589.
--
   Resolution: Fixed
Fix Version/s: 1.8

 Mp3 parser does not add duration to metadata if there are no ID3 tags
 -

 Key: TIKA-1589
 URL: https://issues.apache.org/jira/browse/TIKA-1589
 Project: Tika
  Issue Type: Bug
Reporter: Max Daniline
 Fix For: 1.8


 Steps to reproduce:
 * Have a file without any ID3 tags (v1 or v2)
 * Parse the file
 * Attempt to retrieve the duration by calling 'metadata.get(XMPDM.DURATION)'.
 Expected result:
 The duration should be set even for a file without ID3 tags, since it is 
 independent information.
 Actual result:
 The duration is null





[jira] [Commented] (TIKA-1558) Create a Parser Blacklist

2015-03-31 Thread Hudson (JIRA)

[ https://issues.apache.org/jira/browse/TIKA-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14389656#comment-14389656 ]

Hudson commented on TIKA-1558:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #592 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/592/])
TIKA-1558. Better error message and fix typo. (tpalsulich: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670490)
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/config/TikaParserConfigTest.java
TIKA-1558. Refactor Parser blacklisting. (tpalsulich: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670487)
* /tika/trunk/CHANGES.txt
* /tika/trunk/tika-core/src/main/java/org/apache/tika/config/ServiceLoader.java
* /tika/trunk/tika-core/src/main/java/org/apache/tika/parser/CompositeParser.java
* /tika/trunk/tika-core/src/test/java/org/apache/tika/parser/BlacklistedParser.java
* /tika/trunk/tika-core/src/test/java/org/apache/tika/parser/BlacklistedParserSubclass.java
* /tika/trunk/tika-core/src/test/java/org/apache/tika/parser/BlacklistedParserTest.java
* /tika/trunk/tika-core/src/test/resources/META-INF
* /tika/trunk/tika-core/src/test/resources/org/apache/tika/parser/blacklist2_file.blacklist2
* /tika/trunk/tika-core/src/test/resources/org/apache/tika/parser/blacklist_file.blacklist
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/config/TikaParserConfigTest.java
* /tika/trunk/tika-parsers/src/test/resources/org/apache/tika/config/TIKA-1558-blacklistsub.xml


 Create a Parser Blacklist
 -

 Key: TIKA-1558
 URL: https://issues.apache.org/jira/browse/TIKA-1558
 Project: Tika
  Issue Type: New Feature
Reporter: Tyler Palsulich
Assignee: Tyler Palsulich
 Fix For: 1.8


 As talked about in TIKA-1555 and TIKA-1557, it would be nice to be able to 
 disable Parsers without pulling their dependencies out. In some cases (e.g. 
 disable all ExternalParsers), there may not be an easy way to exclude the 
 dependencies via Maven.
 -So, an initial design would be to include another file like 
 {{META-INF/services/org.apache.tika.parser.Parser.blacklist}}. We create a 
 new method {{ServiceLoader#loadServiceProviderBlacklist}}. Then, in 
 {{ServiceLoader#loadServiceProviders}}, we remove all elements of the list 
 that are assignable to an element in 
 {{ServiceLoader#loadServiceProviderBlacklist}}.-





[jira] [Updated] (TIKA-1558) Create a Parser Blacklist

2015-03-31 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich updated TIKA-1558:
--
Description: 
As talked about in TIKA-1555 and TIKA-1557, it would be nice to be able to 
disable Parsers without pulling their dependencies out. In some cases (e.g. 
disable all ExternalParsers), there may not be an easy way to exclude the 
dependencies via Maven.

-So, an initial design would be to include another file like 
{{META-INF/services/org.apache.tika.parser.Parser.blacklist}}. We create a new 
method {{ServiceLoader#loadServiceProviderBlacklist}}. Then, in 
{{ServiceLoader#loadServiceProviders}}, we remove all elements of the list that 
are assignable to an element in {{ServiceLoader#loadServiceProviderBlacklist}}.-

  was:
As talked about in TIKA-1555 and TIKA-1557, it would be nice to be able to 
disable Parsers without pulling their dependencies out. In some cases (e.g. 
disable all ExternalParsers), there may not be an easy way to exclude the 
dependencies via Maven.

So, an initial design would be to include another file like 
{{META-INF/services/org.apache.tika.parser.Parser.blacklist}}. We create a new 
method {{ServiceLoader#loadServiceProviderBlacklist}}. Then, in 
{{ServiceLoader#loadServiceProviders}}, we remove all elements of the list that 
are assignable to an element in {{ServiceLoader#loadServiceProviderBlacklist}}.







[jira] [Comment Edited] (TIKA-1558) Create a Parser Blacklist

2015-03-31 Thread Tyler Palsulich (JIRA)

[ https://issues.apache.org/jira/browse/TIKA-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1432#comment-1432 ]

Tyler Palsulich edited comment on TIKA-1558 at 3/31/15 9:41 PM:


-Above strategy added in r1661284. You can now blacklist Parsers by adding 
names to {{META-INF/services/org.apache.tika.parser.Parser.blacklist}} with the 
same format as the normal services file. If a class is blacklisted, all of its 
subclasses are automatically blacklisted.-

Edit: Service loading blacklisting disabled in r1670487. Use a custom 
TikaConfig like [this 
one|https://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/test/resources/org/apache/tika/config/TIKA-1558-blacklistsub.xml]
 to disable a Parser. Any subclasses of that Parser will also be excluded.


was (Author: tpalsulich):
Above strategy added in r1661284. You can now blacklist Parsers by adding names 
to {{META-INF/services/org.apache.tika.parser.Parser.blacklist}} with the same 
format as the normal services file. If a class is blacklisted, all of its 
subclasses are automatically blacklisted.






[jira] [Commented] (TIKA-1558) Create a Parser Blacklist

2015-03-31 Thread ASF GitHub Bot (JIRA)

[ https://issues.apache.org/jira/browse/TIKA-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14389464#comment-14389464 ]

ASF GitHub Bot commented on TIKA-1558:
--

Github user tpalsulich closed the pull request at:

https://github.com/apache/tika/pull/39


 Create a Parser Blacklist
 -

 Key: TIKA-1558
 URL: https://issues.apache.org/jira/browse/TIKA-1558
 Project: Tika
  Issue Type: New Feature
Reporter: Tyler Palsulich
Assignee: Tyler Palsulich
 Fix For: 1.8


 As talked about in TIKA-1555 and TIKA-1557, it would be nice to be able to 
 disable Parsers without pulling their dependencies out. In some cases (e.g. 
 disable all ExternalParsers), there may not be an easy way to exclude the 
 dependencies via Maven.
 So, an initial design would be to include another file like 
 {{META-INF/services/org.apache.tika.parser.Parser.blacklist}}. We create a 
 new method {{ServiceLoader#loadServiceProviderBlacklist}}. Then, in 
 {{ServiceLoader#loadServiceProviders}}, we remove all elements of the list 
 that are assignable to an element in 
 {{ServiceLoader#loadServiceProviderBlacklist}}.


