[jira] [Commented] (TIKA-1676) Fix logic error in batch driver that prevents correct restarting of child process

2015-07-10 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14621954#comment-14621954
 ] 

Hudson commented on TIKA-1676:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #785 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/785/])
TIKA-1676 (tallison: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1690090)
* 
/tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/BatchProcessDriverCLI.java


 Fix logic error in batch driver that prevents correct restarting of child 
 process
 -

 Key: TIKA-1676
 URL: https://issues.apache.org/jira/browse/TIKA-1676
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.8, 1.9
Reporter: Tim Allison
 Fix For: 1.10


 Thanks to work on TIKA-1285, I discovered a logic bug in the driver process 
 that prevents correct restarting of the child process.  This should only 
 happen under very heavy load, but this needs to be fixed asap.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available

2015-07-10 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1285:
--
Attachment: pdfbox_reports_2_0_0_20150709.zip

First dump of stack traces in govdocs1 from the integration with pdfbox 2.0.0.

Notes:
I stopped the batch run early.  This only covered ~50k pdfs.

I forgot to turn on accesspermission checking.  Some of the pdfs in here would 
normally have been skipped.

I haven't reviewed any of the exceptions.  They may be caused by code on the 
Tika side.


 Upgrade to PDFBox 2.0.0 when available
 --

 Key: TIKA-1285
 URL: https://issues.apache.org/jira/browse/TIKA-1285
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.6
Reporter: Jeremy Anderson
Priority: Minor
 Attachments: TIKA-1285.patch, TIKA-1285_rev1641423.patch, 
 TIKA-1285v3.patch, pdfbox_reports_2_0_0_20150709.zip, 
 testPDF_childAttachments.pdf


 This issue is to track fixes required when upgrading the PDFBox dependency to 
 2.0.0 Final once it's available, and using PDFBox's daily build before then.
 See TIKA-1268 comment.
 Relates to PDFBOX-1893





[jira] [Updated] (TIKA-1637) Oracle internal API jdeps request for information

2015-07-10 Thread Dave Meikle (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-1637:
--
Attachment: tika-1.9.jdeps.txt

JDeps output for the Tika output JAR artefacts. 

 Oracle internal API jdeps request for information
 -

 Key: TIKA-1637
 URL: https://issues.apache.org/jira/browse/TIKA-1637
 Project: Tika
  Issue Type: Task
Reporter: Dave Meikle
Assignee: Dave Meikle
Priority: Trivial
 Attachments: tika-1.9.jdeps.txt


 We have been asked to provide Oracle with information about internal API 
 usage in Apache Tika to support the move to JDK 9, which contains significant 
 changes.
 {quote}
 Hi David,
 My name is Rory O'Donnell, I am the OpenJDK Quality Group Lead.  
 I'm contacting you because your open source project seems to be a very 
 popular dependency for other open source projects.
 As part of the preparations for JDK 9, Oracle’s engineers have been analyzing 
 open source projects like yours to understand usage. One area of concern 
 involves identifying compatibility problems, such as reliance on JDK-internal 
 APIs. 
 Our engineers have already prepared guidance on migrating some of the more 
 common usage patterns of JDK-internal APIs to supported public interfaces.  
 The list is on the OpenJDK wiki [0].
 As part of the ongoing development of JDK 9, I would like to inquire about 
 your usage of  JDK-internal APIs and to encourage migration towards supported 
 Java APIs if necessary.
 The first step is to identify if your application(s) is leveraging internal 
 APIs. 
   Step 1: Download JDeps. 
 Just download a preview release of JDK8 (JDeps Download). You do not need to 
 actually test or run your application on JDK8.  JDeps (Docs) looks through JAR 
 files and identifies which JAR files use internal APIs and then lists those 
 APIs.
   Step 2: To run JDeps against an application. The command looks like:
 jdk8/bin/jdeps -P -jdkinternals *.jar > your-application.jdeps.txt
 The output inside your-application.jdeps.txt will look like:
 your.package (Filename.jar)
   -> com.sun.corba.se    JDK internal API (rt.jar)
 3rd party library using Internal APIs:
 If your analysis uncovers a third-party component that you rely on, you can 
 contact the provider and let them know of the upcoming changes. You can then 
 either work with the provider to get an updated library that won't rely on 
 Internal APIs, or you can find an alternative provider for the capabilities 
 that the offending library provides.
 Dynamic use of Internal APIs:
 JDeps can not detect dynamic use of internal APIs, for example through 
 reflection, service loaders and similar mechanisms.
 Rgds,Rory 
 [0] https://wiki.openjdk.java.net/display/JDK8/Java+Dependency+Analysis+Tool
 -- 
 Rgds,Rory O'Donnell
 Quality Engineering Manager
 Oracle EMEA , Dublin, Ireland 
 {quote}
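
The limitation mentioned in the quoted message, that jdeps cannot detect dynamic use of internal APIs, can be illustrated with a minimal Java sketch. The class and method names below are hypothetical and are not taken from Tika. The point is only that a class name passed to Class.forName is stored as a string constant rather than as a class reference, so a static scan of the JAR has nothing to report.

```java
class DynamicLookupSketch {

    // Resolves a class by name at run time. The name is an ordinary string,
    // so static dependency analysis sees no reference to the class itself.
    static String describe(String className) {
        try {
            Class<?> c = Class.forName(className);
            return "loaded " + c.getName();
        } catch (ClassNotFoundException | LinkageError e) {
            return "unavailable: " + className;
        }
    }

    public static void main(String[] args) {
        // A public class that is always present.
        System.out.println(describe("java.lang.String"));
        // A JDK-internal class; whether it loads depends on the JDK in use.
        System.out.println(describe("sun.misc.BASE64Encoder"));
    }
}
```

Code that follows this pattern has to be found by searching the source for string literals or by testing on the target JDK.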





[jira] [Commented] (TIKA-1362) Add GoogleTranslate implementation of Translation API

2015-07-10 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14623246#comment-14623246
 ] 

Tyler Palsulich commented on TIKA-1362:
---

If you have a pressing need for better configuration abilities for the Google 
Translator, feel free to open up a new issue and upload a patch! :) We'd be 
happy to help you get started. Check out the [contributing 
page|https://tika.apache.org/contribute.html] for some general information.

 Add GoogleTranslate implementation of Translation API
 -

 Key: TIKA-1362
 URL: https://issues.apache.org/jira/browse/TIKA-1362
 Project: Tika
  Issue Type: Bug
  Components: translation
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
 Fix For: 1.6


 Add an implementation of the Translation API that uses the Google Translate 
 v2 API and Apache CXF: 
 https://www.googleapis.com/language/translate/v2





[jira] [Commented] (TIKA-1602) Detecting standards-non-compliant emails as message/rfc822

2015-07-10 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622663#comment-14622663
 ] 

Chris A. Mattmann commented on TIKA-1602:
-

Got it, [~jeremybmerrill] - can you open a new pull request and JIRA issue and 
send 'em along?

 Detecting standards-non-compliant emails as message/rfc822
 --

 Key: TIKA-1602
 URL: https://issues.apache.org/jira/browse/TIKA-1602
 Project: Tika
  Issue Type: New Feature
  Components: mime
Reporter: Jeremy B. Merrill
Assignee: Chris A. Mattmann
Priority: Minor
 Fix For: 1.10

 Attachments: 036491.txt.zip

   Original Estimate: 1h
  Remaining Estimate: 1h

 Tika does not properly detect certain emails as `message/rfc822` if they're 
 slightly standards-non-compliant and begin with `Status: ` as the first 
 header. I've added `Status: ` as a magic detection line in 
 tika-mimetypes.xml. 
 This solves my problem and does not appear to cause unit test failures. I 
 have not yet run the tika-batch tests.
 As further information, the emails that are processed incorrectly come from 
 dumps directly from various US public officials' mailservers. I believe the 
 dumps are sometimes slightly non-compliant because they're not intended to be 
 transmitted over the wire.
 It's important to note that Tika (and the underlying library, James Mime4J) 
 do properly *parse* these emails, despite the non-compliant header. The 
 problem is getting Tika to *detect* the file as an email so that Mime4J gets 
 chosen to parse it.
 Pull request on Github at https://github.com/apache/tika/pull/40
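
For illustration only, the kind of magic-prefix check described above can be sketched in plain Java. This is not Tika's detector, which reads its magic patterns from tika-mimetypes.xml. The class name, the prefixes other than `Status: `, and the fallback type are assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

class MagicPrefixSketch {

    // Header prefixes treated as evidence of an email. "Status: " is the
    // addition discussed in this issue; the other two are illustrative.
    private static final List<String> PREFIXES =
            Arrays.asList("From: ", "Received: ", "Status: ");

    // True if the data begins with any of the known prefixes.
    static boolean startsWithKnownHeader(byte[] data) {
        for (String prefix : PREFIXES) {
            byte[] magic = prefix.getBytes(StandardCharsets.US_ASCII);
            if (data.length >= magic.length
                    && Arrays.equals(Arrays.copyOfRange(data, 0, magic.length), magic)) {
                return true;
            }
        }
        return false;
    }

    // Maps the prefix check to a media type, with a generic fallback.
    static String detect(byte[] data) {
        return startsWithKnownHeader(data) ? "message/rfc822" : "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] mail = "Status: RO\nFrom: someone@example.org\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(mail));
    }
}
```

A short prefix such as `Status: ` could also match files that are not emails, so a real magic entry would need to be weighed against other detection rules.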





[jira] [Updated] (TIKA-1637) Oracle internal API jdeps request for information

2015-07-10 Thread Dave Meikle (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-1637:
--
Attachment: tika-server-1.9.jdeps.txt
tika-app-1.9.jdeps.txt

jdeps output for tika-1.9-app and tika-1.9-server JARs

 Oracle internal API jdeps request for information
 -

 Key: TIKA-1637
 URL: https://issues.apache.org/jira/browse/TIKA-1637
 Project: Tika
  Issue Type: Task
Reporter: Dave Meikle
Assignee: Dave Meikle
Priority: Trivial
 Attachments: tika-1.9.jdeps.txt, tika-app-1.9.jdeps.txt, 
 tika-server-1.9.jdeps.txt




[jira] [Resolved] (TIKA-1300) Switch default PDFBox parser to NonSequentialParser

2015-07-10 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-1300.
---
   Resolution: Duplicate
Fix Version/s: (was: 1.10)

It looks like PDFBox 2.0.0 is coming soon.  Let's skip this half step and rip 
the bandaid off in one go when PDFBox 2.0.0 is available.

 Switch default PDFBox parser to NonSequentialParser
 ---

 Key: TIKA-1300
 URL: https://issues.apache.org/jira/browse/TIKA-1300
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Tim Allison
Assignee: Tim Allison
Priority: Minor
 Attachments: tika_1_6_ClassicsVsNonSeq.zip


 On TIKA-1298, [~tilman] recommended switching Tika's default to the 
 NonSequentialParser. We added a parameter to use the NonSequentialParser in 
 TIKA-1201, and there's some good discussion there about the benefits.
 Is the community in favor of switching the default now?





[jira] [Commented] (TIKA-1362) Add GoogleTranslate implementation of Translation API

2015-07-10 Thread Tristan Nixon (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622982#comment-14622982
 ] 

Tristan Nixon commented on TIKA-1362:
-

Storing the API key in the properties file is very cumbersome and difficult to 
use, especially as this is not documented well (I had to dig into the source to 
figure out the package path and property key name). Why not just provide a 
constructor argument or setAPIKey( String key ) method? This is how the 
MicrosoftTranslator works. At the least some consistency across implementations 
and some improved documentation would be very much appreciated. Thanks!
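
The suggestion above can be sketched as follows. This is a hypothetical class, not Tika's GoogleTranslator or MicrosoftTranslator. The class name, the property key `example.translator.api-key`, and the `isAvailable` check are all assumptions for illustration. It shows a key being supplied through a constructor or a setter while a properties-based path remains as an option.

```java
import java.util.Properties;

class ExampleTranslator {

    // Hypothetical property key, used only by this sketch.
    static final String KEY_PROPERTY = "example.translator.api-key";

    private String apiKey;

    // No key yet; the caller is expected to use setAPIKey later.
    ExampleTranslator() {
    }

    // Key supplied directly by the caller.
    ExampleTranslator(String apiKey) {
        this.apiKey = apiKey;
    }

    // Properties-based configuration kept as an alternative path.
    static ExampleTranslator fromProperties(Properties props) {
        return new ExampleTranslator(props.getProperty(KEY_PROPERTY));
    }

    void setAPIKey(String apiKey) {
        this.apiKey = apiKey;
    }

    // The translator is usable only once a non-empty key is present.
    boolean isAvailable() {
        return apiKey != null && !apiKey.isEmpty();
    }

    public static void main(String[] args) {
        ExampleTranslator translator = new ExampleTranslator();
        translator.setAPIKey("my-key");
        System.out.println(translator.isAvailable());
    }
}
```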



[jira] [Commented] (TIKA-1362) Add GoogleTranslate implementation of Translation API

2015-07-10 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622994#comment-14622994
 ] 

Nick Burch commented on TIKA-1362:
--

[~tnixon] We're currently debating the best way to handle configuring those 
parsers and translators which need it. Once we have a conclusion, we'll be able 
to unify things across all the translators. If you'd like to join in the 
discussions, the thread on dev@tika has the subject "Configuring parsers and 
translators".



[jira] [Commented] (TIKA-1362) Add GoogleTranslate implementation of Translation API

2015-07-10 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622996#comment-14622996
 ] 

Chris A. Mattmann commented on TIKA-1362:
-

What Nick said - we're improving this, and please join the discussion. Thanks.



[jira] [Commented] (TIKA-1362) Add GoogleTranslate implementation of Translation API

2015-07-10 Thread Tristan Nixon (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622999#comment-14622999
 ] 

Tristan Nixon commented on TIKA-1362:
-

Great to hear, and thanks for the invite. I'm new to using Tika, but finding it 
immensely useful. I'd be happy to contribute in whatever way I can.
