[jira] [Commented] (TIKA-1676) Fix logic error in batch driver that prevents correct restarting of child process
[ https://issues.apache.org/jira/browse/TIKA-1676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14621954#comment-14621954 ]

Hudson commented on TIKA-1676:
--
SUCCESS: Integrated in tika-trunk-jdk1.7 #785 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/785/])
TIKA-1676 (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1690090)
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/BatchProcessDriverCLI.java

Fix logic error in batch driver that prevents correct restarting of child process
-
Key: TIKA-1676
URL: https://issues.apache.org/jira/browse/TIKA-1676
Project: Tika
Issue Type: Bug
Affects Versions: 1.8, 1.9
Reporter: Tim Allison
Fix For: 1.10

Thanks to work on TIKA-1285, I discovered a logic bug in the driver process that prevents correct restarting of the child process. This should only happen under very heavy load, but this needs to be fixed asap.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-1285:
--
Attachment: pdfbox_reports_2_0_0_20150709.zip

First dump of stack traces in govdocs1 from the integration with PDFBox 2.0.0.

Notes:
I stopped the batch run early. This only covered ~50k PDFs.
I forgot to turn on accesspermission checking. Some of the PDFs in here would normally have been skipped.
I haven't reviewed any of the exceptions. They may be caused by code on the Tika side.

Upgrade to PDFBox 2.0.0 when available
--
Key: TIKA-1285
URL: https://issues.apache.org/jira/browse/TIKA-1285
Project: Tika
Issue Type: Improvement
Components: parser
Affects Versions: 1.6
Reporter: Jeremy Anderson
Priority: Minor
Attachments: TIKA-1285.patch, TIKA-1285_rev1641423.patch, TIKA-1285v3.patch, pdfbox_reports_2_0_0_20150709.zip, testPDF_childAttachments.pdf

This issue is to track fixes required when upgrading the PDFBox dependency to 2.0.0 Final once it's available, and using PDFBox's daily build before then. See TIKA-1268 comment. Relates to PDFBOX-1893
[jira] [Updated] (TIKA-1637) Oracle internal API jdeps request for information
[ https://issues.apache.org/jira/browse/TIKA-1637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dave Meikle updated TIKA-1637:
--
Attachment: tika-1.9.jdeps.txt

JDeps output for the Tika output JAR artefacts.

Oracle internal API jdeps request for information
-
Key: TIKA-1637
URL: https://issues.apache.org/jira/browse/TIKA-1637
Project: Tika
Issue Type: Task
Reporter: Dave Meikle
Assignee: Dave Meikle
Priority: Trivial
Attachments: tika-1.9.jdeps.txt

We have been asked to provide information to Oracle around the internal API usage in Apache Tika to support the move to JDK 9, which contains significant changes.

{quote}
Hi David,

My name is Rory O'Donnell, I am the OpenJDK Quality Group Lead. I'm contacting you because your open source project seems to be a very popular dependency for other open source projects.

As part of the preparations for JDK 9, Oracle’s engineers have been analyzing open source projects like yours to understand usage. One area of concern involves identifying compatibility problems, such as reliance on JDK-internal APIs. Our engineers have already prepared guidance on migrating some of the more common usage patterns of JDK-internal APIs to supported public interfaces. The list is on the OpenJDK wiki [0].

As part of the ongoing development of JDK 9, I would like to inquire about your usage of JDK-internal APIs and to encourage migration towards supported Java APIs if necessary. The first step is to identify if your application(s) is leveraging internal APIs.

Step 1: Download JDeps.
Just download a preview release of JDK8 (JDeps Download). You do not need to actually test or run your application on JDK8. JDeps (Docs) looks through JAR files and identifies which JAR files use internal APIs and then lists those APIs.

Step 2: To run JDeps against an application.
The command looks like:
jdk8/bin/jdeps -P -jdkinternals *.jar > your-application.jdeps.txt

The output inside your-application.jdeps.txt will look like:
your.package (Filename.jar)
      -> com.sun.corba.se            JDK internal API (rt.jar)

3rd party library using Internal APIs:
If your analysis uncovers a third-party component that you rely on, you can contact the provider and let them know of the upcoming changes. You can then either work with the provider to get an updated library that won't rely on Internal APIs, or you can find an alternative provider for the capabilities that the offending library provides.

Dynamic use of Internal APIs:
JDeps can not detect dynamic use of internal APIs, for example through reflection, service loaders and similar mechanisms.

Rgds,Rory

[0] https://wiki.openjdk.java.net/display/JDK8/Java+Dependency+Analysis+Tool

-- Rgds,Rory O'Donnell Quality Engineering Manager Oracle EMEA , Dublin, Ireland
{quote}
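The note about dynamic use can be illustrated with a small Java sketch. The class `DynamicLookupDemo` and its method are hypothetical names chosen for illustration and are not Tika code. The sketch resolves a class from a name that is only assembled at run time, so the compiled class file carries no static reference to the target type for a static analyzer such as jdeps to report.

```java
// Hypothetical illustration, not Tika code: the target class is named only
// by strings joined at run time, so the bytecode has no static type
// reference to it.
final class DynamicLookupDemo {

    // Returns true if the class named packageName + "." + simpleName
    // can be loaded reflectively on the running JDK.
    static boolean isLoadable(String packageName, String simpleName) {
        String className = packageName + "." + simpleName; // built at run time
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // The class is absent or cannot be loaded on this JDK.
            return false;
        }
    }
}
```

Calling `DynamicLookupDemo.isLoadable("java.lang", "String")` exercises the same code path with a public API class. A lookup of a JDK-internal class written this way would be equally invisible to a static scan, so code of this shape has to be reviewed by hand.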
[jira] [Commented] (TIKA-1362) Add GoogleTranslate implementation of Translation API
[ https://issues.apache.org/jira/browse/TIKA-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14623246#comment-14623246 ]

Tyler Palsulich commented on TIKA-1362:
---
If you have a pressing need for better configuration abilities for the Google Translator, feel free to open up a new issue and upload a patch! :) We'd be happy to help you get started. Check out the [contributing page|https://tika.apache.org/contribute.html] for some general information.

Add GoogleTranslate implementation of Translation API
-
Key: TIKA-1362
URL: https://issues.apache.org/jira/browse/TIKA-1362
Project: Tika
Issue Type: Bug
Components: translation
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Fix For: 1.6

Add an implementation of the Translation API that uses the Google Translate v2 API and Apache CXF: https://www.googleapis.com/language/translate/v2
[jira] [Commented] (TIKA-1602) Detecting standards-non-compliant emails as message/rfc822
[ https://issues.apache.org/jira/browse/TIKA-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622663#comment-14622663 ]

Chris A. Mattmann commented on TIKA-1602:
-
Got it, [~jeremybmerrill] - can you open a new pull request and JIRA issue and send 'em along?

Detecting standards-non-compliant emails as message/rfc822
--
Key: TIKA-1602
URL: https://issues.apache.org/jira/browse/TIKA-1602
Project: Tika
Issue Type: New Feature
Components: mime
Reporter: Jeremy B. Merrill
Assignee: Chris A. Mattmann
Priority: Minor
Fix For: 1.10
Attachments: 036491.txt.zip
Original Estimate: 1h
Remaining Estimate: 1h

Tika does not properly detect certain emails as `message/rfc822` if they're slightly standards-non-compliant and begin with `Status: ` as the first header. I've added `Status: ` as a magic detection line in tika-mimetypes.xml. This solves my problem and does not appear to cause unit test failures. I have not yet run the tika-batch tests.

As further information, the emails that are processed incorrectly come from dumps directly from various US public officials' mailservers. The dumps are sometimes slightly non-compliant, I believe because they're not intended to be transmitted over the wire.

It's important to note that Tika (and the underlying library, James Mime4J) do properly *parse* these emails, despite the non-compliant header. The problem is getting Tika to *detect* the file as an email so that Mime4J gets chosen to parse it.

Pull request on GitHub at https://github.com/apache/tika/pull/40
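To make the detect-versus-parse distinction concrete, here is a minimal Java sketch of prefix-based ("magic") detection. It is not Tika's detector and does not reproduce the tika-mimetypes.xml syntax. The class `Rfc822PrefixSniffer` is hypothetical, and the prefix list is an assumption made for illustration, apart from `Status: `, which is the prefix the issue describes adding.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of magic-prefix detection, not Tika code.
final class Rfc822PrefixSniffer {

    // Header prefixes treated as evidence of an email. "Status: " is the one
    // named in the issue; the other entries are assumed here for illustration.
    private static final List<String> MAGIC_PREFIXES =
            Arrays.asList("Received: ", "From: ", "Status: ");

    // Returns true if the first bytes of a file start with a known prefix.
    static boolean looksLikeRfc822(byte[] head) {
        String start = new String(head, StandardCharsets.US_ASCII);
        for (String prefix : MAGIC_PREFIXES) {
            if (start.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }
}
```

With `Status: ` absent from the list, a dump whose first header is `Status: RO` would fall through to a generic type and never reach the mail parser, even though that parser could handle it. That is the failure mode the issue reports.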
[jira] [Updated] (TIKA-1637) Oracle internal API jdeps request for information
[ https://issues.apache.org/jira/browse/TIKA-1637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dave Meikle updated TIKA-1637:
--
Attachment: tika-server-1.9.jdeps.txt
tika-app-1.9.jdeps.txt

jdeps output for tika-1.9-app and tika-1.9-server JARs

Oracle internal API jdeps request for information
-
Key: TIKA-1637
URL: https://issues.apache.org/jira/browse/TIKA-1637
Project: Tika
Issue Type: Task
Reporter: Dave Meikle
Assignee: Dave Meikle
Priority: Trivial
Attachments: tika-1.9.jdeps.txt, tika-app-1.9.jdeps.txt, tika-server-1.9.jdeps.txt

We have been asked to provide information to Oracle around the internal API usage in Apache Tika to support the move to JDK 9, which contains significant changes.
[jira] [Resolved] (TIKA-1300) Switch default PDFBox parser to NonSequentialParser
[ https://issues.apache.org/jira/browse/TIKA-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison resolved TIKA-1300.
---
Resolution: Duplicate
Fix Version/s: (was: 1.10)

It looks like PDFBox 2.0.0 is coming soon. Let's skip this half step and rip the bandaid off in one go when PDFBox 2.0.0 is available.

Switch default PDFBox parser to NonSequentialParser
---
Key: TIKA-1300
URL: https://issues.apache.org/jira/browse/TIKA-1300
Project: Tika
Issue Type: Improvement
Components: parser
Reporter: Tim Allison
Assignee: Tim Allison
Priority: Minor
Attachments: tika_1_6_ClassicsVsNonSeq.zip

On TIKA-1298, [~tilman] recommended switching Tika's default to the NonSequentialParser. We added a parameter to use the NonSequentialParser in TIKA-1201, and there's some good discussion there about the benefits. Is the community in favor of switching the default now?
[jira] [Commented] (TIKA-1362) Add GoogleTranslate implementation of Translation API
[ https://issues.apache.org/jira/browse/TIKA-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622982#comment-14622982 ]

Tristan Nixon commented on TIKA-1362:
-
Storing the API key in the properties file is very cumbersome and difficult to use, especially as this is not documented well (I had to dig into the source to figure out the package path and property key name). Why not just provide a constructor argument or setAPIKey( String key ) method? This is how the MicrosoftTranslator works. At the least, some consistency across implementations and some improved documentation would be very much appreciated. Thanks!

Add GoogleTranslate implementation of Translation API
-
Key: TIKA-1362
URL: https://issues.apache.org/jira/browse/TIKA-1362
Project: Tika
Issue Type: Bug
Components: translation
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Fix For: 1.6

Add an implementation of the Translation API that uses the Google Translate v2 API and Apache CXF: https://www.googleapis.com/language/translate/v2
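The request in this comment can be sketched in Java as follows. This is only an illustration of the shape being asked for: `KeyedTranslatorSketch`, `fromProperties`, `setApiKey`, and `isAvailable` are hypothetical names, and nothing here is taken from the actual GoogleTranslator or MicrosoftTranslator source. The sketch shows one class that accepts its key from a constructor argument, a setter, or a properties lookup, so callers are not forced through a properties file.

```java
import java.util.Properties;

// Hypothetical sketch, not Tika code: three ways to supply an API key.
final class KeyedTranslatorSketch {

    private String apiKey;

    // No key yet; the caller is expected to use setApiKey later.
    KeyedTranslatorSketch() {
    }

    // Key supplied directly by the caller.
    KeyedTranslatorSketch(String apiKey) {
        this.apiKey = apiKey;
    }

    // Key read from a properties object; keyName is chosen by the caller.
    static KeyedTranslatorSketch fromProperties(Properties props, String keyName) {
        return new KeyedTranslatorSketch(props.getProperty(keyName));
    }

    void setApiKey(String apiKey) {
        this.apiKey = apiKey;
    }

    // True once a non-empty key has been provided by any route.
    boolean isAvailable() {
        return apiKey != null && !apiKey.isEmpty();
    }
}
```

Keeping the properties route as a factory method leaves file-based configuration possible while letting programmatic callers pass the key explicitly.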
[jira] [Commented] (TIKA-1362) Add GoogleTranslate implementation of Translation API
[ https://issues.apache.org/jira/browse/TIKA-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622994#comment-14622994 ]

Nick Burch commented on TIKA-1362:
--
[~tnixon] We're currently debating the best way to handle configuring those parsers and translators which need it. Once we have a conclusion, we'll be able to unify things across all the translators. If you'd like to join in the discussions, the thread on dev@tika has the subject "Configuring parsers and translators"

Add GoogleTranslate implementation of Translation API
-
Key: TIKA-1362
URL: https://issues.apache.org/jira/browse/TIKA-1362
Project: Tika
Issue Type: Bug
Components: translation
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Fix For: 1.6

Add an implementation of the Translation API that uses the Google Translate v2 API and Apache CXF: https://www.googleapis.com/language/translate/v2
[jira] [Commented] (TIKA-1362) Add GoogleTranslate implementation of Translation API
[ https://issues.apache.org/jira/browse/TIKA-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622996#comment-14622996 ]

Chris A. Mattmann commented on TIKA-1362:
-
What Nick said - we're improving this, and please join the discussion. Thanks

Add GoogleTranslate implementation of Translation API
-
Key: TIKA-1362
URL: https://issues.apache.org/jira/browse/TIKA-1362
Project: Tika
Issue Type: Bug
Components: translation
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Fix For: 1.6

Add an implementation of the Translation API that uses the Google Translate v2 API and Apache CXF: https://www.googleapis.com/language/translate/v2
[jira] [Commented] (TIKA-1362) Add GoogleTranslate implementation of Translation API
[ https://issues.apache.org/jira/browse/TIKA-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622999#comment-14622999 ]

Tristan Nixon commented on TIKA-1362:
-
Great to hear, and thanks for the invite. I'm new to using Tika, but finding it immensely useful. I'd be happy to contribute in whatever way I can.

Add GoogleTranslate implementation of Translation API
-
Key: TIKA-1362
URL: https://issues.apache.org/jira/browse/TIKA-1362
Project: Tika
Issue Type: Bug
Components: translation
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Fix For: 1.6

Add an implementation of the Translation API that uses the Google Translate v2 API and Apache CXF: https://www.googleapis.com/language/translate/v2