building tika from scratch without pulling 1.5-SNAPSHOT from the repository?
All,

Is there an easy way to build Tika from scratch without reliance on 1.5-SNAPSHOT in the mvn repository and without building the components in the correct order and then manually loading them into a local mvn repository? At the main level, I've been using a simple 'mvn package'.

Thank you.

Best,
Tim

Details below...

I modified the poi version to beta2 in the pom under parsers. When I package tika, mvn pulls beta2 from the repository when building parsers, and all is well. However, when mvn gets to building the xmp module, it pulls beta1, and beta1 is winding up in the tika-app.jar.

I manually removed beta1 from my local repository and tried to package with -o (offline) and I got the error below. It looks like mvn is pulling 1.5-SNAPSHOT from the maven repository and not from the local build.

[INFO] Copying 3 resources
[INFO]
[ERROR] BUILD ERROR
[INFO]
[INFO] Failed to resolve artifact.

Missing:
--
1) org.apache.poi:poi:jar:3.10-beta1

  Try downloading the file manually from the project website.

  Then, install it using the command:
      mvn install:install-file -DgroupId=org.apache.poi -DartifactId=poi -Dversion=3.10-beta1 -Dpackaging=jar -Dfile=/path/to/file

  Alternatively, if you host your own repository you can deploy the file there:
      mvn deploy:deploy-file -DgroupId=org.apache.poi -DartifactId=poi -Dversion=3.10-beta1 -Dpackaging=jar -Dfile=/path/to/file -Durl=[url] -DrepositoryId=[id]

  Path to dependency:
      1) org.apache.tika:tika-xmp:bundle:1.5-SNAPSHOT
      2) org.apache.tika:tika-parsers:jar:1.5-SNAPSHOT
      3) org.apache.poi:poi:jar:3.10-beta1
RE: building tika from scratch without pulling 1.5-SNAPSHOT from the repository?
Patience is an option, I guess. :)

Sorry for the non-information for those who already know, but I learned that a commit kicks off Jenkins, and it looks like a nightly build in Jenkins makes it into a maven repository (https://issues.apache.org/jira/browse/TIKA-162).

-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Thursday, September 19, 2013 2:20 PM
To: dev@tika.apache.org
Subject: building tika from scratch without pulling 1.5-SNAPSHOT from the repository?
[jira] [Reopened] (TIKA-1132) Parsing some XLS documents hangs entire JVM, requires kill -9
[ https://issues.apache.org/jira/browse/TIKA-1132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison reopened TIKA-1132:

Assignee: Tim Allison

Will add test case in Tika.

Parsing some XLS documents hangs entire JVM, requires kill -9

Key: TIKA-1132
URL: https://issues.apache.org/jira/browse/TIKA-1132
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.2, 1.3
Environment:
  Linux Suse:
    java version 1.7.0
    Java(TM) SE Runtime Environment (build 1.7.0-b147)
    Java HotSpot(TM) 64-Bit Server VM (build 21.0-b17, mixed mode)
  OSX 10.8.3:
    java version 1.7.0_06
    Java(TM) SE Runtime Environment (build 1.7.0_06-b24)
    Java HotSpot(TM) 64-Bit Server VM (build 23.2-b09, mixed mode)
Reporter: Ryan Krueger
Assignee: Tim Allison
Fix For: 1.5
Attachments: mod3.xlsx, mod.xls

Some XLS documents hang the entire JVM. A control-C or regular kill won't stop the JVM, a kill -9 is required.

We're running within an email server application parsing documents to extract text of all attachments. When we hit a message with the affected attachment the entire JVM hangs and we mark the message to skip extracting the text from the affected message the next attempt. Unfortunately, it kills all email processing on the server until the internal watchdogs kill -9 the application.

We have seen the issue for several months with different documents, but they are always Excel files. Some get complaints from Excel when opening but not all.

In addition to experiencing the problem on our Linux servers I have tested on OSX and experienced the same problems. I ran the Tika UI and select the affected file or run the CLI. The problem is the same. Tested with

  java -jar /path/to/tika-app-1.3.jar -t /path/to/file.xls

When running on multi-CPU machines there are two threads running at 100% every time.

I have attached a document that triggers the error. I have tested with 1.2 and 1.3 with the same result. Running 1.1 the text is accurately extracted.
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
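Editorial sketch, not part of the original report: the description above mentions an internal watchdog that has to kill -9 the whole application. One way to confine such a hang is to run the extraction in a child JVM and kill only that child after a deadline. The class name `TimedProcess`, the method `runWithTimeout`, and the timeout value are assumptions made for illustration; only the `java -jar /path/to/tika-app-1.3.jar -t /path/to/file.xls` command line comes from the report. The sketch relies on the standard `ProcessBuilder` and `Process` APIs (Java 8 or later) and has not been tested against the attached documents.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: runs a command in a child process and kills it on timeout. */
final class TimedProcess {
    private TimedProcess() {}

    /**
     * Starts the command, sends stdout and stderr to the given file, and waits
     * up to timeoutSeconds. Returns true if the child exited in time, false if
     * it had to be killed (or the waiting thread was interrupted).
     */
    static boolean runWithTimeout(List<String> command, File output, long timeoutSeconds) {
        Process process;
        try {
            process = new ProcessBuilder(command)
                    .redirectErrorStream(true)   // merge stderr into stdout
                    .redirectOutput(output)      // write to a file so a full pipe cannot block the child
                    .start();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        try {
            if (process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                return true;
            }
            // Deadline passed: forcibly terminate the child and wait until it is gone.
            process.destroyForcibly().waitFor();
            return false;
        } catch (InterruptedException e) {
            process.destroyForcibly();
            Thread.currentThread().interrupt();  // preserve the interrupt for the caller
            return false;
        }
    }
}
```

A caller could pass the command from the report as a list, for example `java`, `-jar`, `/path/to/tika-app-1.3.jar`, `-t`, `/path/to/file.xls`, and treat a `false` result as "skip this attachment". The parent JVM keeps running either way, which is the point of the isolation; the cost is one JVM start per document.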
[jira] [Created] (TIKA-1173) Upgrade to POI-3.10-beta2
Tim Allison created TIKA-1173:

Summary: Upgrade to POI-3.10-beta2
Key: TIKA-1173
URL: https://issues.apache.org/jira/browse/TIKA-1173
Project: Tika
Issue Type: Improvement
Reporter: Tim Allison
Assignee: Tim Allison
Priority: Trivial
[jira] [Resolved] (TIKA-1173) Upgrade to POI-3.10-beta2
[ https://issues.apache.org/jira/browse/TIKA-1173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison resolved TIKA-1173.

Resolution: Fixed

Upgrade to POI-3.10-beta2

Key: TIKA-1173
URL: https://issues.apache.org/jira/browse/TIKA-1173
Project: Tika
Issue Type: Improvement
Reporter: Tim Allison
Assignee: Tim Allison
Priority: Trivial
[jira] [Created] (TIKA-1174) Invalid characters in filtered PDF output
Matt Sheppard created TIKA-1174:

Summary: Invalid characters in filtered PDF output
Key: TIKA-1174
URL: https://issues.apache.org/jira/browse/TIKA-1174
Project: Tika
Issue Type: Bug
Environment: Mac OS X 10.8.5, Java 1.7u40 (but also seen on CentOS5)
Reporter: Matt Sheppard
Priority: Minor

The PDF document at http://www.logan.qld.gov.au/__data/assets/pdf_file/0010/9496/map_sp_1a_a4.pdf produces invalid characters in the output when filtered by Tika 1.4.

{noformat}
/opt/funnelback/mbin/java/bin/java -jar tika-app-1.4.jar map_sp_1c_a4.pdf | hea… …d -n 40
ERROR - Error: Could not parse predefined CMAP file for 'nullžf °-ˇžl,¡ì$1-UCS2'
?xml version=1.0 encoding=UTF-8?html xmlns=http://www.w3.org/1999/xhtml;
head
[snip]
pCycle network /p
p /p
pHILEY /p
{noformat}

Is there any proper way to avoid this, or is the best approach to strip such characters from Tika's output?
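Editorial sketch, not part of the original report: if stripping turns out to be the chosen approach, one conservative filter is to drop every code point that falls outside the XML 1.0 `Char` production. The class and method names below (`XmlCharFilter`, `stripInvalidXmlChars`) are hypothetical, and this is a post-processing step applied to extracted text, not something Tika is known to provide. Whether the characters seen in this report actually fall outside that range is not established by the report, so treat this as an illustration of the technique only.

```java
/** Hypothetical helper: removes code points that XML 1.0 does not allow. */
final class XmlCharFilter {
    private XmlCharFilter() {}

    /** True if the code point matches the XML 1.0 Char production. */
    static boolean isValidXmlChar(int codePoint) {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
                || (codePoint >= 0x20 && codePoint <= 0xD7FF)
                || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
                || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    /** Returns the text with every disallowed code point removed. */
    static String stripInvalidXmlChars(String text) {
        StringBuilder cleaned = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            // codePointAt keeps surrogate pairs together; an unpaired
            // surrogate comes back as itself and is rejected below.
            int codePoint = text.codePointAt(i);
            if (isValidXmlChar(codePoint)) {
                cleaned.appendCodePoint(codePoint);
            }
            i += Character.charCount(codePoint);
        }
        return cleaned.toString();
    }
}
```

Note that this only removes control characters, unpaired surrogates, and the non-characters U+FFFE and U+FFFF. Printable but wrong characters, such as those produced by a font with a bad CMAP, would pass through unchanged, so a filter like this addresses well-formedness rather than the correctness of the extracted text.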
[jira] [Updated] (TIKA-1174) Invalid characters in filtered PDF output
[ https://issues.apache.org/jira/browse/TIKA-1174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt Sheppard updated TIKA-1174:

Attachment: map_sp_1c_a4.pdf

Attached a copy of the PDF in question in case the site is changed.

Invalid characters in filtered PDF output

Key: TIKA-1174
URL: https://issues.apache.org/jira/browse/TIKA-1174
Project: Tika
Issue Type: Bug
Environment: Mac OS X 10.8.5, Java 1.7u40 (but also seen on CentOS5)
Reporter: Matt Sheppard
Priority: Minor
Attachments: map_sp_1c_a4.pdf