building tika from scratch without pulling 1.5-SNAPSHOT from the repository?

2013-09-19 Thread Allison, Timothy B.
All,
  Is there an easy way to build Tika from scratch without reliance on 
1.5-SNAPSHOT in the mvn repository and without building the components in the 
correct order and then manually loading them into a local mvn repository?
  At the main level, I've been using a simple 'mvn package'.

   Thank you.

  Best,

 Tim


Details below...

  I changed the POI version to 3.10-beta2 in the pom under tika-parsers.  When I 
package Tika, mvn pulls beta2 from the repository when building parsers, and 
all is well.

However, when mvn gets to the xmp module, it pulls beta1, and beta1 winds up 
in tika-app.jar.

  I manually removed beta1 from my local repository and tried to package with 
-o (offline), and I got the error below.

  It looks like mvn is pulling tika-parsers 1.5-SNAPSHOT from the maven 
repository and not from the local build.

[INFO] Copying 3 resources
[INFO] 
[ERROR] BUILD ERROR
[INFO] 
[INFO] Failed to resolve artifact.

Missing:
--
1) org.apache.poi:poi:jar:3.10-beta1

  Try downloading the file manually from the project website.

  Then, install it using the command:
  mvn install:install-file -DgroupId=org.apache.poi -DartifactId=poi -Dversion=3.10-beta1 -Dpackaging=jar -Dfile=/path/to/file

  Alternatively, if you host your own repository you can deploy the file there:
  mvn deploy:deploy-file -DgroupId=org.apache.poi -DartifactId=poi -Dversion=3.10-beta1 -Dpackaging=jar -Dfile=/path/to/file -Durl=[url] -DrepositoryId=[id]

  Path to dependency:
1) org.apache.tika:tika-xmp:bundle:1.5-SNAPSHOT
2) org.apache.tika:tika-parsers:jar:1.5-SNAPSHOT
3) org.apache.poi:poi:jar:3.10-beta1
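
One approach that may help with the question above is to run 'mvn install' rather than 'mvn package' from the top-level checkout, so that the locally built 1.5-SNAPSHOT modules (with the modified pom) land in the local repository where later modules can resolve them. The sketch below assumes a standard Maven setup; the function name rebuild_tika_locally is hypothetical and only wraps two ordinary Maven invocations. The second command uses the dependency plugin's tree goal to show which POI version each module actually resolves.

```shell
# Hypothetical helper, not part of the Tika build: run from the top-level checkout.
rebuild_tika_locally() {
  # Build every module and install the fresh 1.5-SNAPSHOT artifacts into
  # the local repository (~/.m2/repository by default).
  mvn clean install || return 1
  # Report which org.apache.poi artifacts (and versions) each module resolves.
  mvn dependency:tree -Dincludes=org.apache.poi
}

# Usage (assumption: the current directory is the Tika source root):
#   rebuild_tika_locally
```

If the tree still lists 3.10-beta1 under tika-xmp, the stale dependency is probably coming from a tika-parsers pom other than the one that was edited.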




RE: building tika from scratch without pulling 1.5-SNAPSHOT from the repository?

2013-09-19 Thread Allison, Timothy B.
Patience is an option, I guess. :)

Sorry for the non-information for those who already know, but I learned that a 
commit kicks off Jenkins, and it looks like the nightly Jenkins build makes it 
into a maven repository (https://issues.apache.org/jira/browse/TIKA-162).

-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org] 
Sent: Thursday, September 19, 2013 2:20 PM
To: dev@tika.apache.org
Subject: building tika from scratch without pulling 1.5-SNAPSHOT from the 
repository?





[jira] [Reopened] (TIKA-1132) Parsing some XLS documents hangs entire JVM, requires kill -9

2013-09-19 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison reopened TIKA-1132:
---

  Assignee: Tim Allison

Will add test case in Tika.

 Parsing some XLS documents hangs entire JVM, requires kill -9
 -

 Key: TIKA-1132
 URL: https://issues.apache.org/jira/browse/TIKA-1132
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.2, 1.3
 Environment: Linux Suse:
 java version 1.7.0
 Java(TM) SE Runtime Environment (build 1.7.0-b147)
 Java HotSpot(TM) 64-Bit Server VM (build 21.0-b17, mixed mode)
 OSX 10.8.3:
 java version 1.7.0_06
 Java(TM) SE Runtime Environment (build 1.7.0_06-b24)
 Java HotSpot(TM) 64-Bit Server VM (build 23.2-b09, mixed mode)
Reporter: Ryan Krueger
Assignee: Tim Allison
 Fix For: 1.5

 Attachments: mod3.xlsx, mod.xls


 Some XLS documents hang the entire JVM.  A control-C or regular kill won't 
 stop the JVM; a kill -9 is required.
 We're running within an email server application, parsing documents to 
 extract the text of all attachments.  When we hit a message with an affected 
 attachment, the entire JVM hangs, and we mark the message so that text 
 extraction is skipped on the next attempt.  Unfortunately, the hang stops all 
 email processing on the server until the internal watchdogs kill -9 the 
 application.
 We have seen the issue for several months with different documents, but they 
 are always Excel files.  Some draw complaints from Excel when opened, but not 
 all.
 In addition to experiencing the problem on our Linux servers, I have tested 
 on OSX and experienced the same problem.  Whether I run the Tika UI and 
 select the affected file or run the CLI, the problem is the same.
 Tested with java -jar /path/to/tika-app-1.3.jar -t /path/to/file.xls
 When running on multi-CPU machines, there are two threads running at 100% 
 every time.
 I have attached a document that triggers the error.
 I have tested with 1.2 and 1.3 with the same result.  With 1.1, the text is 
 extracted accurately.
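
 Until the parser itself is fixed, one way to limit the damage described above is to run the extraction in a separate process with a hard deadline, so that a hang takes down only that child process. The following is a minimal sketch, assuming GNU coreutils `timeout` is available; the function name run_with_deadline and the 60-second limit are illustrative assumptions, not anything Tika provides.

```shell
# Hypothetical wrapper, not part of Tika: run a command in its own process and
# send SIGKILL if it is still running after the given number of seconds.
# Relies on GNU coreutils `timeout`; when the command is killed this way the
# exit status is 137 (128 + 9).
run_with_deadline() {
  secs=$1
  shift
  timeout -s KILL "$secs" "$@"
}

# Usage with the reporter's command line and an assumed 60-second limit:
#   run_with_deadline 60 java -jar /path/to/tika-app-1.3.jar -t /path/to/file.xls
```

 SIGKILL is used here because the report states that a regular kill does not stop the hung JVM.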

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (TIKA-1173) Upgrade to POI-3.10-beta2

2013-09-19 Thread Tim Allison (JIRA)
Tim Allison created TIKA-1173:
-

 Summary: Upgrade to POI-3.10-beta2
 Key: TIKA-1173
 URL: https://issues.apache.org/jira/browse/TIKA-1173
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Assignee: Tim Allison
Priority: Trivial






[jira] [Resolved] (TIKA-1173) Upgrade to POI-3.10-beta2

2013-09-19 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-1173.
---

Resolution: Fixed






[jira] [Created] (TIKA-1174) Invalid characters in filtered PDF output

2013-09-19 Thread Matt Sheppard (JIRA)
Matt Sheppard created TIKA-1174:
---

 Summary: Invalid characters in filtered PDF output
 Key: TIKA-1174
 URL: https://issues.apache.org/jira/browse/TIKA-1174
 Project: Tika
  Issue Type: Bug
 Environment: Mac OS X 10.8.5, Java 1.7u40 (but also seen on CentOS5)
Reporter: Matt Sheppard
Priority: Minor


The PDF document at 
http://www.logan.qld.gov.au/__data/assets/pdf_file/0010/9496/map_sp_1a_a4.pdf 
produces invalid characters in the output when filtered by Tika 1.4.

{noformat}

/opt/funnelback/mbin/java/bin/java -jar tika-app-1.4.jar map_sp_1c_a4.pdf | head -n 40
ERROR - Error: Could not parse predefined CMAP file for 'nullžf °-ˇžl,¡ì$1-UCS2'
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>


[snip]

<p>Cycle network
</p>
<p>
</p>
<p>HILEY

</p>
{noformat}

Is there any proper way to avoid this, or is the best approach to strip such 
characters from Tika's output?
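
If stripping turns out to be the chosen route, it can be done as a post-processing filter. The sketch below assumes the offending characters are C0 control bytes, which XML 1.0 forbids apart from tab, line feed and carriage return; the report does not say exactly which characters appear, so this is a guess. The function name strip_xml_ctrl is hypothetical. It does not handle other illegal code points such as U+FFFE or U+FFFF.

```shell
# Hypothetical filter, not part of Tika: delete the C0 control bytes that
# XML 1.0 does not allow, keeping tab (\011), LF (\012) and CR (\015).
# LC_ALL=C makes tr work on raw bytes, so multi-byte UTF-8 text passes through.
strip_xml_ctrl() {
  LC_ALL=C tr -d '\000-\010\013\014\016-\037'
}

# Usage (assumption: tika-app-1.4.jar and the PDF are in the current directory):
#   java -jar tika-app-1.4.jar map_sp_1c_a4.pdf | strip_xml_ctrl | head -n 40
```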



[jira] [Updated] (TIKA-1174) Invalid characters in filtered PDF output

2013-09-19 Thread Matt Sheppard (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Sheppard updated TIKA-1174:


Attachment: map_sp_1c_a4.pdf

Attached a copy of the PDF in question in case the site is changed.

