date:20130919

building tika from scratch without pulling 1.5-SNAPSHOT from the repository?

2013-09-19 Thread Allison, Timothy B.

All,
  Is there an easy way to build Tika from scratch without reliance on 
1.5-SNAPSHOT in the mvn repository and without building the components in the 
correct order and then manually loading them into a local mvn repository?
  At the main level, I've been using a simple 'mvn package'.

   Thank you.

  Best,

 Tim


Details below...

  I modified the poi version to beta2 in the pom under parsers.  When I package 
tika, mvn pulls beta2 from the repository when building parsers, and all is 
well.

However, when mvn gets to building the xmp module, it pulls beta1, and beta1 is 
winding up in the tika-app.jar.

  I manually removed beta1 from my local repository and tried to package with 
-o (offline) and I got the error below.

  It looks like mvn is pulling 1.5-SNAPSHOT from the maven repository and not 
from the local build.

[INFO] Copying 3 resources
[INFO] 
[ERROR] BUILD ERROR
[INFO] 
[INFO] Failed to resolve artifact.

Missing:
--
1) org.apache.poi:poi:jar:3.10-beta1

  Try downloading the file manually from the project website.

  Then, install it using the command:
  mvn install:install-file -DgroupId=org.apache.poi -DartifactId=poi 
-Dversion=3.10-beta1 -Dpackaging=jar -Dfile=/path/to/file

  Alternatively, if you host your own repository you can deploy the file there:
  mvn deploy:deploy-file -DgroupId=org.apache.poi -DartifactId=poi 
-Dversion=3.10-beta1 -Dpackaging=jar -Dfile=/path/to/file -Durl=[url] 
-DrepositoryId=[id]

  Path to dependency:
1) org.apache.tika:tika-xmp:bundle:1.5-SNAPSHOT
2) org.apache.tika:tika-parsers:jar:1.5-SNAPSHOT
3) org.apache.poi:poi:jar:3.10-beta1

RE: building tika from scratch without pulling 1.5-SNAPSHOT from the repository?

2013-09-19 Thread Allison, Timothy B.

Patience is an option, I guess. :)

Sorry for the non-information for those who know, but I learned that a commit 
kicks off Jenkins, and it looks like a nightly build in Jenkins makes it into a 
maven repository (https://issues.apache.org/jira/browse/TIKA-162). 
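
One possible workaround, sketched below as shell functions, is to run the `install` phase from the top-level directory instead of `package`, so that each freshly built module (including tika-parsers with the modified POI version) is copied into the local repository before later modules such as tika-xmp resolve it. This is an assumption about the cause, not something confirmed in this thread. The function names are hypothetical, and `dependency:tree` with `-Dincludes` is used only to check which POI version a module ends up with.

```shell
# Hypothetical helper: build every module from the top-level directory and
# install the resulting artifacts into the local repository (~/.m2 by default).
# Extra arguments (for example -o for offline mode) are passed through to mvn.
install_all_modules() {
  mvn clean install "$@"
}

# Hypothetical helper: print the dependency tree of the current module,
# filtered to org.apache.poi artifacts, to see which POI version is resolved.
show_poi_versions() {
  mvn dependency:tree -Dincludes=org.apache.poi
}
```

Run `install_all_modules` in the top-level directory, then `show_poi_versions` in the xmp module's directory. If beta1 still appears, the stale dependency is coming from somewhere other than the local build.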


[jira] [Reopened] (TIKA-1132) Parsing some XLS documents hangs entire JVM, requires kill -9

2013-09-19 Thread Tim Allison (JIRA)

[
https://issues.apache.org/jira/browse/TIKA-1132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Tim Allison reopened TIKA-1132:
---

Assignee: Tim Allison

Will add test case in Tika.

Parsing some XLS documents hangs entire JVM, requires kill -9
-

Key: TIKA-1132
URL: https://issues.apache.org/jira/browse/TIKA-1132
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.2, 1.3
Environment: Linux Suse:
java version 1.7.0
Java(TM) SE Runtime Environment (build 1.7.0-b147)
Java HotSpot(TM) 64-Bit Server VM (build 21.0-b17, mixed mode)
OSX 10.8.3:
java version 1.7.0_06
Java(TM) SE Runtime Environment (build 1.7.0_06-b24)
Java HotSpot(TM) 64-Bit Server VM (build 23.2-b09, mixed mode)
Reporter: Ryan Krueger
Assignee: Tim Allison
Fix For: 1.5

Attachments: mod3.xlsx, mod.xls

Some XLS documents hang the entire JVM. A control-C or regular kill won't
stop the JVM; a kill -9 is required.
We're running within an email server application, parsing documents to extract
the text of all attachments. When we hit a message with the affected
attachment, the entire JVM hangs, and we mark the message so that text
extraction is skipped on the next attempt. Unfortunately, it kills all email
processing on the server until the internal watchdogs kill -9 the application.
We have seen the issue for several months with different documents, but they
are always Excel files. Some, but not all, trigger complaints from Excel when
opened.
In addition to experiencing the problem on our Linux servers, I have tested on
OSX and experienced the same problems. Whether I run the Tika UI and select
the affected file or run the CLI, the problem is the same.
Tested with java -jar /path/to/tika-app-1.3.jar -t /path/to/file.xls
When running on multi-CPU machines there are two threads running at 100%
every time.
I have attached a document that triggers the error.
I have tested with 1.2 and 1.3 with the same result. With 1.1, the text is
accurately extracted.
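
For reproducing the hang without having to kill the JVM by hand, the CLI call above can be wrapped in a hard time limit. The sketch below assumes GNU coreutils `timeout` is available; the function name and the 60-second limit in the usage comment are illustrative, not from the report.

```shell
# Hypothetical helper: run a command, and send it SIGKILL (the same signal as
# kill -9) if it is still running after the given number of seconds.
# Usage: run_with_hard_timeout SECONDS COMMAND [ARGS...]
run_with_hard_timeout() {
  limit="$1"
  shift
  timeout -s KILL "$limit" "$@"
}

# Example (paths as in the report):
# run_with_hard_timeout 60 java -jar /path/to/tika-app-1.3.jar -t /path/to/file.xls
```

A non-zero exit status after the limit has passed suggests the process had to be killed. Running the extraction in a separate process like this may also keep a hung parse from blocking the calling application.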

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Created] (TIKA-1173) Upgrade to POI-3.10-beta2

2013-09-19 Thread Tim Allison (JIRA)

Tim Allison created TIKA-1173:
-

 Summary: Upgrade to POI-3.10-beta2
 Key: TIKA-1173
 URL: https://issues.apache.org/jira/browse/TIKA-1173
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Assignee: Tim Allison
Priority: Trivial





[jira] [Resolved] (TIKA-1173) Upgrade to POI-3.10-beta2

2013-09-19 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-1173.
---

Resolution: Fixed

 Upgrade to POI-3.10-beta2
 -

 Key: TIKA-1173
 URL: https://issues.apache.org/jira/browse/TIKA-1173
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Assignee: Tim Allison
Priority: Trivial




[jira] [Created] (TIKA-1174) Invalid characters in filtered PDF output

2013-09-19 Thread Matt Sheppard (JIRA)

Matt Sheppard created TIKA-1174:
---

 Summary: Invalid characters in filtered PDF output
 Key: TIKA-1174
 URL: https://issues.apache.org/jira/browse/TIKA-1174
 Project: Tika
  Issue Type: Bug
 Environment: Mac OS X 10.8.5, Java 1.7u40 (but also seen on CentOS5)
Reporter: Matt Sheppard
Priority: Minor


The PDF document at 
http://www.logan.qld.gov.au/__data/assets/pdf_file/0010/9496/map_sp_1a_a4.pdf 
produces invalid characters in the output when filtered by Tika 1.4.

{noformat}

/opt/funnelback/mbin/java/bin/java -jar tika-app-1.4.jar map_sp_1c_a4.pdf | head -n 40
ERROR - Error: Could not parse predefined CMAP file for 'nullžf °-ˇžl,¡ì$1-UCS2'
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>


[snip]

<p>Cycle network
</p>
<p>
</p>
<p>HILEY

</p>
{noformat}

Is there any proper way to avoid this, or is the best approach to strip such 
characters from Tika's output?
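
If stripping turns out to be the practical route, a minimal sketch is to filter Tika's output through `tr`. It assumes the offending characters are the C0 control characters that XML 1.0 does not allow (everything below 0x20 except tab, line feed, and carriage return). Whether that matches the characters this PDF produces is not established here, and the function name is hypothetical.

```shell
# Hypothetical filter: read text on stdin and delete the C0 control characters
# that XML 1.0 forbids, keeping tab (011), line feed (012) and carriage
# return (015). LC_ALL=C makes tr work on bytes, so multi-byte UTF-8
# sequences (all bytes >= 0x80) pass through untouched.
strip_xml_control_chars() {
  LC_ALL=C tr -d '\000-\010\013\014\016-\037'
}

# Example:
# java -jar tika-app-1.4.jar map_sp_1c_a4.pdf | strip_xml_control_chars
```

This only removes characters; it does not address why the parser emits them in the first place.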


[jira] [Updated] (TIKA-1174) Invalid characters in filtered PDF output

2013-09-19 Thread Matt Sheppard (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Sheppard updated TIKA-1174:


Attachment: map_sp_1c_a4.pdf

Attached a copy of the PDF in question in case the site is changed.

 Invalid characters in filtered PDF output
 -

 Key: TIKA-1174
 URL: https://issues.apache.org/jira/browse/TIKA-1174
 Project: Tika
  Issue Type: Bug
 Environment: Mac OS X 10.8.5, Java 1.7u40 (but also seen on CentOS5)
Reporter: Matt Sheppard
Priority: Minor
 Attachments: map_sp_1c_a4.pdf


 The PDF document at 
 http://www.logan.qld.gov.au/__data/assets/pdf_file/0010/9496/map_sp_1a_a4.pdf 
 produces invalid characters in the output when filtered by Tika 1.4.
 {noformat}
 
 /opt/funnelback/mbin/java/bin/java -jar tika-app-1.4.jar map_sp_1c_a4.pdf | head -n 40
 ERROR - Error: Could not parse predefined CMAP file for 'nullžf °-ˇžl,¡ì$1-UCS2'
 <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
 <head>
 [snip]
 <p>Cycle network
 </p>
 <p>
 </p>
 <p>HILEY
 </p>
 {noformat}
 Is there any proper way to avoid this, or is the best approach to strip such 
 characters from Tika's output?

