[jira] [Commented] (TIKA-1675) please avoid xmlbeans dependency

2015-07-07 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14617457#comment-14617457
 ] 

Michael McCandless commented on TIKA-1675:
--

bq. If the project is dead and not fixing packaging bugs like this, I think it's 
irresponsible to depend on it.

+1

Maybe POI could/should absorb the parts of xmlbeans it depends on?

 please avoid xmlbeans dependency
 

 Key: TIKA-1675
 URL: https://issues.apache.org/jira/browse/TIKA-1675
 Project: Tika
  Issue Type: Bug
Reporter: Robert Muir

 This dependency (e.g. jar file) is fundamentally broken... XMLBEANS-499
 Is there an alternative that could be used?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (TIKA-1628) ExternalParser.check should return false if it hits SecurityException

2015-05-12 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved TIKA-1628.
--
Resolution: Pending Closed

Thanks [~gagravarr] and [~thetaphi]

 ExternalParser.check should return false if it hits SecurityException
 -

 Key: TIKA-1628
 URL: https://issues.apache.org/jira/browse/TIKA-1628
 Project: Tika
  Issue Type: Bug
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 1.9

 Attachments: TIKA-1628.patch


 If you run Tika with a Java security manager that blocks execution of 
 external processes, ExternalParser.check throws SecurityException, but I 
 think it should just return false?
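
A minimal sketch of the behavior being asked for, not the actual ExternalParser code: the availability check treats a SecurityException like any other failure to launch and reports false. The class name ExternalCheckSketch, the Launcher interface, and the "exit code 0 means available" convention are assumptions made for illustration.

```java
import java.io.IOException;

final class ExternalCheckSketch {

    /** Hypothetical abstraction over "run this command and return its exit code". */
    interface Launcher {
        int run(String... command) throws IOException, InterruptedException;
    }

    /** Launcher backed by ProcessBuilder; output is inherited so the child cannot block on a full pipe. */
    static final Launcher PROCESS_BUILDER =
            command -> new ProcessBuilder(command).inheritIO().start().waitFor();

    /** Returns true only if the command could be started and exited with 0. */
    static boolean check(Launcher launcher, String... command) {
        try {
            return launcher.run(command) == 0;
        } catch (SecurityException e) {
            // Process execution was blocked: report "not available" instead of propagating.
            return false;
        } catch (IOException e) {
            // The command does not exist or could not be started.
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Passing the launcher in keeps the sketch testable without installing a security manager; a caller would normally pass ExternalCheckSketch.PROCESS_BUILDER.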





[jira] [Commented] (TIKA-1544) empty lines are not preserved

2015-02-06 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14309956#comment-14309956
 ] 

Michael McCandless commented on TIKA-1544:
--

bq. Michael McCandless, is the fix this simple?

Hmm, maybe :)

It's strange we are calling endParagraph when inParagraph is false?  Maybe we 
are missing a lazyStartParagraph somewhere?
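
To make the question concrete, here is a simplified sketch of the state involved. The names inParagraph, lazyStartParagraph and endParagraph come from the discussion; everything else (the class, the StringBuilder output, the newline per paragraph) is a hypothetical stand-in and not the real RTF TextExtractor.

```java
final class ParagraphTrackerSketch {
    private final StringBuilder out = new StringBuilder();
    private boolean inParagraph;

    /** Opens a paragraph on demand, the first time content or an end mark arrives. */
    private void lazyStartParagraph() {
        if (!inParagraph) {
            inParagraph = true;
        }
    }

    void addText(String text) {
        lazyStartParagraph();
        out.append(text);
    }

    /** Handles a paragraph-end mark. */
    void endParagraph() {
        // Treat an end mark with no open paragraph as an empty paragraph,
        // so that consecutive end marks each produce a line break.
        lazyStartParagraph();
        out.append('\n');
        inParagraph = false;
    }

    String text() {
        return out.toString();
    }
}
```

With this shape, two paragraph-end marks in a row yield an empty line between the surrounding paragraphs, which is what the reporter expects.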

 empty lines are not preserved
 -

 Key: TIKA-1544
 URL: https://issues.apache.org/jira/browse/TIKA-1544
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.6
 Environment: Windows 8, Java 1.8
Reporter: mortee
Priority: Minor
 Attachments: preserve_new_lines_in_rtf.patch, testRTFNewlines.rtf


 I'm trying to extract the text content from RTF documents. The files contain 
 empty lines (two or more consecutive paragraph-end marks), on which the 
 further processing relies to tell apart different parts of the text. But 
 unfortunately Tika (with --text switch) eliminates all those empty lines, 
 instead of preserving them.





[jira] [Commented] (TIKA-1544) empty lines are not preserved

2015-02-06 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14310014#comment-14310014
 ] 

Michael McCandless commented on TIKA-1544:
--

bq. I have hesitation about changing the original logic, because I'm not sure 
why if(inParagraph) was added...maybe for properly closing formatting?

I don't remember why...

bq. would be more general and handle the formatting stuff properly?

I think that may be safer?  In case there was pending styling that needs to be 
closed ...

I think if Tika's tests pass with that change you should commit!



[jira] [Commented] (TIKA-1305) New list processing changes appear to be causing RTFParser exception

2014-05-30 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14013647#comment-14013647
 ] 

Michael McCandless commented on TIKA-1305:
--

Net/net the RTF is corrupted right?

But we want to make a best-effort to gloss over the corruption and still 
extract what we can?  I think that makes sense.

+1 for the simple solution, maybe w/ a comment explaining it's best effort when 
we see a corrupted doc?
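
Roughly what such a best-effort guard could look like. This is a hypothetical helper, not the committed fix; the names ListLevelLookupSketch and levelOrNull and the String[] table are made up for illustration.

```java
final class ListLevelLookupSketch {

    /** Returns the entry at index, or null when the index cannot be trusted. */
    static String levelOrNull(String[] levels, int index) {
        // Best effort: a corrupted document can produce an index such as -1.
        // Skip the lookup rather than let ArrayIndexOutOfBoundsException escape
        // and abort extraction of the rest of the text.
        if (levels == null || index < 0 || index >= levels.length) {
            return null;
        }
        return levels[index];
    }
}
```

The caller then has to treat null as "no list information available" and carry on with plain text.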

 New list processing changes appear to be causing RTFParser exception
 

 Key: TIKA-1305
 URL: https://issues.apache.org/jira/browse/TIKA-1305
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.6
 Environment: Mac OSX 10.7.5
 Tika 1.6-SNAPSHOT
Reporter: Chris Bamford
Priority: Minor
  Labels: newbie
 Attachments: rtfparsererror_2.rtf


 Some RTFs cause RTFParser to throw a RuntimeException:
 Unexpected RuntimeException from org.apache.tika.parser.rtf.RTFParser@425e60f2
 When tracing in the debugger (surfaces in CompositeParser.parse() where it 
 catches the RuntimeException, line 244 in my copy), the exception (e) is:
 java.lang.ArrayIndexOutOfBoundsException: -1
 A committer (Tim Allison) believes that it is being caused by recent list 
 processing changes.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (TIKA-1078) TikaCLI: invalid characters in embedded document name causes FNFE when trying to save

2014-01-14 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved TIKA-1078.
--

Resolution: Fixed

Thanks Stefano, I made one small change (added generics: HashSet<Character>) 
and committed.

 TikaCLI: invalid characters in embedded document name causes FNFE when trying 
 to save
 -

 Key: TIKA-1078
 URL: https://issues.apache.org/jira/browse/TIKA-1078
 Project: Tika
  Issue Type: Bug
  Components: cli, parser
Reporter: Michael McCandless
 Fix For: 1.5

 Attachments: T-DS_Excel2003-PPT2003_1.xls, tika-1078-2.patch, 
 tika-1078.patch


 Attached document hits this on Windows:
 {noformat}
 C:\java.exe -jar tika-app-1.3.jar -z -x c:\data\idit\T-DS_Excel2003-PPT2003_1.xls
 Extracting 'file0.png' (image/png) to .\file0.png
 Extracting 'file1.emf' (application/x-emf) to .\file1.emf
 Extracting 'file2.jpg' (image/jpeg) to .\file2.jpg
 Extracting 'file3.emf' (application/x-emf) to .\file3.emf
 Extracting 'file4.wmf' (application/x-msmetafile) to .\file4.wmf
 Extracting 'MBD0016BDE4/?£☺.bin' (application/octet-stream) to .\MBD0016BDE4\?£☺.bin
 Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.OfficeParser@75f875f8
     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:248)
     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
     at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:139)
     at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:415)
     at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:109)
 Caused by: java.io.FileNotFoundException: .\MBD0016BDE4\?£☺.bin (The filename, directory name, or volume label syntax is incorrect.)
     at java.io.FileOutputStream.<init>(FileOutputStream.java:205)
     at java.io.FileOutputStream.<init>(FileOutputStream.java:156)
     at org.apache.tika.cli.TikaCLI$FileEmbeddedDocumentExtractor.parseEmbedded(TikaCLI.java:722)
     at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:201)
     at org.apache.tika.parser.microsoft.ExcelExtractor.parse(ExcelExtractor.java:158)
     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:194)
     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
     ... 5 more
 {noformat}
 TikaCLI manages to create the sub-directory, but because the embedded 
 fileName has invalid (for Windows) characters, it fails.
 On Linux it runs fine.
 I think somehow ... we have to sanitize the embedded file name ...



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (TIKA-1078) TikaCLI: invalid characters in embedded document name causes FNFE when trying to save

2014-01-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13869205#comment-13869205
 ] 

Michael McCandless commented on TIKA-1078:
--

Thanks Stefano!

Can you fix the license header on the two new files to match the current 
sources?  Thanks.

Also, we don't normally include @author tags.

Maybe use a HashSet instead of an array for RESERVED, so it's not an O(N) 
lookup per character?  Also, since you check for < ' ', you shouldn't need any 
entries < 0x20?

Sometimes (rarely?), attachment filenames have their own sub-directories, and 
the code today will happily .mkdirs those subdirectories, but it looks like 
with this patch we now replace / and \ with their hex equivalents, instead?  I 
think that's OK...
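
To illustrate the HashSet suggestion, a minimal sketch, not the attached patch. The class name, the exact contents of RESERVED, and the %XX output format are assumptions; the comment above only establishes that reserved characters (including / and \) and anything below ' ' get replaced by hex equivalents.

```java
import java.util.HashSet;
import java.util.Set;

final class FileNameSanitizerSketch {

    /** Assumed set of characters to escape; '%' is included so the escaping stays unambiguous. */
    private static final Set<Character> RESERVED = new HashSet<>();
    static {
        for (char c : "<>:\"/\\|?*%".toCharArray()) {
            RESERVED.add(c);
        }
    }

    /** Replaces control characters and reserved characters with %XX. */
    static String sanitize(String name) {
        StringBuilder sb = new StringBuilder(name.length());
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            // The c < ' ' test covers every control character, so RESERVED
            // needs no entries below 0x20; contains() is a constant-time lookup.
            if (c < ' ' || RESERVED.contains(c)) {
                sb.append('%').append(String.format("%02X", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

Because / and \ are escaped as well, an embedded name that carries its own sub-directory ends up as a single flat file name.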



[jira] [Commented] (TIKA-1211) OpenDocument (ODF) parser produces multiple startDocument() events

2013-12-17 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13850567#comment-13850567
 ] 

Michael McCandless commented on TIKA-1211:
--

+1 to fix XHTMLContentHandler to allow only one startDocument event.
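
As an illustration of the idea only (this is not XHTMLContentHandler, and the class name is made up), a SAX filter that forwards the first startDocument event and swallows any later ones could look like this. It relies only on XMLFilterImpl from the JDK.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

final class SingleStartDocumentSketch extends XMLFilterImpl {
    private boolean documentStarted;

    SingleStartDocumentSketch(ContentHandler downstream) {
        // XMLFilterImpl forwards every SAX event to this handler by default.
        setContentHandler(downstream);
    }

    @Override
    public void startDocument() throws SAXException {
        // Forward only the first startDocument; later ones are dropped.
        if (!documentStarted) {
            documentStarted = true;
            super.startDocument();
        }
    }
}
```

All other events pass through unchanged, so the downstream consumer sees one document start regardless of how many sub-parsers feed the output.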

 OpenDocument (ODF) parser produces multiple startDocument() events
 --

 Key: TIKA-1211
 URL: https://issues.apache.org/jira/browse/TIKA-1211
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
Reporter: Uwe Schindler

 Related to SOLR-4809: Solr receives multiple startDocument events when 
 parsing OpenDocument files.
 The parser already prevents multiple endDocuments, but not multiple 
 startDocuments.
 The bug was introduced when we added parsing of content.xml and meta.xml 
 (TIKA-736); both feed elements to the XHTML output, so we get multiple 
 start/endDocuments.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Resolved] (TIKA-1192) ArrayIndexOutOfBoundsException: 9 parsing RTF

2013-11-09 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved TIKA-1192.
--

   Resolution: Fixed
Fix Version/s: 1.5

Thanks Dave, I just committed this.

 ArrayIndexOutOfBoundsException: 9 parsing RTF
 -

 Key: TIKA-1192
 URL: https://issues.apache.org/jira/browse/TIKA-1192
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
Reporter: Dave Kincaid
Assignee: Michael McCandless
  Labels: rtf
 Fix For: 1.5

 Attachments: testRTFListOverride.rtf, tika-1192-test-case.patch, 
 tika-1192.patch


 When trying to parse an RTF file I'm getting the following exception. I am 
 not able to attach the file for privacy reasons:
 {noformat}
 java.lang.ArrayIndexOutOfBoundsException: 9
   TextExtractor.java:872 org.apache.tika.parser.rtf.TextExtractor.processControlWord
   TextExtractor.java:566 org.apache.tika.parser.rtf.TextExtractor.parseControlWord
   TextExtractor.java:492 org.apache.tika.parser.rtf.TextExtractor.parseControlToken
   TextExtractor.java:459 org.apache.tika.parser.rtf.TextExtractor.extract
   TextExtractor.java:448 org.apache.tika.parser.rtf.TextExtractor.extract
   RTFParser.java:56 org.apache.tika.parser.rtf.RTFParser.parse
   (Unknown Source) sun.reflect.NativeMethodAccessorImpl.invoke0
   NativeMethodAccessorImpl.java:57 sun.reflect.NativeMethodAccessorImpl.invoke
   DelegatingMethodAccessorImpl.java:43 sun.reflect.DelegatingMethodAccessorImpl.invoke
   Method.java:606 java.lang.reflect.Method.invoke
   Reflector.java:93 clojure.lang.Reflector.invokeMatchingMethod
   Reflector.java:28 clojure.lang.Reflector.invokeInstanceMethod
   tika_parser.clj:20 rtf-parser.tika-parser/parse
   form-init2921349737948661927.clj:1 rtf-parser.tika-parser/eval4200
   Compiler.java:6619 clojure.lang.Compiler.eval
   Compiler.java:6582 clojure.lang.Compiler.eval
   core.clj:2852 clojure.core/eval
   main.clj:259 clojure.main/repl[fn]
   main.clj:259 clojure.main/repl[fn]
   main.clj:277 clojure.main/repl[fn]
   main.clj:277 clojure.main/repl
   RestFn.java:1096 clojure.lang.RestFn.invoke
   interruptible_eval.clj:56 clojure.tools.nrepl.middleware.interruptible-eval/evaluate[fn]
   AFn.java:159 clojure.lang.AFn.applyToHelper
   AFn.java:151 clojure.lang.AFn.applyTo
   core.clj:617 clojure.core/apply
   core.clj:1788 clojure.core/with-bindings*
   RestFn.java:425 clojure.lang.RestFn.invoke
   interruptible_eval.clj:41 clojure.tools.nrepl.middleware.interruptible-eval/evaluate
   interruptible_eval.clj:171 clojure.tools.nrepl.middleware.interruptible-eval/interruptible-eval[fn]
   core.clj:2330 clojure.core/comp[fn]
   interruptible_eval.clj:138 clojure.tools.nrepl.middleware.interruptible-eval/run-next[fn]
   AFn.java:24 clojure.lang.AFn.run
   ThreadPoolExecutor.java:1145 java.util.concurrent.ThreadPoolExecutor.runWorker
   ThreadPoolExecutor.java:615 java.util.concurrent.ThreadPoolExecutor$Worker.run
   Thread.java:724 java.lang.Thread.run
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Assigned] (TIKA-1192) ArrayIndexOutOfBoundsException: 9 parsing RTF

2013-11-08 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned TIKA-1192:


Assignee: Michael McCandless



[jira] [Commented] (TIKA-1192) ArrayIndexOutOfBoundsException: 9 parsing RTF

2013-11-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13817472#comment-13817472
 ] 

Michael McCandless commented on TIKA-1192:
--

bq. Yes, when that fragment is part of an RTF file it provokes the exception, 
so if you could put it into a valid RTF file it should throw the exception.

Hmm, I've tried that (quickly) but so far cannot provoke it ... I'd really 
prefer to commit a test case along w/ this fix ...



[jira] [Commented] (TIKA-1192) ArrayIndexOutOfBoundsException: 9 parsing RTF

2013-11-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13817512#comment-13817512
 ] 

Michael McCandless commented on TIKA-1192:
--

Thanks Dave.



[jira] [Commented] (TIKA-1181) RTFParser not keeping HTML font colors and underscore tags.

2013-10-07 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13788163#comment-13788163
 ] 

Michael McCandless commented on TIKA-1181:
--

The RTFParser currently only carries bold and italic styling through; I guess 
we could add underline.

It doesn't try to preserve any colors.

Really the goal is (mostly?) text extraction, not precise formatting of the 
extracted text, so I think colors/styling are somewhat low priority.  But I 
suppose underlining/colors can convey information about how important that text 
was, and so could be useful to stages (like indexing with Lucene) after text 
extraction.


 RTFParser not keeping HTML font colors and underscore tags.
 ---

 Key: TIKA-1181
 URL: https://issues.apache.org/jira/browse/TIKA-1181
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
 Environment: Windows server 2008
Reporter: Leo
  Labels: RTFParser

 Hi,
 I'm having problems with this code. It does not put the font colors and 
 underscore <u></u> tags in the HTML from the RTF string. Is there anything 
 I can do to put them there? 
 Code:
 InputStream in = new ByteArrayInputStream(rtfString.getBytes("UTF-8"));
 org.apache.tika.parser.rtf.RTFParser parser = new org.apache.tika.parser.rtf.RTFParser();
 Metadata metadata = new Metadata();
 StringWriter sw = new StringWriter();
 SAXTransformerFactory factory = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
 TransformerHandler handler = factory.newTransformerHandler();
 handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
 handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "no");
 handler.setResult(new StreamResult(sw));
 parser.parse(in, handler, metadata, new ParseContext());
 String xhtml = sw.toString();
 xhtml = xhtml.replaceAll("\r\n", "<br>\r\n");
 Thanks for looking at it.
 Leo



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (TIKA-1143) Fails to parse some PPT file

2013-07-03 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13698837#comment-13698837
 ] 

Michael McCandless commented on TIKA-1143:
--

Are you able to extract text from the rest of the document?

Those logged exceptions look like warnings, indicating that the summary 
information failed to parse ...

 Fails to parse some PPT file
 

 Key: TIKA-1143
 URL: https://issues.apache.org/jira/browse/TIKA-1143
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
Reporter: Vincent Massol
 Attachments: XWikiIExpoPresentation.ppt


 See also http://jira.xwiki.org/browse/XWIKI-9308
 Here's what I get with the attached file:
 {noformat}
 2013-07-03 11:52:45,332 [XWiki Solr index thread] WARN  a.t.p.m.AbstractPOIFSExtractor - Ignoring unexpected exception while parsing summary entry DocumentSummaryInformation 
 java.lang.ClassCastException: [B cannot be cast to java.lang.String
   at org.apache.poi.hpsf.DocumentSummaryInformation.getCategory(DocumentSummaryInformation.java:78) ~[poi-3.9.jar:3.9]
   at org.apache.tika.parser.microsoft.SummaryExtractor.parse(SummaryExtractor.java:143) [tika-parsers-1.4.jar:na]
   at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaryEntryIfExists(SummaryExtractor.java:88) [tika-parsers-1.4.jar:na]
   at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaries(SummaryExtractor.java:73) [tika-parsers-1.4.jar:na]
   at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:170) [tika-parsers-1.4.jar:na]
   at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161) [tika-parsers-1.4.jar:na]
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) [tika-core-1.4.jar:na]
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) [tika-core-1.4.jar:na]
   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) [tika-core-1.4.jar:na]
   at org.apache.tika.Tika.parseToString(Tika.java:380) [tika-core-1.4.jar:na]
   at org.xwiki.search.solr.internal.metadata.AttachmentSolrMetadataExtractor.getContentAsText(AttachmentSolrMetadataExtractor.java:130) [xwiki-platform-search-solr-api-5.2-20130702.194010-10.jar:na]
   at org.xwiki.search.solr.internal.metadata.AttachmentSolrMetadataExtractor.setLocaleAndContentFields(AttachmentSolrMetadataExtractor.java:97) [xwiki-platform-search-solr-api-5.2-20130702.194010-10.jar:na]
   at org.xwiki.search.solr.internal.metadata.AttachmentSolrMetadataExtractor.setFieldsInternal(AttachmentSolrMetadataExtractor.java:79) [xwiki-platform-search-solr-api-5.2-20130702.194010-10.jar:na]
   at org.xwiki.search.solr.internal.metadata.AbstractSolrMetadataExtractor.getSolrDocument(AbstractSolrMetadataExtractor.java:114) [xwiki-platform-search-solr-api-5.2-20130702.194010-10.jar:na]
   at org.xwiki.search.solr.internal.DefaultSolrIndexer.getSolrDocument(DefaultSolrIndexer.java:465) [xwiki-platform-search-solr-api-5.2-20130702.194010-10.jar:na]
   at org.xwiki.search.solr.internal.DefaultSolrIndexer.processBatch(DefaultSolrIndexer.java:378) [xwiki-platform-search-solr-api-5.2-20130702.194010-10.jar:na]
   at org.xwiki.search.solr.internal.DefaultSolrIndexer.runInternal(DefaultSolrIndexer.java:353) [xwiki-platform-search-solr-api-5.2-20130702.194010-10.jar:na]
   at com.xpn.xwiki.util.AbstractXWikiRunnable.run(AbstractXWikiRunnable.java:121) [xwiki-platform-oldcore-5.2-20130702.190754-22.jar:na]
   at java.lang.Thread.run(Thread.java:680) [na:1.6.0_51]
 2013-07-03 11:52:49,985 [Lucene Index Updater] WARN  a.t.p.m.AbstractPOIFSExtractor - Ignoring unexpected exception while parsing summary entry DocumentSummaryInformation 
 java.lang.ClassCastException: [B cannot be cast to java.lang.String
   at org.apache.poi.hpsf.DocumentSummaryInformation.getCategory(DocumentSummaryInformation.java:78) ~[poi-3.9.jar:3.9]
   at org.apache.tika.parser.microsoft.SummaryExtractor.parse(SummaryExtractor.java:143) [tika-parsers-1.4.jar:na]
   at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaryEntryIfExists(SummaryExtractor.java:88) [tika-parsers-1.4.jar:na]
   at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaries(SummaryExtractor.java:73) [tika-parsers-1.4.jar:na]
   at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:170) [tika-parsers-1.4.jar:na]
   at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161) [tika-parsers-1.4.jar:na]
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) [tika-core-1.4.jar:na]
   at 
 

[jira] [Assigned] (TIKA-1128) Replace line tabulation with line break

2013-05-30 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned TIKA-1128:


Assignee: Michael McCandless

 Replace line tabulation with line break
 ---

 Key: TIKA-1128
 URL: https://issues.apache.org/jira/browse/TIKA-1128
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.3
Reporter: Privezentsev Konstantin
Assignee: Michael McCandless
Priority: Trivial
 Attachments: 
 0001-TIKA-1128-Replace-line-tabular-by-line-break-when-ex.patch


 Tika's WordExtractor does not replace the line tabulation character with a 
 line break, as POI's WordExtractor does.
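
 As a rough illustration of the requested behaviour (not the attached patch 
 itself), the replacement could be sketched as below. The class and method 
 names are hypothetical; the only assumption taken from the issue is that the 
 line tabulation character is U+000B and should become a line break.

```java
// Hypothetical helper for illustration; this is not Tika's WordExtractor code.
final class LineTabulation {
    // U+000B, the "line tabulation" (vertical tab) character named in the issue.
    static final char LINE_TABULATION = (char) 0x0B;

    // Replace every line tabulation character with a plain line break.
    static String toLineBreaks(String text) {
        return text.replace(LINE_TABULATION, '\n');
    }

    public static void main(String[] args) {
        String raw = "first line" + LINE_TABULATION + "second line";
        System.out.println(toLineBreaks(raw));
    }
}
```

 A test case along these lines would only need a Word document whose 
 extracted text contains U+000B.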

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (TIKA-1128) Replace line tabulation with line break

2013-05-30 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated TIKA-1128:
-

Fix Version/s: 1.5

 Replace line tabulation with line break
 ---

 Key: TIKA-1128
 URL: https://issues.apache.org/jira/browse/TIKA-1128
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.3
Reporter: Privezentsev Konstantin
Assignee: Michael McCandless
Priority: Trivial
 Fix For: 1.5

 Attachments: 
 0001-TIKA-1128-Replace-line-tabular-by-line-break-when-ex.patch


 Tika's WordExtractor does not replace the line tabulation character with a 
 line break, as POI's WordExtractor does.



[jira] [Commented] (TIKA-1128) Replace line tabulation with line break

2013-05-30 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13670252#comment-13670252
 ] 

Michael McCandless commented on TIKA-1128:
--

Thanks Privezentsev.

Do you have an example Word document that contains the line tabulation 
character?  I'd like to add a basic test case ... thanks.

 Replace line tabulation with line break
 ---

 Key: TIKA-1128
 URL: https://issues.apache.org/jira/browse/TIKA-1128
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.3
Reporter: Privezentsev Konstantin
Assignee: Michael McCandless
Priority: Trivial
 Fix For: 1.5

 Attachments: 
 0001-TIKA-1128-Replace-line-tabular-by-line-break-when-ex.patch


 Tika's WordExtractor does not replace the line tabulation character with a 
 line break, as POI's WordExtractor does.



[jira] [Resolved] (TIKA-1128) Replace line tabulation with line break

2013-05-30 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved TIKA-1128.
--

   Resolution: Fixed
Fix Version/s: (was: 1.5)
   1.4

Thanks Konstantin, I just committed this!

 Replace line tabulation with line break
 ---

 Key: TIKA-1128
 URL: https://issues.apache.org/jira/browse/TIKA-1128
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.3
Reporter: Konstantin Privezentsev
Assignee: Michael McCandless
Priority: Trivial
 Fix For: 1.4

 Attachments: 
 0001-TIKA-1128-Replace-line-tabular-by-line-break-when-ex.patch, 
 tabular_symbol.doc


 Tika's WordExtractor does not replace the line tabulation character with a 
 line break, as POI's WordExtractor does.



[jira] [Commented] (TIKA-1098) not able to parse pdfs/docs/ppts using 1.1 tika parser

2013-03-27 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13615793#comment-13615793
 ] 

Michael McCandless commented on TIKA-1098:
--

Hmm, PDFBox is hitting that exception when Tika calls .getAnnotations.

You might be able to work around this if you call 
PDFParser.setExtractAnnotationText(false)?  Then Tika shouldn't call 
.getAnnotations...

It looks like PDFBOX-1273 is the same issue.

 not able to parse pdfs/docs/ppts using 1.1 tika parser
 

 Key: TIKA-1098
 URL: https://issues.apache.org/jira/browse/TIKA-1098
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.1
 Environment: linux redhat
Reporter: Qian Diao
 Attachments: url_1763_approx-alg-notes.pdf


 Hi,
 I got some parsing problems when using Tika 1.1 for the attached pdf file.
 my code (Test.java):
 import java.io.File;
 import java.io.InputStream;
 import java.io.FileInputStream;
 import org.apache.tika.metadata.Metadata;
 import org.apache.tika.parser.AutoDetectParser;
 import org.apache.tika.parser.ParseContext;
 import org.apache.tika.parser.Parser;
 import org.apache.tika.parser.html.BoilerpipeContentHandler;
 import org.apache.tika.sax.BodyContentHandler;
 import org.apache.tika.parser.html.HtmlParser;
 import de.l3s.boilerpipe.extractors.ArticleExtractor;
 public class Test {
     private static final String validBoilerpipeFilenameRegEx =
             ".*(\\.)(htm|html|shtml|php|asp|aspx)$";

     public String parseFile(File inFile) {
         if (inFile == null || !inFile.isFile() || !inFile.canRead())
             return null;

         InputStream is = null;
         String outputText = "";
         try {
             // Open input stream
             is = new FileInputStream(inFile);
             // Prepare parser
             BodyContentHandler contenthandler = new BodyContentHandler(-1);
             Metadata metadata = new Metadata();
             metadata.set(Metadata.RESOURCE_NAME_KEY, inFile.getName());
             ParseContext pc = new ParseContext();
             // Call parse with boilerpipe if valid boilerpipe extension;
             // otherwise, call regular parse.
             if (!inFile.getName().matches(validBoilerpipeFilenameRegEx)) {
                 Parser parser = new AutoDetectParser();
                 parser.parse(is, contenthandler, metadata, pc);
             } else {
                 Parser parser = new HtmlParser();
                 BoilerpipeContentHandler bh =
                         new BoilerpipeContentHandler(contenthandler, new ArticleExtractor());
                 parser.parse(is, bh, metadata, pc);
             }
             // Prepare text for write
             outputText = contenthandler.toString();
         } catch (Exception e) {
             System.out.println(e);
             return null;
         } finally {
             try {
                 if (is != null)
                     is.close();
             } catch (Exception e) {
                 // Ignore failures while closing the stream
             }
         }

         return outputText;
     }
 }
 Output:
 org.apache.tika.exception.TikaException: Unable to extract PDF content
 url_1763_approx-alg-notes.pdf



[jira] [Commented] (TIKA-1074) Extraction should continue if an exception is hit visiting an embedded document

2013-02-23 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13585082#comment-13585082
 ] 

Michael McCandless commented on TIKA-1074:
--

bq. My app needs to extract text even from corrupt documents.

That's exactly the intent here as well.

bq. Currently I am setting ParseContext with a custom AutoDetectParser that, 
when an exception is hit, e.g. visiting an embedded, catches the exception, 
logs it AND extracts raw/binary strings from the problematic doc (or embedded)

Wait, the exceptions that this change now catches and logs occur while
decoding an OLE10 embedded entry (into its byte[] data), not while
actually parsing the resulting byte[] data.  If the exception is
hit later when we recurse into parseEmbedded, the exception is still
thrown as before, so your custom AutoDetectParser will still
see/handle the exception.

But I think this is separately a good idea (an AutoDetectParser
logging and continuing by default): is this something you could possibly
contribute...?

Do you have an example corrupted document?  We could test before/after
this change and see.


 Extraction should continue if an exception is hit visiting an embedded 
 document
 ---

 Key: TIKA-1074
 URL: https://issues.apache.org/jira/browse/TIKA-1074
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 1.4

 Attachments: TIKA-1074.patch, TIKA-1074.patch


 Spinoff from TIKA-1072.
 In that issue, a problematic document (still not sure if document is corrupt, 
 or possible POI bug) caused an exception when visiting the embedded documents.
 If I change Tika to suppress that exception, the rest of the document 
 extracts fine.
 So somehow I think we should be more robust here, and maybe log the 
 exception, or save/record the exception(s) somewhere so after parsing the app 
 could decide what to do about them ...



[jira] [Commented] (TIKA-1074) Extraction should continue if an exception is hit visiting an embedded document

2013-02-22 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13584176#comment-13584176
 ] 

Michael McCandless commented on TIKA-1074:
--

Thanks Jukka.

InterruptedException is never thrown in these places today, so I can't add the 
separate catch clause (compiler is angry).

So, the instanceof check for IE is there in case we do handle interrupts in 
these places in the future ... we could just remove it and add it back in the 
future if we add IE (seems risky).

Or I can change that code to throw TikaException instead on interrupt (and 
restore the interrupt bit), except that in the TikaCLI case, 
EmbeddedDocumentExtractor.parseEmbedded doesn't throw TikaException today (the 
other two places already do).  But it's a little weird to throw TikaException 
in response to an interrupt (i.e., code above will be trying to catch an IE) 
... I think it's cleaner to set the interrupt bit and let the next place that 
waits see the interrupt bit and throw IE?
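
To make the "set the interrupt bit and let the next place that waits throw 
IE" option concrete, here is a minimal sketch using only java.lang. The class 
and method names are made up for illustration and do not appear in the patch; 
the simulated InterruptedException stands in for one that a caught-all 
handler might swallow.

```java
// Hypothetical demo class; not part of Tika or of the attached patch.
final class InterruptBitDemo {
    // A catch-all handler that swallows the exception but preserves interruption.
    static void swallowButRestore(Exception e) {
        if (e instanceof InterruptedException) {
            // Restore the interrupt status so later blocking calls notice it.
            Thread.currentThread().interrupt();
        }
    }

    // Returns true if the next blocking call observed the restored interrupt.
    static boolean nextWaitSeesInterrupt() {
        swallowButRestore(new InterruptedException("simulated"));
        try {
            Thread.sleep(10);
            return false;
        } catch (InterruptedException expected) {
            // Thread.sleep clears the interrupt status when it throws.
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(nextWaitSeesInterrupt());
    }
}
```

The trade-off is that the current work still runs to completion; only the 
next blocking call stops.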

 Extraction should continue if an exception is hit visiting an embedded 
 document
 ---

 Key: TIKA-1074
 URL: https://issues.apache.org/jira/browse/TIKA-1074
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 1.4

 Attachments: TIKA-1074.patch, TIKA-1074.patch


 Spinoff from TIKA-1072.
 In that issue, a problematic document (still not sure if document is corrupt, 
 or possible POI bug) caused an exception when visiting the embedded documents.
 If I change Tika to suppress that exception, the rest of the document 
 extracts fine.
 So somehow I think we should be more robust here, and maybe log the 
 exception, or save/record the exception(s) somewhere so after parsing the app 
 could decide what to do about them ...



[jira] [Commented] (TIKA-1074) Extraction should continue if an exception is hit visiting an embedded document

2013-02-22 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13584249#comment-13584249
 ] 

Michael McCandless commented on TIKA-1074:
--

{quote}
bq. InterruptedException is never thrown in these places today, so I can't add 
the separate catch clause (compiler is angry).

It's a checked exception, so if it isn't declared to be thrown by POI, it 
shouldn't get thrown here (even though the VM doesn't strictly prohibit that).
{quote}

Exactly: I'm trying to future proof.

bq. So in that case the extra check shouldn't even be needed.

Wait, do you mean I should remove the handling entirely (not bother future 
proofing)?

{quote}
bq. I think it's cleaner to set the interrupt bit and let the next place that 
waits see the interrupt bit and throw IE?

I don't really like this approach. We're essentially saying: "Yes, you asked 
me to stop what I'm doing, but instead I'll just finish up what I was doing 
and ask the next guy to stop." Instead, when receiving an IE I'd prefer Tika 
to stop immediately, either by letting the IE bubble up or (where necessary) 
by throwing a TikaException that wraps the IE.
{quote}

OK, maybe we can throw TikaException today (*and* set the interrupt
bit), and then in the future (if/when these places really do throw
IE), we can change this to throwing an IE instead of TikaException.  I
can put that as a TODO.
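
A sketch of that compromise, assuming a stand-in checked exception in place 
of TikaException so the example compiles without Tika on the classpath. All 
names here (ExtractionException, InterruptToException, rethrowIfInterrupted) 
are hypothetical.

```java
// Hypothetical stand-in for TikaException, used only in this sketch.
final class ExtractionException extends Exception {
    ExtractionException(String message, Throwable cause) {
        super(message, cause);
    }
}

final class InterruptToException {
    // Called from a catch (Exception e) block: stop at once on interrupt,
    // otherwise return so the caller can log and continue.
    static void rethrowIfInterrupted(Exception e) throws ExtractionException {
        if (e instanceof InterruptedException) {
            // Set the interrupt bit as well, so callers that only check the
            // thread's status still see the interruption.
            Thread.currentThread().interrupt();
            // TODO: throw InterruptedException directly if these places ever declare it.
            throw new ExtractionException("Interrupted while visiting embedded document", e);
        }
    }
}
```

The wrapped cause keeps the original InterruptedException available to any 
caller that wants to inspect it.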


 Extraction should continue if an exception is hit visiting an embedded 
 document
 ---

 Key: TIKA-1074
 URL: https://issues.apache.org/jira/browse/TIKA-1074
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 1.4

 Attachments: TIKA-1074.patch, TIKA-1074.patch


 Spinoff from TIKA-1072.
 In that issue, a problematic document (still not sure if document is corrupt, 
 or possible POI bug) caused an exception when visiting the embedded documents.
 If I change Tika to suppress that exception, the rest of the document 
 extracts fine.
 So somehow I think we should be more robust here, and maybe log the 
 exception, or save/record the exception(s) somewhere so after parsing the app 
 could decide what to do about them ...



[jira] [Commented] (TIKA-1074) Extraction should continue if an exception is hit visiting an embedded document

2013-02-22 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13584362#comment-13584362
 ] 

Michael McCandless commented on TIKA-1074:
--

OK I'll remove the future proofing.

 Extraction should continue if an exception is hit visiting an embedded 
 document
 ---

 Key: TIKA-1074
 URL: https://issues.apache.org/jira/browse/TIKA-1074
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 1.4

 Attachments: TIKA-1074.patch, TIKA-1074.patch


 Spinoff from TIKA-1072.
 In that issue, a problematic document (still not sure if document is corrupt, 
 or possible POI bug) caused an exception when visiting the embedded documents.
 If I change Tika to suppress that exception, the rest of the document 
 extracts fine.
 So somehow I think we should be more robust here, and maybe log the 
 exception, or save/record the exception(s) somewhere so after parsing the app 
 could decide what to do about them ...



[jira] [Resolved] (TIKA-1074) Extraction should continue if an exception is hit visiting an embedded document

2013-02-21 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved TIKA-1074.
--

Resolution: Fixed

 Extraction should continue if an exception is hit visiting an embedded 
 document
 ---

 Key: TIKA-1074
 URL: https://issues.apache.org/jira/browse/TIKA-1074
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 1.4

 Attachments: TIKA-1074.patch, TIKA-1074.patch


 Spinoff from TIKA-1072.
 In that issue, a problematic document (still not sure if document is corrupt, 
 or possible POI bug) caused an exception when visiting the embedded documents.
 If I change Tika to suppress that exception, the rest of the document 
 extracts fine.
 So somehow I think we should be more robust here, and maybe log the 
 exception, or save/record the exception(s) somewhere so after parsing the app 
 could decide what to do about them ...



[jira] [Resolved] (TIKA-1074) Extraction should continue if an exception is hit visiting an embedded document

2013-02-20 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved TIKA-1074.
--

Resolution: Fixed

 Extraction should continue if an exception is hit visiting an embedded 
 document
 ---

 Key: TIKA-1074
 URL: https://issues.apache.org/jira/browse/TIKA-1074
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 1.4

 Attachments: TIKA-1074.patch


 Spinoff from TIKA-1072.
 In that issue, a problematic document (still not sure if document is corrupt, 
 or possible POI bug) caused an exception when visiting the embedded documents.
 If I change Tika to suppress that exception, the rest of the document 
 extracts fine.
 So somehow I think we should be more robust here, and maybe log the 
 exception, or save/record the exception(s) somewhere so after parsing the app 
 could decide what to do about them ...



[jira] [Updated] (TIKA-1074) Extraction should continue if an exception is hit visiting an embedded document

2013-02-20 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated TIKA-1074:
-

Attachment: TIKA-1074.patch

Patch, catching Exception not Throwable, and restoring the interrupt bit if the 
exc was InterruptedException.

 Extraction should continue if an exception is hit visiting an embedded 
 document
 ---

 Key: TIKA-1074
 URL: https://issues.apache.org/jira/browse/TIKA-1074
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 1.4

 Attachments: TIKA-1074.patch, TIKA-1074.patch


 Spinoff from TIKA-1072.
 In that issue, a problematic document (still not sure if document is corrupt, 
 or possible POI bug) caused an exception when visiting the embedded documents.
 If I change Tika to suppress that exception, the rest of the document 
 extracts fine.
 So somehow I think we should be more robust here, and maybe log the 
 exception, or save/record the exception(s) somewhere so after parsing the app 
 could decide what to do about them ...



[jira] [Reopened] (TIKA-1074) Extraction should continue if an exception is hit visiting an embedded document

2013-02-20 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reopened TIKA-1074:
--


 Extraction should continue if an exception is hit visiting an embedded 
 document
 ---

 Key: TIKA-1074
 URL: https://issues.apache.org/jira/browse/TIKA-1074
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 1.4

 Attachments: TIKA-1074.patch


 Spinoff from TIKA-1072.
 In that issue, a problematic document (still not sure if document is corrupt, 
 or possible POI bug) caused an exception when visiting the embedded documents.
 If I change Tika to suppress that exception, the rest of the document 
 extracts fine.
 So somehow I think we should be more robust here, and maybe log the 
 exception, or save/record the exception(s) somewhere so after parsing the app 
 could decide what to do about them ...



[jira] [Commented] (TIKA-1074) Extraction should continue if an exception is hit visiting an embedded document

2013-02-20 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13582481#comment-13582481
 ] 

Michael McCandless commented on TIKA-1074:
--

Thanks Uwe, I'll change to catching Exception not Throwable, and restoring the 
interrupt bit for InterruptedException.

 Extraction should continue if an exception is hit visiting an embedded 
 document
 ---

 Key: TIKA-1074
 URL: https://issues.apache.org/jira/browse/TIKA-1074
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 1.4

 Attachments: TIKA-1074.patch


 Spinoff from TIKA-1072.
 In that issue, a problematic document (still not sure if document is corrupt, 
 or possible POI bug) caused an exception when visiting the embedded documents.
 If I change Tika to suppress that exception, the rest of the document 
 extracts fine.
 So somehow I think we should be more robust here, and maybe log the 
 exception, or save/record the exception(s) somewhere so after parsing the app 
 could decide what to do about them ...



[jira] [Assigned] (TIKA-1074) Extraction should continue if an exception is hit visiting an embedded document

2013-02-09 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned TIKA-1074:


Assignee: Michael McCandless

 Extraction should continue if an exception is hit visiting an embedded 
 document
 ---

 Key: TIKA-1074
 URL: https://issues.apache.org/jira/browse/TIKA-1074
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 1.4


 Spinoff from TIKA-1072.
 In that issue, a problematic document (still not sure if document is corrupt, 
 or possible POI bug) caused an exception when visiting the embedded documents.
 If I change Tika to suppress that exception, the rest of the document 
 extracts fine.
 So somehow I think we should be more robust here, and maybe log the 
 exception, or save/record the exception(s) somewhere so after parsing the app 
 could decide what to do about them ...



[jira] [Updated] (TIKA-1074) Extraction should continue if an exception is hit visiting an embedded document

2013-02-09 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated TIKA-1074:
-

Attachment: TIKA-1074.patch

Patch, just logging a warning and continuing, if we hit the exceptions in 
TIKA-1072, TIKA-1078 or TIKA-1079.  I think it's ready.
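
The log-and-continue idea can be sketched roughly as follows. This is not the 
patch itself: EmbeddedEntry and LenientEmbeddedVisitor are hypothetical 
stand-ins, and java.util.logging is used only to keep the example 
self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch; the real change lives in Tika's parser classes.
final class LenientEmbeddedVisitor {
    private static final Logger LOG =
            Logger.getLogger(LenientEmbeddedVisitor.class.getName());

    // Stand-in for one embedded entry whose decoding may fail.
    interface EmbeddedEntry {
        String decode() throws Exception;
    }

    // Decode every entry; a failure in one entry is logged and skipped
    // so that the remaining entries are still extracted.
    static List<String> extractAll(List<EmbeddedEntry> entries) {
        List<String> results = new ArrayList<>();
        for (EmbeddedEntry entry : entries) {
            try {
                results.add(entry.decode());
            } catch (Exception e) {
                LOG.log(Level.WARNING,
                        "Ignoring unexpected exception in embedded document", e);
            }
        }
        return results;
    }
}
```

With three entries where the second throws, the first and third are still 
returned and one warning is logged.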

 Extraction should continue if an exception is hit visiting an embedded 
 document
 ---

 Key: TIKA-1074
 URL: https://issues.apache.org/jira/browse/TIKA-1074
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 1.4

 Attachments: TIKA-1074.patch


 Spinoff from TIKA-1072.
 In that issue, a problematic document (still not sure if document is corrupt, 
 or possible POI bug) caused an exception when visiting the embedded documents.
 If I change Tika to suppress that exception, the rest of the document 
 extracts fine.
 So somehow I think we should be more robust here, and maybe log the 
 exception, or save/record the exception(s) somewhere so after parsing the app 
 could decide what to do about them ...



[jira] [Commented] (TIKA-369) Improve accuracy of language detection

2013-02-07 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13573492#comment-13573492
 ] 

Michael McCandless commented on TIKA-369:
-

The language-detection lib is now in Maven: 
http://search.maven.org/#artifactdetails|com.cybozu.labs|langdetect|1.1-20120112|jar

And it's compiled to Java 5 ...

I think we should do a hard cutover (replace Tika's current language detection 
with this library)?  Any objections?

 Improve accuracy of language detection
 --

 Key: TIKA-369
 URL: https://issues.apache.org/jira/browse/TIKA-369
 Project: Tika
  Issue Type: Improvement
  Components: languageidentifier
Affects Versions: 0.6
Reporter: Ken Krugler
Assignee: Ken Krugler
 Attachments: lingdet-mccs.pdf, Surprise and Coincidence.pdf, 
 textcat.pdf


 Currently the LanguageProfile code uses 3-grams to find the best language 
 profile using Pearson's chi-square test. This has three issues:
 1. The results aren't very good for short runs of text. Ted Dunning's paper 
 (attached) indicates that a log-likelihood ratio (LLR) test works much 
 better, which would then make language detection faster due to less text 
 needing to be processed.
 2. The current LanguageIdentifier.isReasonablyCertain() method uses an exact 
 value as a threshold for certainty. This is very sensitive to the amount of 
 text being processed, and thus gives false negative results for short runs of 
 text.
 3. Certainty should also be based on how much better the result is for 
 language X, compared to the next best language. If two languages both had 
 identical sum-of-squares values, and this value was below the threshold, then 
 the result is still not very certain.
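
 Point 3 could be sketched along these lines. This is an illustration under 
 assumptions, not LanguageIdentifier's code: the map of per-language distances 
 (lower is better), the maxDistance threshold and the minMargin parameter are 
 all hypothetical.

```java
import java.util.Map;

// Hypothetical sketch of a margin-based certainty check.
final class MarginCertainty {
    // distances maps a language code to its distance from the document
    // profile; a lower distance means a better match.
    static boolean isReasonablyCertain(Map<String, Double> distances,
                                       double maxDistance, double minMargin) {
        double best = Double.POSITIVE_INFINITY;
        double second = Double.POSITIVE_INFINITY;
        for (double d : distances.values()) {
            if (d < best) {
                second = best;
                best = d;
            } else if (d < second) {
                second = d;
            }
        }
        // Certain only if the best match is good in absolute terms AND
        // clearly better than the runner-up.
        return best <= maxDistance && (second - best) >= minMargin;
    }
}
```

 Two languages with identical distances fail the margin test even when both 
 are below the absolute threshold, which is the case the issue describes.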



[jira] [Resolved] (TIKA-1053) Upgrade Tika Parsers to use ASM 4.x

2013-02-07 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved TIKA-1053.
--

   Resolution: Fixed
Fix Version/s: 1.4

Thanks Uwe.

 Upgrade Tika Parsers to use ASM 4.x
 ---

 Key: TIKA-1053
 URL: https://issues.apache.org/jira/browse/TIKA-1053
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.2
Reporter: Vincent Massol
Assignee: Michael McCandless
 Fix For: 1.4

 Attachments: TIKA-1053.patch


 Right now Tika 1.2 uses ASM 3.1. 
 However this is causing some issues for us on the XWiki project since we also 
 bundle other framework that use a more recent version of ASM (we use pegdown 
 which uses parboiled which draws ASM 4.0).
 The problem is that ASM 3.x and 4.0 are not compatible...
 See http://jira.xwiki.org/browse/XE-1269 for more details about the issue 
 we're facing.
 Thanks for considering upgrading to ASM 4.x :)



[jira] [Created] (TIKA-1078) TikaCLI: invalid characters in embedded document name causes FNFE when trying to save

2013-02-05 Thread Michael McCandless (JIRA)
Michael McCandless created TIKA-1078:


 Summary: TikaCLI: invalid characters in embedded document name 
causes FNFE when trying to save
 Key: TIKA-1078
 URL: https://issues.apache.org/jira/browse/TIKA-1078
 Project: Tika
  Issue Type: Bug
Reporter: Michael McCandless
 Fix For: 1.4
 Attachments: T-DS_Excel2003-PPT2003_1.xls

Attached document hits this on Windows:

{noformat}
C:\>java.exe -jar tika-app-1.3.jar -z -x c:\data\idit\T-DS_Excel2003-PPT2003_1.xls
Extracting 'file0.png' (image/png) to .\file0.png
Extracting 'file1.emf' (application/x-emf) to .\file1.emf
Extracting 'file2.jpg' (image/jpeg) to .\file2.jpg
Extracting 'file3.emf' (application/x-emf) to .\file3.emf
Extracting 'file4.wmf' (application/x-msmetafile) to .\file4.wmf
Extracting 'MBD0016BDE4/?£☺.bin' (application/octet-stream) to .\MBD0016BDE4\?£☺.bin
Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.OfficeParser@75f875f8
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:248)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:139)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:415)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:109)
Caused by: java.io.FileNotFoundException: .\MBD0016BDE4\?£☺.bin (The filename, directory name, or volume label syntax is incorrect.)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:205)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:156)
    at org.apache.tika.cli.TikaCLI$FileEmbeddedDocumentExtractor.parseEmbedded(TikaCLI.java:722)
    at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:201)
    at org.apache.tika.parser.microsoft.ExcelExtractor.parse(ExcelExtractor.java:158)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:194)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    ... 5 more
{noformat}

TikaCLI manages to create the sub-directory, but because the embedded fileName 
has invalid (for Windows) characters, it fails.

On Linux it runs fine.

I think somehow ... we have to sanitize the embedded file name ...
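
One possible way to do the sanitizing, as a rough sketch only: replace every character that Windows rejects in a file name with an underscore, one path component at a time. The names EmbeddedNameSanitizer and sanitize are made up for illustration and are not part of TikaCLI; reserved device names (CON, NUL, ...) and trailing dots or spaces are deliberately not handled here.

```java
final class EmbeddedNameSanitizer {
    // Characters Windows does not allow inside a single file name component.
    private static final String INVALID = "<>:\"/\\|?*";

    /** Returns a copy of one path component with problem characters replaced by '_'. */
    static String sanitize(String component) {
        StringBuilder sb = new StringBuilder(component.length());
        for (int i = 0; i < component.length(); i++) {
            char c = component.charAt(i);
            // ASCII control characters (below 0x20) are rejected by Windows as well.
            sb.append(c < 0x20 || INVALID.indexOf(c) >= 0 ? '_' : c);
        }
        // Never hand back an empty name (assumption: "_" is an acceptable fallback).
        return sb.length() == 0 ? "_" : sb.toString();
    }
}
```

The embedded name would have to be split on its separator first and each component passed through sanitize, so that the sub-directory part is kept. Replacing characters can make two different embedded names collide, so a real fix may also need a uniqueness check.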

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (TIKA-1078) TikaCLI: invalid characters in embedded document name causes FNFE when trying to save

2013-02-05 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated TIKA-1078:
-

Attachment: T-DS_Excel2003-PPT2003_1.xls

 TikaCLI: invalid characters in embedded document name causes FNFE when trying 
 to save
 -

 Key: TIKA-1078
 URL: https://issues.apache.org/jira/browse/TIKA-1078
 Project: Tika
  Issue Type: Bug
Reporter: Michael McCandless
 Fix For: 1.4

 Attachments: T-DS_Excel2003-PPT2003_1.xls


 Attached document hits this on Windows:
 {noformat}
 C:\java.exe -jar tika-app-1.3.jar -z -x c:\data\idit\T-DS_Excel2003-PPT2003_1.xls
 Extracting 'file0.png' (image/png) to .\file0.png
 Extracting 'file1.emf' (application/x-emf) to .\file1.emf
 Extracting 'file2.jpg' (image/jpeg) to .\file2.jpg
 Extracting 'file3.emf' (application/x-emf) to .\file3.emf
 Extracting 'file4.wmf' (application/x-msmetafile) to .\file4.wmf
 Extracting 'MBD0016BDE4/?£☺.bin' (application/octet-stream) to .\MBD0016BDE4\?£☺.bin
 Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.OfficeParser@75f875f8
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:248)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:139)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:415)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:109)
 Caused by: java.io.FileNotFoundException: .\MBD0016BDE4\?£☺.bin (The filename, directory name, or volume label syntax is incorrect.)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:205)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:156)
    at org.apache.tika.cli.TikaCLI$FileEmbeddedDocumentExtractor.parseEmbedded(TikaCLI.java:722)
    at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:201)
    at org.apache.tika.parser.microsoft.ExcelExtractor.parse(ExcelExtractor.java:158)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:194)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    ... 5 more
 {noformat}
 TikaCLI manages to create the sub-directory, but because the embedded 
 fileName has invalid (for Windows) characters, it fails.
 On Linux it runs fine.
 I think somehow ... we have to sanitize the embedded file name ...



[jira] [Updated] (TIKA-1079) Word document hits AIOOBE in SummaryExtractor.parseSummaries

2013-02-05 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated TIKA-1079:
-

Attachment: guide_to_daips_(id_3152_ver_1.0.0).doc

 Word document hits AIOOBE in SummaryExtractor.parseSummaries
 

 Key: TIKA-1079
 URL: https://issues.apache.org/jira/browse/TIKA-1079
 Project: Tika
  Issue Type: Bug
Reporter: Michael McCandless
 Fix For: 1.4

 Attachments: guide_to_daips_(id_3152_ver_1.0.0).doc


 I'm not yet sure if this is a corrupted document (though, MS Word opens it 
 just fine) or a bug in POI ... but I hit this exc when running it through 
 TikaCLI:
 {noformat}
 java.lang.ArrayIndexOutOfBoundsException: -1
    at org.apache.poi.hpsf.CodePageString.<init>(CodePageString.java:161)
    at org.apache.poi.hpsf.TypedPropertyValue.readValue(TypedPropertyValue.java:158)
    at org.apache.poi.hpsf.VariantSupport.read(VariantSupport.java:163)
    at org.apache.poi.hpsf.Property.<init>(Property.java:164)
    at org.apache.poi.hpsf.Section.<init>(Section.java:277)
    at org.apache.poi.hpsf.PropertySet.<init>(PropertySet.java:451)
    at org.apache.poi.hpsf.PropertySet.<init>(PropertySet.java:246)
    at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaryEntryIfExists(SummaryExtractor.java:78)
    at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaries(SummaryExtractor.java:69)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:170)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:139)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:415)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:109)
 {noformat}



[jira] [Commented] (TIKA-1074) Extraction should continue if an exception is hit visiting an embedded document

2013-02-05 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13571288#comment-13571288
 ] 

Michael McCandless commented on TIKA-1074:
--

TIKA-1079 is another example where if we recorded/logged an exc and moved on we 
could have parsed the rest of the document ...

 Extraction should continue if an exception is hit visiting an embedded 
 document
 ---

 Key: TIKA-1074
 URL: https://issues.apache.org/jira/browse/TIKA-1074
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Michael McCandless
 Fix For: 1.4


 Spinoff from TIKA-1072.
 In that issue, a problematic document (still not sure if document is corrupt, 
 or possible POI bug) caused an exception when visiting the embedded documents.
 If I change Tika to suppress that exception, the rest of the document 
 extracts fine.
 So somehow I think we should be more robust here, and maybe log the 
 exception, or save/record the exception(s) somewhere so after parsing the app 
 could decide what to do about them ...



[jira] [Commented] (TIKA-1072) AIOOBE when handling embedded document in .doc file

2013-02-04 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13570208#comment-13570208
 ] 

Michael McCandless commented on TIKA-1072:
--

OK I did some digging on this.  The DirectoryNode of this embedded document has 
these entries:
{noformat}
ent=PICT size=797
ent=ObjInfo size=4
ent=Ole10Native size=40
ent=Ole10FmtProgID size=13
ent=OlePres000 size=40
ent=CompObj size=82
ent=PIC size=100
ent=META size=582
ent=Ole size=20
{noformat}

And so I believe it really is an OLE10Native record... OLE10Native then tries 
to parse it, with plain=false, but then runs out of bytes on this line:
{noformat}
  flags2 = LittleEndian.getShort(data, ofs);
{noformat}

It seems likely something is corrupt about this entry?  Does 40 bytes seem way 
too small for an OLE10Native entry? If so, I wonder if we could fix 
AbstractPOIFSExtractor to log the exception and then skip this one embedded 
document and then go on to parsing the others?  Ie, isolate the exception, 
rather than aborting the entire extraction; in this case the main document 
extracts fine.
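
To make the failure mode concrete, here is a hedged sketch in plain Java (not POI's actual code) of a little-endian short read that checks the remaining length first and reports truncation with a dedicated checked exception instead of an AIOOBE. BoundedReader and TruncatedEntryException are hypothetical names used only for this illustration.

```java
final class BoundedReader {
    /** Hypothetical checked exception signalling that the entry ended early. */
    static final class TruncatedEntryException extends Exception {
        TruncatedEntryException(String message) {
            super(message);
        }
    }

    /** Reads a signed 16-bit little-endian value at ofs, or fails with a clear error. */
    static short readShortLE(byte[] data, int ofs) throws TruncatedEntryException {
        // Two bytes are needed: data[ofs] and data[ofs + 1].
        if (ofs < 0 || ofs > data.length - 2) {
            throw new TruncatedEntryException(
                "need 2 bytes at offset " + ofs + " but entry has only " + data.length);
        }
        return (short) ((data[ofs] & 0xFF) | ((data[ofs + 1] & 0xFF) << 8));
    }
}
```

With a check like this, a read at offset 40 of a 40-byte entry would raise TruncatedEntryException, which a caller could treat the same way as Ole10NativeException and skip just that entry.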

 AIOOBE when handling embedded document in .doc file
 ---

 Key: TIKA-1072
 URL: https://issues.apache.org/jira/browse/TIKA-1072
 Project: Tika
  Issue Type: Bug
Reporter: Michael McCandless
 Fix For: 1.4

 Attachments: 20-Force-on-a-current-S00.doc


 I have a Word (.doc) document that hits an exception when I run:
 {noformat}
 java -jar tika-app/target/tika-app-1.4-SNAPSHOT.jar 
 /x/tmp/20-Force-on-a-current-S00.doc 
 {noformat}
 Here's the exception:
 {noformat}
 Caused by: java.lang.ArrayIndexOutOfBoundsException: 40
    at org.apache.poi.util.LittleEndian.getShort(LittleEndian.java:225)
    at org.apache.poi.poifs.filesystem.Ole10Native.<init>(Ole10Native.java:139)
    at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:89)
    at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:149)
    at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:135)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:186)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
 {noformat}
 It happens when we try to parse an OLE10 embedded object ... the code
 that does this parsing captures and ignores Ole10NativeException and
 skips the entry ... so I'm wondering if we should also catch AIOOBE
 and skip the entry?  Ie, maybe this entry really is not OLE10, and the
 Ole10Native code is failing to throw Ole10NativeException for it?



[jira] [Commented] (TIKA-1072) AIOOBE when handling embedded document in .doc file

2013-02-04 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13570305#comment-13570305
 ] 

Michael McCandless commented on TIKA-1072:
--

Thanks Nick, I'll try asking on dev@poi.

I'll open a separate issue about continuing parsing even when an embedded doc 
hits an exception ...

 AIOOBE when handling embedded document in .doc file
 ---

 Key: TIKA-1072
 URL: https://issues.apache.org/jira/browse/TIKA-1072
 Project: Tika
  Issue Type: Bug
Reporter: Michael McCandless
 Fix For: 1.4

 Attachments: 20-Force-on-a-current-S00.doc


 I have a Word (.doc) document that hits an exception when I run:
 {noformat}
 java -jar tika-app/target/tika-app-1.4-SNAPSHOT.jar 
 /x/tmp/20-Force-on-a-current-S00.doc 
 {noformat}
 Here's the exception:
 {noformat}
 Caused by: java.lang.ArrayIndexOutOfBoundsException: 40
    at org.apache.poi.util.LittleEndian.getShort(LittleEndian.java:225)
    at org.apache.poi.poifs.filesystem.Ole10Native.<init>(Ole10Native.java:139)
    at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:89)
    at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:149)
    at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:135)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:186)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
 {noformat}
 It happens when we try to parse an OLE10 embedded object ... the code
 that does this parsing captures and ignores Ole10NativeException and
 skips the entry ... so I'm wondering if we should also catch AIOOBE
 and skip the entry?  Ie, maybe this entry really is not OLE10, and the
 Ole10Native code is failing to throw Ole10NativeException for it?



[jira] [Commented] (TIKA-1072) AIOOBE when handling embedded document in .doc file

2013-02-04 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13570308#comment-13570308
 ] 

Michael McCandless commented on TIKA-1072:
--

OK I opened TIKA-1074; this issue will explore whether this document is corrupt 
or not ...

 AIOOBE when handling embedded document in .doc file
 ---

 Key: TIKA-1072
 URL: https://issues.apache.org/jira/browse/TIKA-1072
 Project: Tika
  Issue Type: Bug
Reporter: Michael McCandless
 Fix For: 1.4

 Attachments: 20-Force-on-a-current-S00.doc


 I have a Word (.doc) document that hits an exception when I run:
 {noformat}
 java -jar tika-app/target/tika-app-1.4-SNAPSHOT.jar 
 /x/tmp/20-Force-on-a-current-S00.doc 
 {noformat}
 Here's the exception:
 {noformat}
 Caused by: java.lang.ArrayIndexOutOfBoundsException: 40
    at org.apache.poi.util.LittleEndian.getShort(LittleEndian.java:225)
    at org.apache.poi.poifs.filesystem.Ole10Native.<init>(Ole10Native.java:139)
    at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:89)
    at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:149)
    at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:135)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:186)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
 {noformat}
 It happens when we try to parse an OLE10 embedded object ... the code
 that does this parsing captures and ignores Ole10NativeException and
 skips the entry ... so I'm wondering if we should also catch AIOOBE
 and skip the entry?  Ie, maybe this entry really is not OLE10, and the
 Ole10Native code is failing to throw Ole10NativeException for it?



[jira] [Created] (TIKA-1074) Extraction should continue if an exception is hit visiting an embedded document

2013-02-04 Thread Michael McCandless (JIRA)
Michael McCandless created TIKA-1074:


 Summary: Extraction should continue if an exception is hit 
visiting an embedded document
 Key: TIKA-1074
 URL: https://issues.apache.org/jira/browse/TIKA-1074
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Michael McCandless
 Fix For: 1.4


Spinoff from TIKA-1072.

In that issue, a problematic document (still not sure if document is corrupt, 
or possible POI bug) caused an exception when visiting the embedded documents.

If I change Tika to suppress that exception, the rest of the document extracts 
fine.

So somehow I think we should be more robust here, and maybe log the exception, 
or save/record the exception(s) somewhere so after parsing the app could decide 
what to do about them ...
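
A minimal sketch of the "record and move on" idea, under the assumption that each embedded document is handled by some callback; LenientEmbeddedWalker, EmbeddedHandler and visitAll are hypothetical names, not existing Tika classes. Each failure is caught, stored, and returned so the application can decide afterwards what to do with it.

```java
import java.util.ArrayList;
import java.util.List;

final class LenientEmbeddedWalker {
    /** Hypothetical stand-in for whatever processes one embedded document. */
    interface EmbeddedHandler {
        void handle(String name) throws Exception;
    }

    /** Visits every entry; a failure in one entry is recorded and does not stop the rest. */
    static List<Exception> visitAll(List<String> names, EmbeddedHandler handler) {
        List<Exception> failures = new ArrayList<>();
        for (String name : names) {
            try {
                handler.handle(name);
            } catch (Exception e) {
                // Isolate the problem to this one embedded document.
                failures.add(e);
            }
        }
        return failures;
    }
}
```

Catching Exception (rather than only IOException) is what lets runtime failures such as the AIOOBE from TIKA-1072 be isolated too; whether that is too broad for Tika is an open design question.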



[jira] [Updated] (TIKA-1072) AIOOBE when handling embedded document in .doc file

2013-02-04 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated TIKA-1072:
-

Attachment: Ole10NativeEntry.bin

I'm attaching the 40-byte \U0001Ole10Native entry; here's the hex dump:

0000  24 00 00 00 02 00 01 01  00 0a 01 12 83 46 02 86  |$............F..|
0010  3d 12 83 49 12 83 6c 12  83 42 12 82 73 12 82 69  |=..I..l..B..s..i|
0020  12 82 6e 02 84 71 00 00                           |..n..q..|
0028


 AIOOBE when handling embedded document in .doc file
 ---

 Key: TIKA-1072
 URL: https://issues.apache.org/jira/browse/TIKA-1072
 Project: Tika
  Issue Type: Bug
Reporter: Michael McCandless
 Fix For: 1.4

 Attachments: 20-Force-on-a-current-S00.doc, Ole10NativeEntry.bin


 I have a Word (.doc) document that hits an exception when I run:
 {noformat}
 java -jar tika-app/target/tika-app-1.4-SNAPSHOT.jar 
 /x/tmp/20-Force-on-a-current-S00.doc 
 {noformat}
 Here's the exception:
 {noformat}
 Caused by: java.lang.ArrayIndexOutOfBoundsException: 40
    at org.apache.poi.util.LittleEndian.getShort(LittleEndian.java:225)
    at org.apache.poi.poifs.filesystem.Ole10Native.<init>(Ole10Native.java:139)
    at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:89)
    at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:149)
    at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:135)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:186)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
 {noformat}
 It happens when we try to parse an OLE10 embedded object ... the code
 that does this parsing captures and ignores Ole10NativeException and
 skips the entry ... so I'm wondering if we should also catch AIOOBE
 and skip the entry?  Ie, maybe this entry really is not OLE10, and the
 Ole10Native code is failing to throw Ole10NativeException for it?



[jira] [Created] (TIKA-1072) AIOOBE when handling embedded document in .doc file

2013-02-03 Thread Michael McCandless (JIRA)
Michael McCandless created TIKA-1072:


 Summary: AIOOBE when handling embedded document in .doc file
 Key: TIKA-1072
 URL: https://issues.apache.org/jira/browse/TIKA-1072
 Project: Tika
  Issue Type: Bug
Reporter: Michael McCandless
 Fix For: 1.4
 Attachments: 20-Force-on-a-current-S00.doc

I have a Word (.doc) document that hits an exception when I run:

{noformat}
java -jar tika-app/target/tika-app-1.4-SNAPSHOT.jar 
/x/tmp/20-Force-on-a-current-S00.doc 
{noformat}

Here's the exception:

{noformat}
Caused by: java.lang.ArrayIndexOutOfBoundsException: 40
    at org.apache.poi.util.LittleEndian.getShort(LittleEndian.java:225)
    at org.apache.poi.poifs.filesystem.Ole10Native.<init>(Ole10Native.java:139)
    at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:89)
    at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:149)
    at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:135)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:186)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
{noformat}

It happens when we try to parse an OLE10 embedded object ... the code
that does this parsing captures and ignores Ole10NativeException and
skips the entry ... so I'm wondering if we should also catch AIOOBE
and skip the entry?  Ie, maybe this entry really is not OLE10, and the
Ole10Native code is failing to throw Ole10NativeException for it?




[jira] [Updated] (TIKA-1072) AIOOBE when handling embedded document in .doc file

2013-02-03 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated TIKA-1072:
-

Attachment: 20-Force-on-a-current-S00.doc

 AIOOBE when handling embedded document in .doc file
 ---

 Key: TIKA-1072
 URL: https://issues.apache.org/jira/browse/TIKA-1072
 Project: Tika
  Issue Type: Bug
Reporter: Michael McCandless
 Fix For: 1.4

 Attachments: 20-Force-on-a-current-S00.doc


 I have a Word (.doc) document that hits an exception when I run:
 {noformat}
 java -jar tika-app/target/tika-app-1.4-SNAPSHOT.jar 
 /x/tmp/20-Force-on-a-current-S00.doc 
 {noformat}
 Here's the exception:
 {noformat}
 Caused by: java.lang.ArrayIndexOutOfBoundsException: 40
    at org.apache.poi.util.LittleEndian.getShort(LittleEndian.java:225)
    at org.apache.poi.poifs.filesystem.Ole10Native.<init>(Ole10Native.java:139)
    at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:89)
    at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:149)
    at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:135)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:186)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
 {noformat}
 It happens when we try to parse an OLE10 embedded object ... the code
 that does this parsing captures and ignores Ole10NativeException and
 skips the entry ... so I'm wondering if we should also catch AIOOBE
 and skip the entry?  Ie, maybe this entry really is not OLE10, and the
 Ole10Native code is failing to throw Ole10NativeException for it?



[jira] [Created] (TIKA-1067) Tika extracts non-existent asterisks (*) from .ppt files

2013-01-29 Thread Michael McCandless (JIRA)
Michael McCandless created TIKA-1067:


 Summary: Tika extracts non-existent asterisks (*) from .ppt files
 Key: TIKA-1067
 URL: https://issues.apache.org/jira/browse/TIKA-1067
 Project: Tika
  Issue Type: Bug
Reporter: Michael McCandless


I created a new blank presentation, put in title + subtitle, saved it as .ppt, 
and then ran TikaCLI -t:

{noformat}
<body><div class="slideShow"><div class="slide"><p class="slide-master-content">*<br/>
*<br/>
</p>
<p class="slide-content">Testing<br/>
testing<br/>
</p>
</div>
</div>
<div class="slideNotes"/>

The two extra *'s seem to be coming from the master slide, but I'm not sure 
which text runs they are and how to stop them ...




[jira] [Commented] (TIKA-1062) Add list detection to RTFParser

2013-01-24 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13562060#comment-13562060
 ] 

Michael McCandless commented on TIKA-1062:
--

Hi Axel,

I don't actually know whether Tika has adopted an official code style (anyone?).  
Really I was just carrying forward Lucene's code style (put {} around even 
single-line code blocks to avoid future bug risk...).  You succeeded very well, 
and, yes, the current code style varies :)

 Add list detection to RTFParser
 ---

 Key: TIKA-1062
 URL: https://issues.apache.org/jira/browse/TIKA-1062
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Axel Dörfler
Assignee: Michael McCandless
Priority: Minor
  Labels: patch
 Fix For: 1.4

 Attachments: testRTFListLibreOffice.rtf, 
 testRTFListMicrosoftWord.rtf, tika-rtf-lists.patch


 RTF supports lists, and the parser could support those, too, using HTML 
 ul/ol/li tags.
 I'm attaching a patch that implements basic support for Word 97 and newer 
 lists. Nested lists are not supported correctly, yet, though, and a number of 
 formatting options are ignored.
 I've also added test cases for this, and adapted existing tests where needed.



[jira] [Commented] (TIKA-1062) Add list detection to RTFParser

2013-01-23 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13560927#comment-13560927
 ] 

Michael McCandless commented on TIKA-1062:
--

Should the ListDescriptor list = listTable.get(listID); in isUnorderedList be 
currentListTable.get instead?

 Add list detection to RTFParser
 ---

 Key: TIKA-1062
 URL: https://issues.apache.org/jira/browse/TIKA-1062
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Axel Dörfler
Assignee: Michael McCandless
Priority: Minor
  Labels: patch
 Attachments: testRTFListLibreOffice.rtf, 
 testRTFListMicrosoftWord.rtf, tika-rtf-lists.patch


 RTF supports lists, and the parser could support those, too, using HTML 
 ul/ol/li tags.
 I'm attaching a patch that implements basic support for Word 97 and newer 
 lists. Nested lists are not supported correctly, yet, though, and a number of 
 formatting options are ignored.
 I've also added test cases for this, and adapted existing tests where needed.



[jira] [Resolved] (TIKA-1048) XMLParser should add whitespace between elements

2013-01-06 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved TIKA-1048.
--

Resolution: Fixed

 XMLParser should add whitespace between elements
 

 Key: TIKA-1048
 URL: https://issues.apache.org/jira/browse/TIKA-1048
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 1.3

 Attachments: TIKA-1048.patch, TIKA-1048.patch


 If the incoming XML is compact (ie doesn't have whitespace between elements), 
 I think we should somehow add whitespace between elements when extracting 
 text?



[jira] [Created] (TIKA-1048) XMLParser should add whitespace between elements

2012-12-20 Thread Michael McCandless (JIRA)
Michael McCandless created TIKA-1048:


 Summary: XMLParser should add whitespace between elements
 Key: TIKA-1048
 URL: https://issues.apache.org/jira/browse/TIKA-1048
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Michael McCandless
 Fix For: 1.3
 Attachments: TIKA-1048.patch

If the incoming XML is compact (ie doesn't have whitespace between elements), I 
think we should somehow add whitespace between elements when extracting text?
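
As a rough illustration of one way this could work (not necessarily how XMLParser should or will do it): a plain JDK SAX handler that appends a space every time an element closes, so text from adjacent elements does not run together. SpacedTextExtractor and extract are hypothetical names for this sketch.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class SpacedTextExtractor {
    /** Returns the character content of xml with a space added after each element. */
    static String extract(String xml) {
        final StringBuilder out = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                // Separate this element's text from whatever follows it.
                out.append(' ');
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return out.toString();
    }
}
```

For compact input such as `<a><b>foo</b><c>bar</c></a>` this gives "foo bar" followed by trailing spaces instead of "foobar". The obvious cost is extra whitespace, including inside inline markup, so a real fix might want to be more selective.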



[jira] [Updated] (TIKA-1048) XMLParser should add whitespace between elements

2012-12-20 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated TIKA-1048:
-

Attachment: TIKA-1048.patch

Patch w/ failing test ... I'm not sure where/how to best fix this yet ...

 XMLParser should add whitespace between elements
 

 Key: TIKA-1048
 URL: https://issues.apache.org/jira/browse/TIKA-1048
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Michael McCandless
 Fix For: 1.3

 Attachments: TIKA-1048.patch


 If the incoming XML is compact (ie doesn't have whitespace between elements), 
 I think we should somehow add whitespace between elements when extracting 
 text?



[jira] [Resolved] (TIKA-1031) TikaCLI doesn't create sub-dirs when extracting Zip files

2012-12-01 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved TIKA-1031.
--

Resolution: Fixed

 TikaCLI doesn't create sub-dirs when extracting Zip files
 -

 Key: TIKA-1031
 URL: https://issues.apache.org/jira/browse/TIKA-1031
 Project: Tika
  Issue Type: Bug
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 1.3

 Attachments: TIKA-1031.patch






[jira] [Resolved] (TIKA-1032) Powerpoint (.pptx) can have duplicate embedded ids

2012-12-01 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved TIKA-1032.
--

   Resolution: Fixed
Fix Version/s: 1.3

 Powerpoint (.pptx) can have duplicate embedded ids
 --

 Key: TIKA-1032
 URL: https://issues.apache.org/jira/browse/TIKA-1032
 Project: Tika
  Issue Type: Bug
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 1.3

 Attachments: TIKA-1032.patch


 Apparently the relId is only unique within one slide ... I fixed it to prefix 
 slideN_.
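
 The prefixing described above could look roughly like the sketch below. It is an illustration only: SlideScopedIds and embeddedId are hypothetical names, and the exact id format in the committed patch may differ.

```java
final class SlideScopedIds {
    /** Builds an id that stays unique across slides by prefixing the slide number. */
    static String embeddedId(int slideNumber, String relId) {
        // relId is assumed to be unique only within a single slide.
        return "slide" + slideNumber + "_" + relId;
    }
}
```

 The same relId on two different slides then maps to two different embedded ids.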



[jira] [Commented] (TIKA-712) Master slide text isn't extracted

2012-12-01 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13508010#comment-13508010
 ] 

Michael McCandless commented on TIKA-712:
-

I committed the patch; I'll leave this issue open for a possible future correct 
fix where we can detect boilerplate text in PPT.

 Master slide text isn't extracted
 -

 Key: TIKA-712
 URL: https://issues.apache.org/jira/browse/TIKA-712
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Michael McCandless
 Attachments: testPPT_masterFooter2.ppt, testPPT_masterFooter2.pptx, 
 testPPT_masterFooter.ppt, testPPT_masterFooter.pptx, 
 TIKA-712-master-slide.xml, TIKA-712.patch, TIKA-712.patch


 It looks like we are not getting text from the master slide for PPT
 and PPTX.



[jira] [Resolved] (TIKA-1035) PDF bookmark text is not extracted

2012-12-01 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved TIKA-1035.
--

Resolution: Fixed

 PDF bookmark text is not extracted
 --

 Key: TIKA-1035
 URL: https://issues.apache.org/jira/browse/TIKA-1035
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 1.3

 Attachments: TIKA-1035.patch






[jira] [Resolved] (TIKA-1036) ZIP parsing doesn't leave placeholders for each package entry

2012-12-01 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved TIKA-1036.
--

   Resolution: Fixed
Fix Version/s: 1.3

 ZIP parsing doesn't leave placeholders for each package entry
 -

 Key: TIKA-1036
 URL: https://issues.apache.org/jira/browse/TIKA-1036
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 1.3

 Attachments: TIKA-1036.patch






[jira] [Created] (TIKA-1035) PDF bookmark text is not extracted

2012-11-30 Thread Michael McCandless (JIRA)
Michael McCandless created TIKA-1035:


 Summary: PDF bookmark text is not extracted
 Key: TIKA-1035
 URL: https://issues.apache.org/jira/browse/TIKA-1035
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 1.3






[jira] [Updated] (TIKA-1035) PDF bookmark text is not extracted

2012-11-30 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated TIKA-1035:
-

Attachment: TIKA-1035.patch

Patch w/ test ...

 PDF bookmark text is not extracted
 --

 Key: TIKA-1035
 URL: https://issues.apache.org/jira/browse/TIKA-1035
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 1.3

 Attachments: TIKA-1035.patch






[jira] [Created] (TIKA-1036) ZIP parsing doesn't leave placeholders for each package entry

2012-11-30 Thread Michael McCandless (JIRA)
Michael McCandless created TIKA-1036:


 Summary: ZIP parsing doesn't leave placeholders for each package 
entry
 Key: TIKA-1036
 URL: https://issues.apache.org/jira/browse/TIKA-1036
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Michael McCandless
Assignee: Michael McCandless






[jira] [Updated] (TIKA-1036) ZIP parsing doesn't leave placeholders for each package entry

2012-11-30 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated TIKA-1036:
-

Attachment: TIKA-1036.patch

Patch w/ test ...

 ZIP parsing doesn't leave placeholders for each package entry
 -

 Key: TIKA-1036
 URL: https://issues.apache.org/jira/browse/TIKA-1036
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Michael McCandless
Assignee: Michael McCandless
 Attachments: TIKA-1036.patch






[jira] [Updated] (TIKA-712) Master slide text isn't extracted

2012-11-27 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated TIKA-712:


Attachment: TIKA-712.patch

I think I found a committable workaround (patch) for including text from the 
master slide for PPT documents: I uncommented the existing code, but then 
exclude text that is type 0 (TITLE_TYPE) or 1 (BODY_TYPE), just for the master 
slide.  In my ad-hoc testing this eliminates the boilerplate text but lets 
other user changes to the master slide come through correctly ... this isn't 
perfect but I think it's a good step forward.
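
A rough sketch of that filter in Java, for illustration only. The two type 
values come from the comment above; the Run class and the method name are 
hypothetical stand-ins, not POI's or Tika's actual API:

```java
import java.util.ArrayList;
import java.util.List;

final class MasterSlideTextFilter {
    // Values as stated in the comment: 0 is TITLE_TYPE, 1 is BODY_TYPE.
    static final int TITLE_TYPE = 0;
    static final int BODY_TYPE = 1;

    // Hypothetical stand-in for one text run found on the master slide.
    static final class Run {
        final int type;
        final String text;

        Run(int type, String text) {
            this.type = type;
            this.text = text;
        }
    }

    // Keeps master-slide text except title and body runs, which are assumed
    // to carry the boilerplate placeholder text.
    static List<String> keepNonPlaceholderText(List<Run> masterRuns) {
        List<String> kept = new ArrayList<>();
        for (Run run : masterRuns) {
            if (run.type == TITLE_TYPE || run.type == BODY_TYPE) {
                continue;
            }
            kept.add(run.text);
        }
        return kept;
    }
}
```

Runs of any other type pass through unchanged, which is how user edits to the 
master slide (a footer, say) would still be extracted.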

 Master slide text isn't extracted
 -

 Key: TIKA-712
 URL: https://issues.apache.org/jira/browse/TIKA-712
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Michael McCandless
 Attachments: testPPT_masterFooter2.ppt, testPPT_masterFooter2.pptx, 
 testPPT_masterFooter.ppt, testPPT_masterFooter.pptx, 
 TIKA-712-master-slide.xml, TIKA-712.patch, TIKA-712.patch


 It looks like we are not getting text from the master slide for PPT
 and PPTX.



[jira] [Created] (TIKA-1033) Tika doesn't parse embedded OLE Chart/Graph objects

2012-11-27 Thread Michael McCandless (JIRA)
Michael McCandless created TIKA-1033:


 Summary: Tika doesn't parse embedded OLE Chart/Graph objects
 Key: TIKA-1033
 URL: https://issues.apache.org/jira/browse/TIKA-1033
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Michael McCandless
Priority: Minor
 Attachments: emb.ppt

I have an example ppt that embeds a chart, but Tika mis-identifies it
as an XLS document.

The progID (oleShape.getProgID() in
HSLFExtractor.handleSlideEmbeddedResources) is MSGraph.Chart.8 ... and
we seem to detect it as Excel (application/vnd.ms-excel) but then the
ExcelExtractor hits this exception:

{noformat}
org.apache.poi.hssf.record.RecordFormatException: Unable to construct record 
instance
at 
org.apache.poi.hssf.record.RecordFactory$ReflectionConstructorRecordCreator.create(RecordFactory.java:65)
at 
org.apache.poi.hssf.record.RecordFactory.createSingleRecord(RecordFactory.java:301)
at 
org.apache.poi.hssf.record.RecordFactoryInputStream.readNextRecord(RecordFactoryInputStream.java:285)
at 
org.apache.poi.hssf.record.RecordFactoryInputStream.nextRecord(RecordFactoryInputStream.java:251)
at 
org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents(HSSFEventFactory.java:143)
at 
org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents(HSSFEventFactory.java:106)
at 
org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile(ExcelExtractor.java:302)
at 
org.apache.tika.parser.microsoft.ExcelExtractor.parse(ExcelExtractor.java:147)
{noformat}

Since DelegatingParser silently suppresses all exceptions, when you
run TikaCLI you won't see any exception nor text extracted, but if you
run with -z, it will save 1.xls which if you then try to parse with
TikaCLI hits the above exception.




[jira] [Updated] (TIKA-1033) Tika doesn't parse embedded OLE Chart/Graph objects

2012-11-27 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated TIKA-1033:
-

Attachment: emb.ppt

 Tika doesn't parse embedded OLE Chart/Graph objects
 ---

 Key: TIKA-1033
 URL: https://issues.apache.org/jira/browse/TIKA-1033
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Michael McCandless
Priority: Minor
 Attachments: emb.ppt




[jira] [Commented] (TIKA-1033) Tika doesn't parse embedded OLE Chart/Graph objects

2012-11-27 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13504563#comment-13504563
 ] 

Michael McCandless commented on TIKA-1033:
--

Here's the full stack trace when I parse the .xls file that TikaCLI extracts:
{noformat}
Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected 
RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@4eaf6cb1
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:138)
at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:399)
at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:121)
Caused by: org.apache.poi.hssf.record.RecordFormatException: Unable to 
construct record instance
at 
org.apache.poi.hssf.record.RecordFactory$ReflectionConstructorRecordCreator.create(RecordFactory.java:65)
at 
org.apache.poi.hssf.record.RecordFactory.createSingleRecord(RecordFactory.java:301)
at 
org.apache.poi.hssf.record.RecordFactoryInputStream.readNextRecord(RecordFactoryInputStream.java:285)
at 
org.apache.poi.hssf.record.RecordFactoryInputStream.nextRecord(RecordFactoryInputStream.java:251)
at 
org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents(HSSFEventFactory.java:143)
at 
org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents(HSSFEventFactory.java:106)
at 
org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile(ExcelExtractor.java:292)
at 
org.apache.tika.parser.microsoft.ExcelExtractor.parse(ExcelExtractor.java:144)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:194)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
... 5 more
Caused by: org.apache.poi.hssf.record.RecordFormatException: Not enough data 
(0) to read requested (2) bytes
at 
org.apache.poi.hssf.record.RecordInputStream.checkRecordPosition(RecordInputStream.java:216)
at 
org.apache.poi.hssf.record.RecordInputStream.readShort(RecordInputStream.java:233)
at 
org.apache.poi.hssf.record.WindowOneRecord.<init>(WindowOneRecord.java:71)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
at 
org.apache.poi.hssf.record.RecordFactory$ReflectionConstructorRecordCreator.create(RecordFactory.java:57)
... 15 more
{noformat}
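
The trace nests two "Caused by" levels, and the innermost one (not enough 
data to read a short) is the informative part. A small sketch of walking a 
cause chain down to its root, using only java.lang.Throwable; the RootCause 
class name is made up for illustration:

```java
final class RootCause {
    private RootCause() {}

    // Follows getCause() until there is no further cause. The self-reference
    // check guards against a throwable that reports itself as its own cause.
    static Throwable of(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null && current.getCause() != current) {
            current = current.getCause();
        }
        return current;
    }
}
```

Calling RootCause.of(e).getMessage() on a wrapped exception like the one 
above would surface the innermost message rather than the outer wrapper's.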

 Tika doesn't parse embedded OLE Chart/Graph objects
 ---

 Key: TIKA-1033
 URL: https://issues.apache.org/jira/browse/TIKA-1033
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Michael McCandless
Priority: Minor
 Attachments: emb.ppt



[jira] [Commented] (TIKA-1033) Tika doesn't parse embedded OLE Chart/Graph objects

2012-11-27 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13504668#comment-13504668
 ] 

Michael McCandless commented on TIKA-1033:
--

I asked the person who created this test file; here's his answer:
{noformat}
I created the file with my PowerPoint (PowerPoint 2003). 

To embed the chart:

1. Select from the menu Insert
2. Select chart (I selected the default chart)
3. Place the chart
{noformat}


 Tika doesn't parse embedded OLE Chart/Graph objects
 ---

 Key: TIKA-1033
 URL: https://issues.apache.org/jira/browse/TIKA-1033
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Michael McCandless
Priority: Minor
 Attachments: emb.ppt




[jira] [Commented] (TIKA-1033) Tika doesn't parse embedded OLE Chart/Graph objects

2012-11-27 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13504673#comment-13504673
 ] 

Michael McCandless commented on TIKA-1033:
--

bq. The raw chart object looks to actually be an excel file, 

Hmm, so now I'm very confused :)  Did something go wrong when Tika pulled out 
the bits from emb.ppt to create 1.xls?  When I try to open 1.xls in Excel it's 
unhappy ("Cannot open Microsoft Graph chart gallery files.").

bq. Note that embedded objects in office files are actually stored as the raw 
object (used for editing), and a rendered version of the file (so that viewing 
the parent document is quick, normally an EMF)

Yeah I see separately the *.emf files being extracted by TikaCLI.

 Tika doesn't parse embedded OLE Chart/Graph objects
 ---

 Key: TIKA-1033
 URL: https://issues.apache.org/jira/browse/TIKA-1033
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Michael McCandless
Priority: Minor
 Attachments: emb.ppt




[jira] [Commented] (TIKA-1033) Tika doesn't parse embedded OLE Chart/Graph objects

2012-11-27 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13504703#comment-13504703
 ] 

Michael McCandless commented on TIKA-1033:
--

Interesting: with PowerPoint 2007, when I double-click the embedded chart, it 
pops up a dialogue box saying "To edit this chart using the new features 
available in the 2007 Microsoft Office system, you must first convert it to the 
2007 Office system format.  Do you want to convert this chart to the new 
format?"  [Convert] [Convert All] [Edit Existing].  If I click [Edit Existing] 
it lets me edit the chart data in what looks like Excel, in "Compatibility 
Mode".

OK I'll open a POI bug and reference back to this issue...

Thanks Nick.

 Tika doesn't parse embedded OLE Chart/Graph objects
 ---

 Key: TIKA-1033
 URL: https://issues.apache.org/jira/browse/TIKA-1033
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Michael McCandless
Priority: Minor
 Attachments: emb.ppt




[jira] [Commented] (TIKA-1033) Tika doesn't parse embedded OLE Chart/Graph objects

2012-11-27 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13504726#comment-13504726
 ] 

Michael McCandless commented on TIKA-1033:
--

OK I opened https://issues.apache.org/bugzilla/show_bug.cgi?id=54213

 Tika doesn't parse embedded OLE Chart/Graph objects
 ---

 Key: TIKA-1033
 URL: https://issues.apache.org/jira/browse/TIKA-1033
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Michael McCandless
Priority: Minor
 Attachments: emb.ppt




[jira] [Created] (TIKA-1031) TikaCLI doesn't create sub-dirs when extracting Zip files

2012-11-26 Thread Michael McCandless (JIRA)
Michael McCandless created TIKA-1031:


 Summary: TikaCLI doesn't create sub-dirs when extracting Zip files
 Key: TIKA-1031
 URL: https://issues.apache.org/jira/browse/TIKA-1031
 Project: Tika
  Issue Type: Bug
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 1.3






[jira] [Updated] (TIKA-1031) TikaCLI doesn't create sub-dirs when extracting Zip files

2012-11-26 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated TIKA-1031:
-

Attachment: TIKA-1031.patch

Patch w/ test & fix.

 TikaCLI doesn't create sub-dirs when extracting Zip files
 -

 Key: TIKA-1031
 URL: https://issues.apache.org/jira/browse/TIKA-1031
 Project: Tika
  Issue Type: Bug
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 1.3

 Attachments: TIKA-1031.patch






[jira] [Resolved] (TIKA-1024) An MP3 with an UTF-16 ID3 tag containing only the BOM should produce empty string value for that tag

2012-11-18 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved TIKA-1024.
--

Resolution: Fixed

 An MP3 with an UTF-16 ID3 tag containing only the BOM should produce empty 
 string value for that tag
 

 Key: TIKA-1024
 URL: https://issues.apache.org/jira/browse/TIKA-1024
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 1.3

 Attachments: testNakedUTF16BOM.mp3, TIKA-1024.patch


 This seems to be a difference between JVMs: on IBM's JVM I incorrectly see 
 the BOM as the value of the tag, while on Oracle's JVM I correctly get the 
 empty string.
 I'm not sure if this is a bug in IBM's JVM ... the javadocs are not totally 
 clear how a UTF-16 string containing only the BOM should be decoded by new 
 String(...) ... to fix this I think we should just detect this case and 
 short-circuit empty string return.



[jira] [Resolved] (TIKA-1025) Powerpoint (.ppt) parser doesn't leave placeholder where documents are embedded

2012-11-18 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved TIKA-1025.
--

   Resolution: Fixed
Fix Version/s: 1.3

 Powerpoint (.ppt) parser doesn't leave placeholder where documents are 
 embedded
 ---

 Key: TIKA-1025
 URL: https://issues.apache.org/jira/browse/TIKA-1025
 Project: Tika
  Issue Type: Bug
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 1.3

 Attachments: TIKA-1025.patch






[jira] [Commented] (TIKA-369) Improve accuracy of language detection

2012-11-18 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13499838#comment-13499838
 ] 

Michael McCandless commented on TIKA-369:
-

+1 to cut over to https://code.google.com/p/language-detection

 Improve accuracy of language detection
 --

 Key: TIKA-369
 URL: https://issues.apache.org/jira/browse/TIKA-369
 Project: Tika
  Issue Type: Improvement
  Components: languageidentifier
Affects Versions: 0.6
Reporter: Ken Krugler
Assignee: Ken Krugler
 Attachments: lingdet-mccs.pdf, Surprise and Coincidence.pdf, 
 textcat.pdf


 Currently the LanguageProfile code uses 3-grams to find the best language 
 profile using Pearson's chi-square test. This has three issues:
 1. The results aren't very good for short runs of text. Ted Dunning's paper 
 (attached) indicates that a log-likelihood ratio (LLR) test works much 
 better, which would then make language detection faster due to less text 
 needing to be processed.
 2. The current LanguageIdentifier.isReasonablyCertain() method uses an exact 
 value as a threshold for certainty. This is very sensitive to the amount of 
 text being processed, and thus gives false negative results for short runs of 
 text.
 3. Certainty should also be based on how much better the result is for 
 language X, compared to the next best language. If two languages both had 
 identical sum-of-squares values, and this value was below the threshold, then 
 the result is still not very certain.
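
 Point 3 could be sketched roughly as below. This is an illustration only,
 not the existing LanguageIdentifier code: the class name, the parameters and
 any threshold values passed in are assumptions, and scores are treated as
 distances where lower means a better match.

```java
final class LanguageCertainty {
    private LanguageCertainty() {}

    // Certain only if the best distance is under an absolute cap AND it is
    // clearly better than the runner-up, measured as a relative gap.
    static boolean isReasonablyCertain(double best, double secondBest,
                                       double maxDistance, double minRelativeGap) {
        if (best > maxDistance) {
            return false;
        }
        if (secondBest <= 0.0) {
            // Both candidates are perfect matches, so they are indistinguishable.
            return false;
        }
        double relativeGap = (secondBest - best) / secondBest;
        return relativeGap >= minRelativeGap;
    }
}
```

 Two languages with identical distances give a relative gap of zero, so the
 result is reported as uncertain even when both are below the cap.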



[jira] [Created] (TIKA-1024) An MP3 with an UTF-16 ID3 tag containing only the BOM should produce empty string value for that tag

2012-11-13 Thread Michael McCandless (JIRA)
Michael McCandless created TIKA-1024:


 Summary: An MP3 with an UTF-16 ID3 tag containing only the BOM 
should produce empty string value for that tag
 Key: TIKA-1024
 URL: https://issues.apache.org/jira/browse/TIKA-1024
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 1.3


This seems to be a difference between JVMs: on IBM's JVM I incorrectly see the 
BOM as the value of the tag, while on Oracle's JVM I correctly get the empty 
string.

I'm not sure if this is a bug in IBM's JVM ... the javadocs are not totally 
clear on how a UTF-16 string containing only the BOM should be decoded by new 
String(...) ... to fix this I think we should just detect this case and 
short-circuit, returning the empty string.
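
A minimal sketch of that short-circuit, not the code in TIKA-1024.patch. The 
class and method names are hypothetical, and it assumes the tag payload is 
available as a slice of a byte array:

```java
import java.nio.charset.StandardCharsets;

final class Id3Utf16 {
    private Id3Utf16() {}

    // Decodes a UTF-16 tag value; a payload that is nothing but a BOM is
    // returned as "" without asking the JVM to decode it.
    static String decode(byte[] data, int offset, int length) {
        if (length == 2 && isBom(data[offset], data[offset + 1])) {
            return "";
        }
        return new String(data, offset, length, StandardCharsets.UTF_16);
    }

    // True for either byte order mark: FF FE (little endian) or FE FF (big endian).
    static boolean isBom(byte b0, byte b1) {
        int first = b0 & 0xFF;
        int second = b1 & 0xFF;
        return (first == 0xFF && second == 0xFE) || (first == 0xFE && second == 0xFF);
    }
}
```

Checking the bytes up front means the result no longer depends on how a given 
JVM decodes a BOM-only input.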



[jira] [Updated] (TIKA-1024) An MP3 with an UTF-16 ID3 tag containing only the BOM should produce empty string value for that tag

2012-11-13 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated TIKA-1024:
-

Attachment: testNakedUTF16BOM.mp3

 An MP3 with an UTF-16 ID3 tag containing only the BOM should produce empty 
 string value for that tag
 

 Key: TIKA-1024
 URL: https://issues.apache.org/jira/browse/TIKA-1024
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 1.3

 Attachments: testNakedUTF16BOM.mp3, TIKA-1024.patch


 This seems to be a difference between JVMs: on IBM's JVM I incorrectly see 
 the BOM as the value of the tag, while on Oracle's JVM I correctly get the 
 empty string.
 I'm not sure if this is a bug in IBM's JVM ... the javadocs are not totally 
 clear how a UTF-16 string containing only the BOM should be decoded by new 
 String(...) ... to fix this I think we should just detect this case and 
 short-circuit empty string return.



[jira] [Updated] (TIKA-1024) An MP3 with an UTF-16 ID3 tag containing only the BOM should produce empty string value for that tag

2012-11-13 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated TIKA-1024:
-

Attachment: TIKA-1024.patch

Patch w/ failing test and fix.

 An MP3 with an UTF-16 ID3 tag containing only the BOM should produce empty 
 string value for that tag
 

 Key: TIKA-1024
 URL: https://issues.apache.org/jira/browse/TIKA-1024
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 1.3

 Attachments: testNakedUTF16BOM.mp3, TIKA-1024.patch


 This seems to be a difference between JVMs: on IBM's JVM I incorrectly see 
 the BOM as the value of the tag, while on Oracle's JVM I correctly get the 
 empty string.
 I'm not sure if this is a bug in IBM's JVM ... the javadocs are not totally 
 clear how a UTF-16 string containing only the BOM should be decoded by new 
 String(...) ... to fix this I think we should just detect this case and 
 short-circuit empty string return.



[jira] [Created] (TIKA-1025) Powerpoint (.ppt) parser doesn't leave placeholder where documents are embedded

2012-11-13 Thread Michael McCandless (JIRA)
Michael McCandless created TIKA-1025:


 Summary: Powerpoint (.ppt) parser doesn't leave placeholder where 
documents are embedded
 Key: TIKA-1025
 URL: https://issues.apache.org/jira/browse/TIKA-1025
 Project: Tika
  Issue Type: Bug
Reporter: Michael McCandless
Assignee: Michael McCandless






[jira] [Updated] (TIKA-1025) Powerpoint (.ppt) parser doesn't leave placeholder where documents are embedded

2012-11-13 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated TIKA-1025:
-

Attachment: TIKA-1025.patch

Patch w/ test and fix.

 Powerpoint (.ppt) parser doesn't leave placeholder where documents are 
 embedded
 ---

 Key: TIKA-1025
 URL: https://issues.apache.org/jira/browse/TIKA-1025
 Project: Tika
  Issue Type: Bug
Reporter: Michael McCandless
Assignee: Michael McCandless
 Attachments: TIKA-1025.patch






[jira] [Resolved] (TIKA-1019) Document links in Word documents don't leave a placeholder

2012-11-12 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved TIKA-1019.
--

Resolution: Fixed

 Document links in Word documents don't leave a placeholder
 --

 Key: TIKA-1019
 URL: https://issues.apache.org/jira/browse/TIKA-1019
 Project: Tika
  Issue Type: Bug
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 1.3

 Attachments: testDocumentLink.doc, TIKA-1019.patch






[jira] [Resolved] (TIKA-1019) Document links in Word documents don't leave a placeholder

2012-11-09 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved TIKA-1019.
--

Resolution: Fixed

 Document links in Word documents don't leave a placeholder
 --

 Key: TIKA-1019
 URL: https://issues.apache.org/jira/browse/TIKA-1019
 Project: Tika
  Issue Type: Bug
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 1.3

 Attachments: testDocumentLink.doc, TIKA-1019.patch






[jira] [Reopened] (TIKA-1019) Document links in Word documents don't leave a placeholder

2012-11-09 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reopened TIKA-1019:
--


I reverted my commit for now ... the test file was way too large ...

 Document links in Word documents don't leave a placeholder
 --

 Key: TIKA-1019
 URL: https://issues.apache.org/jira/browse/TIKA-1019
 Project: Tika
  Issue Type: Bug
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 1.3

 Attachments: testDocumentLink.doc, TIKA-1019.patch






[jira] [Assigned] (TIKA-1019) Document links in Word documents don't leave a placeholder

2012-11-07 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned TIKA-1019:


Assignee: Michael McCandless

 Document links in Word documents don't leave a placeholder
 --

 Key: TIKA-1019
 URL: https://issues.apache.org/jira/browse/TIKA-1019
 Project: Tika
  Issue Type: Bug
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 1.3






[jira] [Updated] (TIKA-1019) Document links in Word documents don't leave a placeholder

2012-11-07 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated TIKA-1019:
-

Attachment: testDocumentLink.doc
TIKA-1019.patch

Patch w/ test and fix.


 Document links in Word documents don't leave a placeholder
 --

 Key: TIKA-1019
 URL: https://issues.apache.org/jira/browse/TIKA-1019
 Project: Tika
  Issue Type: Bug
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 1.3

 Attachments: testDocumentLink.doc, TIKA-1019.patch






[jira] [Resolved] (TIKA-1015) Word (.doc) embedded files don't set relationship ID in the Metadata

2012-10-31 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved TIKA-1015.
--

Resolution: Fixed

 Word (.doc) embedded files don't set relationship ID in the Metadata
 

 Key: TIKA-1015
 URL: https://issues.apache.org/jira/browse/TIKA-1015
 Project: Tika
  Issue Type: Bug
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 1.3

 Attachments: TIKA-1015.patch






[jira] [Reopened] (TIKA-953) Tika failed to recognize non-ustar Tar file?

2012-10-31 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reopened TIKA-953:
-


I have another non-ustar tar file that's incorrectly detected as 
application/octet-stream (though the file command identifies it as a tar archive) ...

 Tika failed to recognize non-ustar Tar  file?
 -

 Key: TIKA-953
 URL: https://issues.apache.org/jira/browse/TIKA-953
 Project: Tika
  Issue Type: Bug
  Components: mime
Affects Versions: 1.1
Reporter: Jing Li
 Fix For: 1.2

 Attachments: test.tar


 The file type is indeed POSIX tar archive (GNU) according to the file command 
 on Linux, but Tika recognizes it as application/xhtml+xml.  The class I used 
 is DefaultDetector. 
 Below is the head data of the file:
 99, 102, 101, 114, 98, 114, 97, 99, 104, 101, 46, 48, 48, 54, 55, 54, 50, 55, 
 57, 45, 53, 54, 54, 55, 50, 52, 47, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 0, 0, 0, 0, 0, 0, 0, 48, 48, 48, 48, 55, 48, 48, 0, 48, 48, 48, 48, 48, 48, 
 48, 0, 48, 48, 48, 48, 48, 48, 48, 0, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 
 48, 0, 49, 49, 55, 55, 55, 49, 49, 52, 50, 48, 53, 0, 48, 49, 51, 51, 51, 49, 
 0, 32, 53, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 117, 115, 116, 97, 114, 32, 32, 0, 114, 111, 111, 116, 0, 0, 0, 0, 0, 0, 0, 
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 114, 111, 111, 
 116, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,... 



[jira] [Updated] (TIKA-953) Tika failed to recognize non-ustar Tar file?

2012-10-31 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated TIKA-953:


Attachment: test2.tar

The file command reports this as a tar archive, but:
{noformat}
cat test2.tar | java -jar tika-app/target/tika-app-1.3-SNAPSHOT.jar --detect
{noformat}
says application/octet-stream.

I created the tar file with 7z:
{noformat}
7z a -ttar test2.tar New\ Text\ Document.txt 
{noformat}
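
Since a non-ustar archive has no magic string to match on, one common heuristic is to validate the header checksum instead. Below is a rough sketch of that idea; it is not necessarily what Tika does or should do, and the class and method names are made up. It relies on the usual tar header layout: a 512-byte header whose checksum is stored as octal text in the 8 bytes at offset 148, computed over the whole header with those 8 bytes counted as spaces.

```java
final class TarChecksumSketch {

    private static final int HEADER_SIZE = 512;
    private static final int CHECKSUM_OFFSET = 148;
    private static final int CHECKSUM_LENGTH = 8;

    /** Parses an octal text field; returns -1 if it holds no valid octal number. */
    static long parseOctal(byte[] buf, int offset, int length) {
        long value = 0;
        boolean seenDigit = false;
        for (int i = offset; i < offset + length; i++) {
            int b = buf[i] & 0xFF;
            if (b == 0 || b == ' ') {
                if (seenDigit) {
                    break;      // trailing NUL or space ends the number
                }
                continue;       // skip leading padding
            }
            if (b < '0' || b > '7') {
                return -1;
            }
            value = value * 8 + (b - '0');
            seenDigit = true;
        }
        return seenDigit ? value : -1;
    }

    /** True if the first 512 bytes carry a self-consistent tar header checksum. */
    static boolean looksLikeTar(byte[] header) {
        if (header.length < HEADER_SIZE) {
            return false;
        }
        long stored = parseOctal(header, CHECKSUM_OFFSET, CHECKSUM_LENGTH);
        if (stored < 0) {
            return false;
        }
        long sum = 0;
        for (int i = 0; i < HEADER_SIZE; i++) {
            boolean inChecksumField =
                    i >= CHECKSUM_OFFSET && i < CHECKSUM_OFFSET + CHECKSUM_LENGTH;
            sum += inChecksumField ? ' ' : (header[i] & 0xFF);
        }
        return sum == stored;
    }
}
```

Some old tar implementations summed the bytes as signed values, so a more tolerant check might accept either sum. A detector based on this needs to read the first 512 bytes rather than just a short magic prefix.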

 Tika failed to recognize non-ustar Tar  file?
 -

 Key: TIKA-953
 URL: https://issues.apache.org/jira/browse/TIKA-953
 Project: Tika
  Issue Type: Bug
  Components: mime
Affects Versions: 1.1
Reporter: Jing Li
 Fix For: 1.2

 Attachments: test2.tar, test.tar


 The file type is indeed POSIX tar archive (GNU) according to the file command 
 on Linux, but Tika recognizes it as application/xhtml+xml.  The class I used 
 is DefaultDetector. 
 Below is the head data of the file:
 99, 102, 101, 114, 98, 114, 97, 99, 104, 101, 46, 48, 48, 54, 55, 54, 50, 55, 
 57, 45, 53, 54, 54, 55, 50, 52, 47, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 0, 0, 0, 0, 0, 0, 0, 48, 48, 48, 48, 55, 48, 48, 0, 48, 48, 48, 48, 48, 48, 
 48, 0, 48, 48, 48, 48, 48, 48, 48, 0, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 
 48, 0, 49, 49, 55, 55, 55, 49, 49, 52, 50, 48, 53, 0, 48, 49, 51, 51, 51, 49, 
 0, 32, 53, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 117, 115, 116, 97, 114, 32, 32, 0, 114, 111, 111, 116, 0, 0, 0, 0, 0, 0, 0, 
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 114, 111, 111, 
 116, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,... 



[jira] [Created] (TIKA-1015) Word (.doc) embedded files don't set relationship ID in the Metadata

2012-10-30 Thread Michael McCandless (JIRA)
Michael McCandless created TIKA-1015:


 Summary: Word (.doc) embedded files don't set relationship ID in 
the Metadata
 Key: TIKA-1015
 URL: https://issues.apache.org/jira/browse/TIKA-1015
 Project: Tika
  Issue Type: Bug
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 1.3






[jira] [Updated] (TIKA-1015) Word (.doc) embedded files don't set relationship ID in the Metadata

2012-10-30 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated TIKA-1015:
-

Attachment: TIKA-1015.patch

Simple patch, but my only slight hesitation is I added an argument to the 
protected AbstractPOIFSExtractor.handleEmbeddedResource method.

I'm assuming this is OK to do (it's intended for the concrete per-document-type 
subclasses we have), but if an expert Tika user out there has a custom 
subclass, and they invoke this method, then they'll have to update their 
sources ... but this is very expert so I think it's OK.

 Word (.doc) embedded files don't set relationship ID in the Metadata
 

 Key: TIKA-1015
 URL: https://issues.apache.org/jira/browse/TIKA-1015
 Project: Tika
  Issue Type: Bug
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 1.3

 Attachments: TIKA-1015.patch






[jira] [Resolved] (TIKA-1011) Exception (Null charset name) processing .mhtml file

2012-10-26 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved TIKA-1011.
--

Resolution: Fixed

 Exception (Null charset name) processing .mhtml file
 

 Key: TIKA-1011
 URL: https://issues.apache.org/jira/browse/TIKA-1011
 Project: Tika
  Issue Type: Bug
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 1.3

 Attachments: TIKA-1011.patch


 This small test.mhtml file:
 {noformat}
 From: Saved by Windows Internet Explorer 8
 Subject: Index Pages
 Date: Tue, 28 Aug 2012 09:53:28 +0300
 MIME-Version: 1.0
 Content-Type: multipart/related;
   type=multipart/alternative;
   boundary==_NextPart_000__01CD8502.F991E790
 X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.6157
 This is a multi-part message in MIME format.
 --=_NextPart_000__01CD8502.F991E790
 Content-Type: multipart/alternative;
   boundary==_NextPart_001_0023_01CD8502.F99DCE70
 --=_NextPart_001_0023_01CD8502.F99DCE70
 Content-Type: text/html;
   charset=x-user-defined
 Content-Transfer-Encoding: quoted-printable
 {noformat}
 Hits this exception when run through TikaCLI:
 {noformat}
 <?xml version="1.0" encoding="UTF-8"?>Exception in thread "main" 
 org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
 org.apache.tika.parser.html.HtmlParser@37e67d34
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at 
 org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
   at 
 org.apache.tika.parser.mail.MailContentHandler.body(MailContentHandler.java:102)
   at 
 org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:133)
   at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:76)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at 
 org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
   at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:138)
   at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:399)
   at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:121)
 Caused by: java.lang.IllegalArgumentException: Null charset name
   at java.nio.charset.Charset.lookup(Charset.java:467)
   at java.nio.charset.Charset.forName(Charset.java:540)
   at 
 org.apache.tika.parser.txt.CharsetDetector.setCanonicalDeclaredEncoding(CharsetDetector.java:352)
   at 
 org.apache.tika.parser.txt.CharsetDetector.setDeclaredEncoding(CharsetDetector.java:75)
   at 
 org.apache.tika.parser.txt.Icu4jEncodingDetector.detect(Icu4jEncodingDetector.java:49)
   at 
 org.apache.tika.detect.AutoDetectReader.detect(AutoDetectReader.java:51)
   at 
 org.apache.tika.detect.AutoDetectReader.<init>(AutoDetectReader.java:92)
   at 
 org.apache.tika.detect.AutoDetectReader.<init>(AutoDetectReader.java:98)
   at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:74)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   ... 11 more
 {noformat}



[jira] [Created] (TIKA-1011) Exception (Null charset name) processing .mhtml file

2012-10-25 Thread Michael McCandless (JIRA)
Michael McCandless created TIKA-1011:


 Summary: Exception (Null charset name) processing .mhtml file
 Key: TIKA-1011
 URL: https://issues.apache.org/jira/browse/TIKA-1011
 Project: Tika
  Issue Type: Bug
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 1.3


This small test.mhtml file:

{noformat}
From: Saved by Windows Internet Explorer 8
Subject: Index Pages
Date: Tue, 28 Aug 2012 09:53:28 +0300
MIME-Version: 1.0
Content-Type: multipart/related;
type=multipart/alternative;
boundary==_NextPart_000__01CD8502.F991E790
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.6157

This is a multi-part message in MIME format.

--=_NextPart_000__01CD8502.F991E790
Content-Type: multipart/alternative;
boundary==_NextPart_001_0023_01CD8502.F99DCE70


--=_NextPart_001_0023_01CD8502.F99DCE70
Content-Type: text/html;
charset=x-user-defined
Content-Transfer-Encoding: quoted-printable
{noformat}

Hits this exception when run through TikaCLI:

{noformat}
<?xml version="1.0" encoding="UTF-8"?>Exception in thread "main" 
org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
org.apache.tika.parser.html.HtmlParser@37e67d34
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at 
org.apache.tika.parser.mail.MailContentHandler.body(MailContentHandler.java:102)
at 
org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:133)
at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:76)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:138)
at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:399)
at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:121)
Caused by: java.lang.IllegalArgumentException: Null charset name
at java.nio.charset.Charset.lookup(Charset.java:467)
at java.nio.charset.Charset.forName(Charset.java:540)
at 
org.apache.tika.parser.txt.CharsetDetector.setCanonicalDeclaredEncoding(CharsetDetector.java:352)
at 
org.apache.tika.parser.txt.CharsetDetector.setDeclaredEncoding(CharsetDetector.java:75)
at 
org.apache.tika.parser.txt.Icu4jEncodingDetector.detect(Icu4jEncodingDetector.java:49)
at 
org.apache.tika.detect.AutoDetectReader.detect(AutoDetectReader.java:51)
at 
org.apache.tika.detect.AutoDetectReader.<init>(AutoDetectReader.java:92)
at 
org.apache.tika.detect.AutoDetectReader.<init>(AutoDetectReader.java:98)
at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:74)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
... 11 more
{noformat}
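
The trace shows Charset.forName being handed a null name. A defensive lookup along the following lines would avoid the exception; this is only a sketch with made-up class and method names, not the CharsetDetector code, and it treats a missing, malformed or unrecognised declared encoding as "no hint" rather than as an error.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.UnsupportedCharsetException;

final class SafeCharsetLookup {

    /** Returns the named charset, or null if the name is missing, malformed or unsupported. */
    static Charset forNameOrNull(String name) {
        if (name == null || name.trim().isEmpty()) {
            return null; // Charset.forName(null) throws "Null charset name"
        }
        try {
            return Charset.forName(name.trim());
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            return null; // name the JVM does not accept or does not know
        }
    }

    public static void main(String[] args) {
        System.out.println(forNameOrNull(null));
        System.out.println(forNameOrNull("UTF-8"));
    }
}
```

A caller would then skip the declared-encoding hint whenever the lookup returns null and fall back to plain detection.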




[jira] [Updated] (TIKA-1011) Exception (Null charset name) processing .mhtml file

2012-10-25 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated TIKA-1011:
-

Attachment: TIKA-1011.patch

 Exception (Null charset name) processing .mhtml file
 

 Key: TIKA-1011
 URL: https://issues.apache.org/jira/browse/TIKA-1011
 Project: Tika
  Issue Type: Bug
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 1.3

 Attachments: TIKA-1011.patch


 This small test.mhtml file:
 {noformat}
 From: Saved by Windows Internet Explorer 8
 Subject: Index Pages
 Date: Tue, 28 Aug 2012 09:53:28 +0300
 MIME-Version: 1.0
 Content-Type: multipart/related;
   type=multipart/alternative;
   boundary==_NextPart_000__01CD8502.F991E790
 X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.6157
 This is a multi-part message in MIME format.
 --=_NextPart_000__01CD8502.F991E790
 Content-Type: multipart/alternative;
   boundary==_NextPart_001_0023_01CD8502.F99DCE70
 --=_NextPart_001_0023_01CD8502.F99DCE70
 Content-Type: text/html;
   charset=x-user-defined
 Content-Transfer-Encoding: quoted-printable
 {noformat}
 Hits this exception when run through TikaCLI:
 {noformat}
 <?xml version="1.0" encoding="UTF-8"?>Exception in thread "main" 
 org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
 org.apache.tika.parser.html.HtmlParser@37e67d34
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at 
 org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
   at 
 org.apache.tika.parser.mail.MailContentHandler.body(MailContentHandler.java:102)
   at 
 org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:133)
   at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:76)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at 
 org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
   at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:138)
   at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:399)
   at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:121)
 Caused by: java.lang.IllegalArgumentException: Null charset name
   at java.nio.charset.Charset.lookup(Charset.java:467)
   at java.nio.charset.Charset.forName(Charset.java:540)
   at 
 org.apache.tika.parser.txt.CharsetDetector.setCanonicalDeclaredEncoding(CharsetDetector.java:352)
   at 
 org.apache.tika.parser.txt.CharsetDetector.setDeclaredEncoding(CharsetDetector.java:75)
   at 
 org.apache.tika.parser.txt.Icu4jEncodingDetector.detect(Icu4jEncodingDetector.java:49)
   at 
 org.apache.tika.detect.AutoDetectReader.detect(AutoDetectReader.java:51)
   at 
 org.apache.tika.detect.AutoDetectReader.<init>(AutoDetectReader.java:92)
   at 
 org.apache.tika.detect.AutoDetectReader.<init>(AutoDetectReader.java:98)
   at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:74)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   ... 11 more
 {noformat}



[jira] [Created] (TIKA-1010) Embedded documents in RTF are not extracted

2012-10-19 Thread Michael McCandless (JIRA)
Michael McCandless created TIKA-1010:


 Summary: Embedded documents in RTF are not extracted
 Key: TIKA-1010
 URL: https://issues.apache.org/jira/browse/TIKA-1010
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Michael McCandless


When an RTF doc embeds a doc it looks like this:

{noformat}
{\object\objemb
\objw628\objh765{\*\objclass Package}{\*\objdata 
0105020008005061636b616765006600
020048772e74787400433a5c444f43554d457e315c6967616c73685c4465736b746f705c48572e7478740003002200433a5c444f43554d457e315c6967616c73685c4465736b746f705c48572e747874000b0048656c6c6f20576f726c64010505000d004d45544146494c455049435400
5404bbfaee00080054044505
01000903730002001c0005000b0205000c02320029001c00fb02f5ff900100015461686f6d6155170a7000fc070058b1f37761b1f3772040f5774936668304002d010500090205000102ff000500
0201010005002e01060009002105060048772e747874210015001c00fb021700bc020102022253797374656d493666830a0026008a018cfc070004002d0101000300}
{noformat}

But, unfortunately, the format of those hex bytes is not spelled out
in the RTF spec ... the spec merely says the bytes are saved by the
OLESaveToStream function ... and I haven't been able to find a
description of what the bytes mean.

In this case they are a Package object (\objclass Package), which I
think is an [old?] way to wrap any non-OLE file (this is just a .txt
file).

Here's the hex dump:
{noformat}
00000000  01 05 00 00 02 00 00 00  08 00 00 00 50 61 63 6b  |............Pack|
00000010  61 67 65 00 00 00 00 00  00 00 00 00 66 00 00 00  |age.........f...|
00000020  02 00 48 77 2e 74 78 74  00 43 3a 5c 44 4f 43 55  |..Hw.txt.C:\DOCU|
00000030  4d 45 7e 31 5c 69 67 61  6c 73 68 5c 44 65 73 6b  |ME~1\igalsh\Desk|
00000040  74 6f 70 5c 48 57 2e 74  78 74 00 00 00 03 00 22  |top\HW.txt....."|
00000050  00 00 00 43 3a 5c 44 4f  43 55 4d 45 7e 31 5c 69  |...C:\DOCUME~1\i|
00000060  67 61 6c 73 68 5c 44 65  73 6b 74 6f 70 5c 48 57  |galsh\Desktop\HW|
00000070  2e 74 78 74 00 0b 00 00  00 48 65 6c 6c 6f 20 57  |.txt.....Hello W|
00000080  6f 72 6c 64 00 00 01 05  00 00 05 00 00 00 0d 00  |orld............|
00000090  00 00 4d 45 54 41 46 49  4c 45 50 49 43 54 00 54  |..METAFILEPICT.T|
000000a0  04 00 00 bb fa ff ff ee  00 00 00 08 00 54 04 45  |.............T.E|
000000b0  05 00 00 01 00 09 00 00  03 73 00 00 00 02 00 1c  |.........s......|
000000c0  00 00 00 00 00 05 00 00  00 0b 02 00 00 00 00 05  |................|
000000d0  00 00 00 0c 02 32 00 29  00 1c 00 00 00 fb 02 f5  |.....2.)........|
000000e0  ff 00 00 00 00 00 00 90  01 00 00 00 01 00 00 00  |................|
000000f0  00 54 61 68 6f 6d 61 00  00 55 17 0a 70 00 fc 07  |.Tahoma..U..p...|
00000100  00 58 b1 f3 77 61 b1 f3  77 20 40 f5 77 49 36 66  |.X..wa..w @.wI6f|
00000110  83 04 00 00 00 2d 01 00  00 05 00 00 00 09 02 00  |.....-..........|
00000120  00 00 00 05 00 00 00 01  02 ff ff ff 00 05 00 00  |................|
00000130  00 02 01 01 00 00 00 05  00 00 00 2e 01 06 00 00  |................|
00000140  00 09 00 00 00 21 05 06  00 48 77 2e 74 78 74 21  |.....!...Hw.txt!|
00000150  00 15 00 1c 00 00 00 fb  02 10 00 07 00 00 00 00  |................|
00000160  00 bc 02 00 00 00 00 01  02 02 22 53 79 73 74 65  |.........."Syste|
00000170  6d 00 00 49 36 66 83 00  00 0a 00 26 00 8a 01 00  |m..I6f.....&....|
00000180  00 00 00 ff ff ff ff 8c  fc 07 00 04 00 00 00 2d  |...............-|
00000190  01 01 00 03 00 00 00 00  00                       |.........|
00000199
{noformat}

Anyway I have no idea how to decode the bytes at this point ... just
opening the issue in case anyone else does!
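
As a starting point for anyone picking this up, here is a small sketch that turns the \objdata hex text into bytes and reads the first few fields. The layout is only a guess read off the dump above, not taken from any spec: two little-endian 32-bit values (0x00000501 and 2 in this sample), then a 32-bit length (8) followed by a NUL-terminated class name ("Package"). The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

final class ObjDataSketch {

    /** Converts hex text (whitespace allowed) into raw bytes. */
    static byte[] fromHex(String hex) {
        String clean = hex.replaceAll("\\s+", "");
        if (clean.length() % 2 != 0) {
            throw new IllegalArgumentException("odd number of hex digits");
        }
        byte[] out = new byte[clean.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(clean.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    /** Reads the class name, assuming the header layout guessed from the dump. */
    static String readClassName(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        buf.getInt();                 // first field, 0x00000501 in the sample
        buf.getInt();                 // second field, 2 in the sample
        int length = buf.getInt();    // name length including the trailing NUL
        if (length <= 0 || length > buf.remaining()) {
            throw new IllegalArgumentException("unexpected name length: " + length);
        }
        byte[] name = new byte[length];
        buf.get(name);
        return new String(name, 0, length - 1, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] header = fromHex("01 05 00 00 02 00 00 00 08 00 00 00 50 61 63 6b 61 67 65 00");
        System.out.println(readClassName(header));
    }
}
```

Further into the dump the same pattern of a 32-bit length followed by data seems to repeat (0x22 before the second copy of the path, 0x0b before "Hello World"), which suggests the wrapped file's bytes could be pulled out the same way, but that is untested.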




[jira] [Updated] (TIKA-1005) In Microsoft Office Word 2010 documents, text inside a textbox is not extracted/parsed out.

2012-10-13 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated TIKA-1005:
-

Attachment: TIKA-1005.patch

Patch w/ test ...

 In Microsoft Office Word 2010 documents, text inside a textbox is not 
 extracted/parsed out.
 ---

 Key: TIKA-1005
 URL: https://issues.apache.org/jira/browse/TIKA-1005
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.2
 Environment: Windows 7, Windows Server 2008, Windows Server 2008 R2 
 (32bit and 64bit each)
Reporter: David A. Patterson
Assignee: Michael McCandless
 Attachments: Textbox example.docx, TIKA-1005.patch


 Text inside a textbox, which itself can be in the body, the header or the 
 footer, is not extracted using any type of parser (including 
 AutoDetectParser) in combination with any type of ContentHandler.  This is 
 NOT a duplicate of TIKA-904.  This specifically concerns the .docx file 
 format.



[jira] [Assigned] (TIKA-1006) NPE in extractParagraph (styleClass) in XWPFWordExtractorDecorator

2012-10-12 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned TIKA-1006:


Assignee: Michael McCandless

 NPE in extractParagraph (styleClass) in XWPFWordExtractorDecorator
 --

 Key: TIKA-1006
 URL: https://issues.apache.org/jira/browse/TIKA-1006
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.2
Reporter: Sture Svensson
Assignee: Michael McCandless
Priority: Minor
 Attachments: fix.patch


 The following line
 TagAndStyle tas = WordExtractor.buildParagraphTagAndStyle(style.getName(), paragraph.getPartType() == BodyType.TABLECELL);
 throws an NPE if style is null. This should be checked; a patch is attached.
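The kind of check being asked for can be sketched roughly as below. This is only an illustration of the null guard, not the contents of fix.patch: the Style class and styleNameOrNull are hypothetical stand-ins, not the real POI or Tika types, and what the caller should do when the name is null (for example fall back to a plain paragraph) is left open.

```java
// Hypothetical stand-ins for illustration; these are not the real POI or Tika types.
final class NullStyleGuardSketch {

    /** Minimal stand-in for a paragraph style that may be absent. */
    static final class Style {
        private final String name;

        Style(String name) {
            this.name = name;
        }

        String getName() {
            return name;
        }
    }

    /** Returns the style name, or null when the paragraph has no style, instead of throwing. */
    static String styleNameOrNull(Style style) {
        return style == null ? null : style.getName();
    }

    public static void main(String[] args) {
        System.out.println(styleNameOrNull(new Style("Heading1")));
        System.out.println(styleNameOrNull(null));
    }
}
```

The point is simply that style.getName() is never reached when style is null, so the NPE goes away and the caller decides on a default.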



[jira] [Commented] (TIKA-1006) NPE in extractParagraph (styleClass) in XWPFWordExtractorDecorator

2012-10-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13474947#comment-13474947
 ] 

Michael McCandless commented on TIKA-1006:
--

Thanks Sture, that patch looks good!

Do you have an example .docx showing the issue?  Would be nice to commit a test 
case along with the bug fix ...

 NPE in extractParagraph (styleClass) in XWPFWordExtractorDecorator
 --

 Key: TIKA-1006
 URL: https://issues.apache.org/jira/browse/TIKA-1006
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.2
Reporter: Sture Svensson
Assignee: Michael McCandless
Priority: Minor
 Attachments: fix.patch


 The following line
 TagAndStyle tas = WordExtractor.buildParagraphTagAndStyle(style.getName(), paragraph.getPartType() == BodyType.TABLECELL);
 throws an NPE if style is null. This should be checked; a patch is attached.



[jira] [Assigned] (TIKA-1005) In Microsoft Office Word 2010 documents, text inside a textbox is not extracted/parsed out.

2012-10-12 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned TIKA-1005:


Assignee: Michael McCandless

 In Microsoft Office Word 2010 documents, text inside a textbox is not 
 extracted/parsed out.
 ---

 Key: TIKA-1005
 URL: https://issues.apache.org/jira/browse/TIKA-1005
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.2
 Environment: Windows 7, Windows Server 2008, Windows Server 2008 R2 
 (32bit and 64bit each)
Reporter: David A. Patterson
Assignee: Michael McCandless
 Attachments: Textbox example.docx


 Text inside a textbox, which itself can be in the body, the header or the 
 footer, is not extracted using any type of parser (including 
 AutoDetectParser) in combination with any type of ContentHandler.  This is 
 NOT a duplicate of TIKA-904.  This specifically concerns the .docx file 
 format.



[jira] [Commented] (TIKA-1005) In Microsoft Office Word 2010 documents, text inside a textbox is not extracted/parsed out.

2012-10-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13474958#comment-13474958
 ] 

Michael McCandless commented on TIKA-1005:
--

Thanks David, I'll dig!

 In Microsoft Office Word 2010 documents, text inside a textbox is not 
 extracted/parsed out.
 ---

 Key: TIKA-1005
 URL: https://issues.apache.org/jira/browse/TIKA-1005
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.2
 Environment: Windows 7, Windows Server 2008, Windows Server 2008 R2 
 (32bit and 64bit each)
Reporter: David A. Patterson
Assignee: Michael McCandless
 Attachments: Textbox example.docx


 Text inside a textbox, which itself can be in the body, the header or the 
 footer, is not extracted using any type of parser (including 
 AutoDetectParser) in combination with any type of ContentHandler.  This is 
 NOT a duplicate of TIKA-904.  This specifically concerns the .docx file 
 format.



[jira] [Resolved] (TIKA-1006) NPE in extractParagraph (styleClass) in XWPFWordExtractorDecorator

2012-10-12 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved TIKA-1006.
--

   Resolution: Fixed
Fix Version/s: 1.3

Thanks Sture, I just committed the test document and fix!

 NPE in extractParagraph (styleClass) in XWPFWordExtractorDecorator
 --

 Key: TIKA-1006
 URL: https://issues.apache.org/jira/browse/TIKA-1006
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.2
Reporter: Sture Svensson
Assignee: Michael McCandless
Priority: Minor
 Fix For: 1.3

 Attachments: example.docx, fix.patch


 The following line
 TagAndStyle tas = WordExtractor.buildParagraphTagAndStyle(style.getName(), paragraph.getPartType() == BodyType.TABLECELL);
 throws an NPE if style is null. This should be checked; a patch is attached.



[jira] [Commented] (TIKA-1005) In Microsoft Office Word 2010 documents, text inside a textbox is not extracted/parsed out.

2012-10-11 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13474250#comment-13474250
 ] 

Michael McCandless commented on TIKA-1005:
--

Could you attach an example showing the problem?  Thanks.

 In Microsoft Office Word 2010 documents, text inside a textbox is not 
 extracted/parsed out.
 ---

 Key: TIKA-1005
 URL: https://issues.apache.org/jira/browse/TIKA-1005
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.2
 Environment: Windows 7, Windows Server 2008, Windows Server 2008 R2 
 (32bit and 64bit each)
Reporter: David A. Patterson

 Text inside a textbox, which itself can be in the body, the header or the 
 footer, is not extracted using any type of parser (including 
 AutoDetectParser) in combination with any type of ContentHandler.  This is 
 NOT a duplicate of TIKA-904.  This specifically concerns the .docx file 
 format.



[jira] [Resolved] (TIKA-997) Leave a placeholder when documents are embedded in .pptx documents

2012-09-28 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved TIKA-997.
-

   Resolution: Fixed
Fix Version/s: 1.3

 Leave a placeholder when documents are embedded in .pptx documents
 --

 Key: TIKA-997
 URL: https://issues.apache.org/jira/browse/TIKA-997
 Project: Tika
  Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 1.3

 Attachments: TIKA-997.patch


 Just like TIKA-956, we should leave a <div class="embedded" id="XXX"/> to 
 record where a given sub-document appeared.



[jira] [Updated] (TIKA-997) Leave a placeholder when documents are embedded in .pptx documents

2012-09-26 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated TIKA-997:


Attachment: TIKA-997.patch

Patch.

It's not perfect, because the placeholder will appear at the end of the slide 
that embedded the document.  I think to do better we'd need to parse out x/y 
positions of each element and sort that, but that's going to get rather hairy 
... so at least this is progress.
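For what it's worth, the "sort by x/y" idea could look something like the sketch below. It is not what the patch does: the Element class, its fields and readingOrder are hypothetical, and it assumes each slide element (including the embedded-document placeholder) already has integer x/y coordinates, which is exactly the part that would be hairy to extract.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch; Element is not a POI or Tika type.
final class SlideOrderSketch {

    /** A slide element with an assumed top-left position. */
    static final class Element {
        final String label;
        final int x;
        final int y;

        Element(String label, int x, int y) {
            this.label = label;
            this.x = x;
            this.y = y;
        }
    }

    /** Returns element labels top-to-bottom, then left-to-right, without modifying the input. */
    static List<String> readingOrder(List<Element> elements) {
        List<Element> sorted = new ArrayList<>(elements);
        sorted.sort(Comparator.comparingInt((Element e) -> e.y)
                .thenComparingInt(e -> e.x));
        List<String> labels = new ArrayList<>();
        for (Element e : sorted) {
            labels.add(e.label);
        }
        return labels;
    }
}
```

With positions available, the placeholder would land between its neighbours rather than at the end of the slide; a strict y-then-x sort is still crude for overlapping or multi-column layouts.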

 Leave a placeholder when documents are embedded in .pptx documents
 --

 Key: TIKA-997
 URL: https://issues.apache.org/jira/browse/TIKA-997
 Project: Tika
  Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
 Attachments: TIKA-997.patch


 Just like TIKA-956, we should leave a <div class="embedded" id="XXX"/> to 
 record where a given sub-document appeared.



[jira] [Created] (TIKA-999) RTF Parser doesn't extract page/word/character count metadata

2012-09-26 Thread Michael McCandless (JIRA)
Michael McCandless created TIKA-999:
---

 Summary: RTF Parser doesn't extract page/word/character count 
metadata
 Key: TIKA-999
 URL: https://issues.apache.org/jira/browse/TIKA-999
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 1.3





