[jira] [Commented] (TIKA-1109) Metadata not extracted before the context in OOXML (pptx)

2013-06-27 Thread Daniel Bonniot de Ruisselet (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13694521#comment-13694521
 ] 

Daniel Bonniot de Ruisselet commented on TIKA-1109:
---

Nick, thanks a lot for your explanation. If I understand correctly, you are 
saying that, in general, the metadata cannot be guaranteed to be available 
during parsing, since whether that is possible depends on the format. That 
makes complete sense.

Here I am asking specifically about the OOXML formats, with an example pptx 
file. As I understand it, the OOXML formats are zip files containing XML 
files. In test-classes/test-documents/testPPT.pptx, the metadata seems to be 
inside docProps/core.xml. Would it be possible to read the metadata first from 
there, before starting the parsing?
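
To make the question concrete, here is a minimal sketch of reading dc:title straight out of docProps/core.xml using only the JDK. It is not how Tika or POI open OOXML packages; the class name CorePropertiesPeek and the helpers readTitle and samplePackage are made up for illustration, and the sample package contains nothing but the core properties part.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class CorePropertiesPeek {
    // Dublin Core namespace used by dc:title in docProps/core.xml.
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    /** Returns the dc:title found in docProps/core.xml, or null if there is none. */
    static String readTitle(InputStream ooxml) {
        try (ZipInputStream zip = new ZipInputStream(ooxml)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!"docProps/core.xml".equals(entry.getName())) {
                    continue;
                }
                // Buffer the entry so the XML parser cannot close the zip stream.
                ByteArrayOutputStream xml = new ByteArrayOutputStream();
                byte[] chunk = new byte[4096];
                int n;
                while ((n = zip.read(chunk)) != -1) {
                    xml.write(chunk, 0, n);
                }
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                factory.setNamespaceAware(true);
                Document doc = factory.newDocumentBuilder()
                        .parse(new ByteArrayInputStream(xml.toByteArray()));
                NodeList titles = doc.getElementsByTagNameNS(DC_NS, "title");
                return titles.getLength() > 0 ? titles.item(0).getTextContent() : null;
            }
            return null;
        } catch (Exception e) {
            throw new IllegalStateException("could not read core properties", e);
        }
    }

    /** Hypothetical helper: builds a tiny in-memory zip holding only docProps/core.xml. */
    static byte[] samplePackage(String title) {
        // The title is inserted as-is, so it must not contain XML special characters.
        String xml = "<cp:coreProperties"
                + " xmlns:cp=\"http://schemas.openxmlformats.org/package/2006/metadata/core-properties\""
                + " xmlns:dc=\"" + DC_NS + "\">"
                + "<dc:title>" + title + "</dc:title></cp:coreProperties>";
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            zip.putNextEntry(new ZipEntry("docProps/core.xml"));
            zip.write(xml.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] pkg = samplePackage("Attachment Test");
        System.out.println(readTitle(new ByteArrayInputStream(pkg)));
    }
}
```

Because the zip is read as a stream, the title only becomes available once the core.xml entry is reached, so a real implementation would more likely open the package with random access.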


 Metadata not extracted before the context in OOXML (pptx)
 -

              Key: TIKA-1109
              URL: https://issues.apache.org/jira/browse/TIKA-1109
          Project: Tika
       Issue Type: Bug
       Components: parser
         Reporter: Daniel Bonniot de Ruisselet
         Priority: Critical
          Fix For: 1.5


 It seems that when processing OOXML documents, the metadata is only read 
 after the text. This means it's impossible to use the metadata while 
 processing the text. I think it would be more useful to have the metadata 
 populated first.
 As a symptom:
 java -jar tika-app-1.3.jar test-classes/test-documents/testPPT.pptx
 outputs only as metadata:
 <meta name="Content-Length" content="36518"/>
 <meta name="Content-Type" 
 content="application/vnd.openxmlformats-officedocument.presentationml.presentation"/>
 <meta name="resourceName" content="testPPT.pptx"/>
 while there is more metadata in the file (e.g. 
 <dc:title>Attachment Test</dc:title>).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (TIKA-1109) Metadata not extracted before the context in OOXML (pptx)

2013-06-27 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13694612#comment-13694612
 ] 

Nick Burch commented on TIKA-1109:
--

For ooxml files, the metadata is mostly in a few different xml files within the 
zip. For excel, there's also a few bits stored in the main spreadsheet / sheet 
stream too...

Not sure if it would break things if we did most of the metadata fetching 
first. Could you try moving the metadata line up, and see if the unit tests all 
still pass?


[jira] [Commented] (TIKA-1053) Upgrade Tika Parsers to use ASM 4.x

2013-06-27 Thread Vincent Massol (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13694624#comment-13694624
 ] 

Vincent Massol commented on TIKA-1053:
--

Thanks again for fixing this. Any idea when Tika 1.4 is going to be released? 
(I'm still waiting for this fix).

 Upgrade Tika Parsers to use ASM 4.x
 ---

              Key: TIKA-1053
              URL: https://issues.apache.org/jira/browse/TIKA-1053
          Project: Tika
       Issue Type: Improvement
       Components: parser
 Affects Versions: 1.2
         Reporter: Vincent Massol
         Assignee: Michael McCandless
          Fix For: 1.4

      Attachments: TIKA-1053.patch


 Right now Tika 1.2 uses ASM 3.1. 
 However, this is causing some issues for us on the XWiki project, since we 
 also bundle other frameworks that use a more recent version of ASM (we use 
 pegdown, which uses parboiled, which draws in ASM 4.0).
 The problem is that ASM 3.x and 4.0 are not compatible...
 See http://jira.xwiki.org/browse/XE-1269 for more details about the issue 
 we're facing.
 Thanks for considering upgrading to ASM 4.x :)


[jira] [Updated] (TIKA-1109) Metadata not extracted before the content in OOXML (pptx)

2013-06-27 Thread Daniel Bonniot de Ruisselet (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Bonniot de Ruisselet updated TIKA-1109:
--

Summary: Metadata not extracted before the content in OOXML (pptx)  (was: 
Metadata not extracted before the context in OOXML (pptx))


[jira] [Commented] (TIKA-1109) Metadata not extracted before the content in OOXML (pptx)

2013-06-27 Thread Daniel Bonniot de Ruisselet (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13694659#comment-13694659
 ] 

Daniel Bonniot de Ruisselet commented on TIKA-1109:
---

I tried it. It broke two tests (same cause): as you mentioned, in excel the 
metadata for TikaMetadataKeys.PROTECTED is populated during parsing. I made a 
change in how that is implemented, and:

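{{[INFO] ------------------------------------------------------------------------}}
{{[INFO] Building Apache Tika 1.5-SNAPSHOT}}
{{[INFO] ------------------------------------------------------------------------}}
{{[INFO]}}
{{[INFO] --- maven-remote-resources-plugin:1.2.1:process (default) @ tika ---}}
{{[INFO] ------------------------------------------------------------------------}}
{{[INFO] Reactor Summary:}}
{{[INFO]}}
{{[INFO] Apache Tika parent ................................ SUCCESS [0.806s]}}
{{[INFO] Apache Tika core .................................. SUCCESS [8.418s]}}
{{[INFO] Apache Tika parsers ............................... SUCCESS [26.857s]}}
{{[INFO] Apache Tika XMP ................................... SUCCESS [0.789s]}}
{{[INFO] Apache Tika application ........................... SUCCESS [3.336s]}}
{{[INFO] Apache Tika OSGi bundle ........................... SUCCESS [1.204s]}}
{{[INFO] Apache Tika server ................................ SUCCESS [5.312s]}}
{{[INFO] Apache Tika ....................................... SUCCESS [0.014s]}}
{{[INFO] ------------------------------------------------------------------------}}
{{[INFO] BUILD SUCCESS}}
{{[INFO] ------------------------------------------------------------------------}}
{{[INFO] Total time: 47.498s}}
{{[INFO] Finished at: Thu Jun 27 14:10:50 CEST 2013}}
{{[INFO] Final Memory: 27M/1930M}}
{{[INFO] ------------------------------------------------------------------------}}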

{{dbonniot@naming:~/world/tika$ svn diff | diffstat}}
{{ main/java/org/apache/tika/parser/microsoft/ooxml/OOXMLExtractorFactory.java       |   11 -}}
{{ main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java |   36 ++}}
{{ test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java             |   56 ++}}
{{ 3 files changed, 74 insertions(+), 29 deletions(-)}}

{{dbonniot@naming:~/world/tika$ svn diff > /tmp/TIKA-1109.patch}}


The logic in OOXMLExtractorFactory is now simpler, since I could remove the 
extra shielding, which I suppose was made necessary by the previous ordering.

And the metadata for OOXML formats is now available at parse time, as tested by 
the added test to OOXMLParserTest :)
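
For readers who want to see what "available at parse time" means for a handler, here is a rough, JDK-only sketch. It is not the test from the patch: a plain Map stands in for Tika's Metadata object, the toy parse method stands in for a real parser, and the class name, method names and the "dc:title" key are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.xml.sax.helpers.DefaultHandler;

public class MetadataAwareHandler extends DefaultHandler {
    private final Map<String, String> metadata;
    private String titleSeenAtStart;

    MetadataAwareHandler(Map<String, String> metadata) {
        this.metadata = metadata;
    }

    @Override
    public void startDocument() {
        // Record what the handler can see at the moment content starts arriving.
        titleSeenAtStart = metadata.get("dc:title");
    }

    String titleSeenAtStart() {
        return titleSeenAtStart;
    }

    /**
     * Toy stand-in for a parser (hypothetical): fills the shared metadata map
     * either before or after the content events are sent to the handler.
     */
    static String parse(boolean metadataFirst) {
        Map<String, String> metadata = new LinkedHashMap<>();
        MetadataAwareHandler handler = new MetadataAwareHandler(metadata);
        if (metadataFirst) {
            metadata.put("dc:title", "Attachment Test");
        }
        handler.startDocument();
        if (!metadataFirst) {
            metadata.put("dc:title", "Attachment Test");
        }
        return handler.titleSeenAtStart();
    }

    public static void main(String[] args) {
        System.out.println("metadata first: " + parse(true));
        System.out.println("content first:  " + parse(false));
    }
}
```

With the old ordering the handler sees no title when the document starts; with the metadata populated first, it does.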



[jira] [Comment Edited] (TIKA-1109) Metadata not extracted before the content in OOXML (pptx)

2013-06-27 Thread Daniel Bonniot de Ruisselet (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13694659#comment-13694659
 ] 

Daniel Bonniot de Ruisselet edited comment on TIKA-1109 at 6/27/13 12:27 PM:
-


[jira] [Updated] (TIKA-1109) Metadata not extracted before the content in OOXML (pptx)

2013-06-27 Thread Daniel Bonniot de Ruisselet (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Bonniot de Ruisselet updated TIKA-1109:
--

Attachment: TIKA-1109.patch


[jira] [Resolved] (TIKA-1109) Metadata not extracted before the content in OOXML (pptx)

2013-06-27 Thread Nick Burch (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Burch resolved TIKA-1109.
--

Resolution: Fixed

Thanks for the patch, making life easier for downstream users AND tidying the 
code at the same time is always a win :)

Applied (with a slight comment tweak) in r1497332.

 Metadata not extracted before the content in OOXML (pptx)
 -

              Key: TIKA-1109
              URL: https://issues.apache.org/jira/browse/TIKA-1109
          Project: Tika
       Issue Type: Bug
       Components: parser
         Reporter: Daniel Bonniot de Ruisselet
         Priority: Critical
           Labels: patch
          Fix For: 1.5

      Attachments: TIKA-1109.patch



[jira] [Commented] (TIKA-1053) Upgrade Tika Parsers to use ASM 4.x

2013-06-27 Thread Vincent Massol (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13694677#comment-13694677
 ] 

Vincent Massol commented on TIKA-1053:
--

Cool, thanks Nick. 


[jira] [Commented] (TIKA-973) PDF form data isn't included in extracted content.

2013-06-27 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13694741#comment-13694741
 ] 

Nick Burch commented on TIKA-973:
-

Patch looks promising to me, but I don't know enough about PDF, so I've not 
been able to give it a thorough review.

Let's give it a few days before applying, to give others a chance to offer 
feedback.

One thing that might be good in the unit test is to check for data from each 
form in turn, so we cover more cases.

 PDF form data isn't included in extracted content.
 --

              Key: TIKA-973
              URL: https://issues.apache.org/jira/browse/TIKA-973
          Project: Tika
       Issue Type: Bug
       Components: general
 Affects Versions: 1.2
         Reporter: Michael Graessle
         Priority: Minor
      Attachments: TIKA-973-patch.tar.gz


 When extracting content from PDFs, PDF form data isn't extracted. 
 The following code extracts this data via PDF box, but it seems like 
 something Tika should be doing.
 PDDocumentCatalog docCatalog = load.getDocumentCatalog();
 if (docCatalog != null) {
   PDAcroForm acroForm = docCatalog.getAcroForm();
   if (acroForm != null) {
     @SuppressWarnings("unchecked")
     List<PDField> fields = acroForm.getFields();
     if (fields != null && fields.size() > 0) {
       documentContent.append(" ");
       for (PDField field : fields) {
         if (field.getValue() != null) {
           documentContent.append(field.getValue());
           documentContent.append(" ");
         }
       }
     }
   }
 }


[jira] [Commented] (TIKA-973) PDF form data isn't included in extracted content.

2013-06-27 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13694774#comment-13694774
 ] 

Tim Allison commented on TIKA-973:
--

Agree on both.  Also would appreciate feedback on what the output should be.  
The current code extracts this unseemly xhtml:

<div class="acroform">
 <ol>
  <li partialName="form1[0]" fullName="form1[0]"/>
  <ol>
   <li partialName="#subform[6]" fullName="form1[0].#subform[6]"/>
   <li partialName="MiddleInitial[0]" 
       fullName="form1[0].#subform[6].MiddleInitial[0]" 
       altName="Enter Middle Initial (MI)">X</li>
   <li partialName="FamilyName[0]" 
       fullName="form1[0].#subform[6].FamilyName[0]" 
       altName="Section 1. Employee Information and Attestation.  Family Name (Last Name)">Doe</li>
   <li partialName="GivenName[0]" 
       fullName="form1[0].#subform[6].GivenName[0]" 
       altName="Given Name (First Name)">John</li>
   <li partialName="OtherNamesUsed[0]" 
       fullName="form1[0].#subform[6].OtherNamesUsed[0]" 
       altName="Maiden Name">Mr. Doe</li>
   <li partialName="StreetNumberName[0]" 
       fullName="form1[0].#subform[6].StreetNumberName[0]" 
       altName=" Street Number and Name">123 Main St.</li>


...

Another idea I had was to include the partialName in the contents and not fill 
out the attrs:
<li>StreetNumberName[0]: 123 Main St</li>

More unit tests on way...


[jira] [Commented] (TIKA-973) PDF form data isn't included in extracted content.

2013-06-27 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13694782#comment-13694782
 ] 

Nick Burch commented on TIKA-973:
-

To make reviewing easier, it might be handy if you could upload a PNG 
screenshot of one of these forms, so it's quick to view that alongside 
suggested html

I'd be minded to go for something like:
  <li title="Street Number and Name">StreetNumberName[0]: 123 Main St</li>

So we'd have the alt name, the partial name, the value, but not the full name 
(but we would have the form/subform name elsewhere)
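
As a rough illustration of that shape, the following JDK-only sketch renders one field as such a list item. The class FormFieldHtml and its methods are hypothetical and are not taken from the attached patch; trimming the alt name and escaping by hand are assumptions made here, and real Tika code would emit SAX events through its XHTML handler instead of building strings.

```java
public class FormFieldHtml {
    /** Escapes the characters that are special in XML text and attribute values. */
    static String escape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    /**
     * Renders one form field: alt name as the title attribute, then
     * "partialName: value" as the text. The full name is left out on purpose.
     */
    static String toListItem(String altName, String partialName, String value) {
        // Assumption: surrounding whitespace in the alt name is not meaningful.
        return "<li title=\"" + escape(altName.trim()) + "\">"
                + escape(partialName) + ": " + escape(value) + "</li>";
    }

    public static void main(String[] args) {
        System.out.println(
                toListItem(" Street Number and Name", "StreetNumberName[0]", "123 Main St"));
    }
}
```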


[jira] [Updated] (TIKA-973) PDF form data isn't included in extracted content.

2013-06-27 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-973:
-

Attachment: i-9_screenshot.png

Screenshot attached.  Thanks again to: 
http://benlitchfield.sys-con.com/node/48543?page=0,1 for the code example and 
example doc.

The middle ground that you recommend makes sense.


