[ 
https://issues.apache.org/jira/browse/TIKA-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Cole updated TIKA-1178:
-----------------------------

    Description: 
Currently docx to plain text is only accurate for single-page files. First off, 
the sectPr tag right above the closing body tag is not the overall document 
property; it is the section property of the last section (if there is only one 
section, then yes, it is the overall document property per se). Right now, if I 
had a large docx file (let's say a book in which I put each chapter into its 
own section), I would get the last chapter's header as the header at the 
beginning of the document.
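
A minimal sketch of the distinction above, assuming the raw word/document.xml 
content is available as a string. The class and method names (SectPrLocator, 
locate) are hypothetical and use only the JDK's DOM parser, not Tika's own 
code. It reports whether each sectPr sits inside a paragraph's pPr (ending a 
non-final section) or directly under body (describing the last section):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, for illustration only.
final class SectPrLocator {
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns "paragraph" or "body" for each w:sectPr, in document order. */
    static List<String> locate(String documentXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().parse(
                new ByteArrayInputStream(documentXml.getBytes(StandardCharsets.UTF_8)));
            NodeList all = doc.getElementsByTagNameNS(W_NS, "sectPr");
            List<String> kinds = new ArrayList<>();
            for (int i = 0; i < all.getLength(); i++) {
                Node parent = all.item(i).getParentNode();
                // A sectPr under w:pPr ends a section at that paragraph;
                // otherwise it is the body-level sectPr of the last section.
                kinds.add("pPr".equals(parent.getLocalName()) ? "paragraph" : "body");
            }
            return kinds;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse document.xml", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<w:document xmlns:w='" + W_NS + "'><w:body>"
            + "<w:p><w:pPr><w:sectPr/></w:pPr></w:p>"
            + "<w:p/>"
            + "<w:sectPr/>"
            + "</w:body></w:document>";
        System.out.println(locate(xml));
    }
}
```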

Addressing sectPr tags inside paragraphs:
why are we wrapping the paragraph with the header and footer?
We should be buffering up pages as we read the docx file until we hit a 
section property, at which point we decide how to wrap what we just consumed. 
I realize that it is difficult to determine page breaks when they are caused 
by overflow (not explicit page breaks). 
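
The buffering idea can be sketched roughly as follows. This is not Tika's 
implementation: SectionBuffer, Para, and render are hypothetical names, the 
header and footer are plain strings rather than resolved header/footer parts, 
and page breaks inside a section are ignored. Paragraphs are held back until a 
section property is seen, and only then wrapped with that section's header and 
footer:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model, for illustration only.
final class SectionBuffer {
    /**
     * One body paragraph. header/footer are non-null only when the paragraph
     * carries the sectPr that ends a section (a simplification: a section
     * with neither header nor footer is not modelled here).
     */
    static final class Para {
        final String text;
        final String header;
        final String footer;

        Para(String text, String header, String footer) {
            this.text = text;
            this.header = header;
            this.footer = footer;
        }

        boolean endsSection() {
            return header != null || footer != null;
        }
    }

    /** lastHeader/lastFooter stand in for the body-level sectPr of the last section. */
    static List<String> render(List<Para> body, String lastHeader, String lastFooter) {
        List<String> out = new ArrayList<>();
        List<String> pending = new ArrayList<>();
        for (Para p : body) {
            pending.add(p.text);
            if (p.endsSection()) {
                wrap(out, pending, p.header, p.footer);
            }
        }
        // Whatever is still buffered belongs to the last section.
        wrap(out, pending, lastHeader, lastFooter);
        return out;
    }

    private static void wrap(List<String> out, List<String> pending,
                             String header, String footer) {
        if (header != null) {
            out.add(header);
        }
        out.addAll(pending);
        if (footer != null) {
            out.add(footer);
        }
        pending.clear();
    }

    public static void main(String[] args) {
        List<Para> body = Arrays.asList(
            new Para("First paragraph on page 1", null, null),
            new Para("Second paragraph on page 1", "Header 1", "Footer 1"),
            new Para("First paragraph on page 2", null, null),
            new Para("Second paragraph on page 2", null, null));
        for (String line : render(body, "Header 2", "Footer 2")) {
            System.out.println(line);
        }
    }
}
```

With this ordering, each section's header comes before its own paragraphs 
instead of the last section's header being emitted first.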

The time for completion is really dependent on how much improvement we want to 
add in this area.

Just for reference, my assumptions on Office Open XML structure interpretation 
come from the documentation on this site: 
http://www.ecma-international.org/publications/standards/Ecma-376.htm

UPDATE:

sample code, test files, and output.

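    import java.io.FileInputStream;
    import java.io.InputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory;
    import org.apache.tika.sax.BodyContentHandler;
    import org.xml.sax.ContentHandler;

    // `test` is the path (or File) of the .docx under test.
    InputStream in = new FileInputStream(test);

    // Collect the extracted body text as a plain string.
    ContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();

    OOXMLExtractorFactory.parse(in, handler, metadata, new ParseContext());

    String text = handler.toString();
    System.out.println(text);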

Given a file with 3 pages, one section per page, and a default (odd) header 
and footer for each section. For reading convenience, the text listed below 
describes itself, i.e. "Header 1" means the first page's header text, etc.

Here is a sample file:

Header 1

First paragraph on page 1
Second paragraph on page 1

Footer 1

Header 2

First paragraph on page 2
Second paragraph on page 2

Footer 2

Header 3

First paragraph on page 3
Second paragraph on page 3

Footer 3


The output I get is:

Header 3
First paragraph on page 1
Header 1
Second paragraph on page 1
Footer 1
First paragraph on page 2
Header 2
Second paragraph on page 2
Footer 2
First paragraph on page 3
Second paragraph on page 3
Footer 3


Here is another file with only 1 section, using first-page, odd, and even 
headers and footers (3pages_1section_FirstEvenOddHeaderFooter_mod.docx):

First page header

First paragraph on page 1
Second paragraph on page 1

First page footer

Second page header (even)

First paragraph on page 2
Second paragraph on page 2

Second page footer (even)

Third page header (odd)

First paragraph on page 3
Second paragraph on page 3

Third page footer (odd)


Actual output:

First page header
Second page header (even)
Third page header (odd)
First paragraph on page 1
Second paragraph on page 1


First paragraph on page 2
Second paragraph on page 2


First paragraph on page 3
Second paragraph on page 3
First page footer
Second page footer (even)
Third page footer (odd)
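
For the single-section file, which header or footer belongs on a given page 
can be sketched as below, assuming page boundaries are already known (which, 
as noted, is the hard part when breaks come from overflow). HeaderPicker and 
pick are hypothetical names. The two flags stand for the titlePg element of 
the section and the document-level evenAndOddHeaders setting:

```java
// Hypothetical helper, for illustration only.
final class HeaderPicker {
    enum Kind { FIRST, EVEN, DEFAULT }

    /**
     * @param pageInSection 1-based page index within the current section
     * @param pageNumber    1-based page number used for odd/even parity
     * @param titlePg       true if the section uses a distinct first-page header
     * @param evenAndOdd    true if the document uses distinct even-page headers
     */
    static Kind pick(int pageInSection, int pageNumber,
                     boolean titlePg, boolean evenAndOdd) {
        if (titlePg && pageInSection == 1) {
            return Kind.FIRST;
        }
        if (evenAndOdd && pageNumber % 2 == 0) {
            return Kind.EVEN;
        }
        // The "default" header doubles as the odd-page header.
        return Kind.DEFAULT;
    }

    public static void main(String[] args) {
        for (int page = 1; page <= 3; page++) {
            System.out.println("page " + page + ": " + pick(page, page, true, true));
        }
    }
}
```

For the 3-page file above this selects the first-page, even, and default (odd) 
header for pages 1, 2, and 3 respectively.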


> Improve docx multiple section handling - headers and footers
> ------------------------------------------------------------
>
>                 Key: TIKA-1178
>                 URL: https://issues.apache.org/jira/browse/TIKA-1178
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: David Cole
>            Priority: Minor
>              Labels: docx, parsing, sectPr
>   Original Estimate: 4h
>  Remaining Estimate: 4h



--
This message was sent by Atlassian JIRA
(v6.1#6144)
