[
https://issues.apache.org/jira/browse/TIKA-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14283116#comment-14283116
]
Uwe Schindler commented on TIKA-1523:
-------------------------------------
Yes. It extracts just the metadata. So I think this is an issue with this old
version of Word.
In fact, when you open the file in Word, it of course shows the real pages and
recalculates the count, but initially it also shows "1". Here, the metadata as
saved in the file is simply "1" or maybe nothing (see below). POI does not
"reflow" the layout to calculate that information.
This is why the metadata is only updated by the word processing program on
opening and editing the file. If you instruct Word 2010 to open the file "read
only" (which it does because it's downloaded from the internet), it shows "" in
the page column. See the 2nd screenshot. So this is clearly a bug in this file,
not TIKA's or POI's issue.
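To illustrate the point, here is a minimal sketch in plain Java of what "just
extracting the metadata" amounts to. It is not Tika or POI code: the class
StoredPageCount, the method read, and the key "Page-Count" are hypothetical
names chosen for illustration. The saved properties are modeled as a simple
map, and the reader returns whatever was stored without looking at the
document body.

```java
import java.util.Map;
import java.util.OptionalInt;

// Hypothetical model for illustration only; not the Tika or POI API.
final class StoredPageCount {

    // Assumed property name; the real key depends on the extractor in use.
    static final String KEY = "Page-Count";

    // Returns the page count exactly as saved in the file's properties.
    // No layout or "reflow" happens here, so a stale value stays stale.
    static OptionalInt read(Map<String, String> savedProperties) {
        String raw = savedProperties.get(KEY);
        if (raw == null || raw.trim().isEmpty()) {
            return OptionalInt.empty(); // property missing or blank
        }
        try {
            return OptionalInt.of(Integer.parseInt(raw.trim()));
        } catch (NumberFormatException e) {
            return OptionalInt.empty(); // property present but not a number
        }
    }
}
```

With a map containing "Page-Count" -> "1", read returns 1 no matter how many
pages the text would fill when laid out. With an empty map it returns an empty
result. Those are the two cases described above: the saved value is "1" or
nothing at all.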
> metadata extractor gets the wrong number of pages of some documents Microsoft
> Word 9.0
> --------------------------------------------------------------------------------------
>
> Key: TIKA-1523
> URL: https://issues.apache.org/jira/browse/TIKA-1523
> Project: Tika
> Issue Type: Bug
> Components: metadata
> Affects Versions: 1.7
> Environment: Ubuntu
> Reporter: Yamileydis Veranes
> Assignee: Konstantin Gribov
> Attachments: Sigmund Freud.doc, screenshot-1.png
>
>
> When I extract the metadata from a Microsoft Word 9.0 document which has 10
> pages extractor gives me the result that only has 1 page.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)