[jira] [Commented] (TIKA-1523) metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0
[ https://issues.apache.org/jira/browse/TIKA-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14283148#comment-14283148 ]

Uwe Schindler commented on TIKA-1523:

Hi, I did some research: this is a bug in Word 2000 (aka Word 9.0) that was fixed in later versions. The page count really is wrong at save time if you don't scroll to the end first. People complained about that back then, too, because it sometimes also caused the total page number in footnotes to be incorrect.

http://support.microsoft.com/kb/212653/en-us

See also: http://www.ms-office-forum.net/forum/archive/index.php?t-125861.html (German only, 1st comment):

{quote}
SSD 26.04.2004, 21:07
I import the page count from the properties of a Word file into Access. My problem is that the properties show the correct page count while the file is open in Word. Once the file is closed and I open its properties from the Open dialog, the page count is only correct after the file has been saved several times (at first it always says 1 page). What could be causing this, and how can I change it?
{quote}

And: https://groups.google.com/forum/#!topic/microsoft.public.word.vba.general/daf-sUpPlgs

You see, initially the page count is wrong. If you open a file with Word 2000 / 9.0 and save it without waiting until the full count has been calculated (computers were slower at that time), it saves 1. :-)

metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0
Key: TIKA-1523
URL: https://issues.apache.org/jira/browse/TIKA-1523
Project: Tika
Issue Type: Bug
Components: metadata
Affects Versions: 1.7
Environment: Ubuntu
Reporter: Yamileydis Veranes
Assignee: Konstantin Gribov
Attachments: Sigmund Freud.doc, screenshot-1.png, screenshot-2.png

When I extract the metadata from a 10-page Microsoft Word 9.0 document, the extractor reports that it has only 1 page.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
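Since the comment above concludes that the page count stored by Word 2000 / 9.0 can be stale, a caller that consumes extracted metadata might prefer to treat a suspicious value as unknown. The following is only a minimal sketch of that idea, not anything Tika or POI does: the class `PageCountSketch`, the method `trustedPageCount`, and the two metadata key names are assumptions made for illustration, and the metadata is modelled as a plain `Map` rather than a Tika object.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalInt;

class PageCountSketch {
    // Assumed metadata keys; check the names your extractor actually emits.
    static final String APPLICATION_KEY = "Application-Name";
    static final String PAGE_COUNT_KEY = "xmpTPg:NPages";

    /** Returns the stored page count, or empty when it is missing or looks stale. */
    static OptionalInt trustedPageCount(Map<String, String> metadata) {
        String raw = metadata.get(PAGE_COUNT_KEY);
        if (raw == null) {
            return OptionalInt.empty();
        }
        int pages;
        try {
            pages = Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            return OptionalInt.empty();
        }
        String application = metadata.getOrDefault(APPLICATION_KEY, "");
        // A stored 1 from Word 9.0 may only mean the file was saved
        // before the full count had been calculated.
        if (pages <= 1 && application.contains("Microsoft Word 9.0")) {
            return OptionalInt.empty();
        }
        return OptionalInt.of(pages);
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        metadata.put(APPLICATION_KEY, "Microsoft Word 9.0");
        metadata.put(PAGE_COUNT_KEY, "1");
        System.out.println(trustedPageCount(metadata)); // OptionalInt.empty
        metadata.put(PAGE_COUNT_KEY, "10");
        System.out.println(trustedPageCount(metadata)); // OptionalInt[10]
    }
}
```

The heuristic will also discard the count of a genuine one-page Word 9.0 document, so it only makes sense where "unknown" is cheaper than "wrong".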
[jira] [Comment Edited] (TIKA-1523) metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0
[ https://issues.apache.org/jira/browse/TIKA-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14283116#comment-14283116 ]

Uwe Schindler edited comment on TIKA-1523 at 1/19/15 10:50 PM:

Yes. It extracts just the metadata, using the COM interface of the Windows Quick View component (you don't even need Word installed for that). So I think this is an issue with this old version of Word. When you open the file in Word, it does of course show the real pages and it also recalculates the count, but initially it also shows 1. Here, though, the metadata as saved in the file is simply 1, or maybe nothing (see below). POI does not reflow the layout to calculate that information, which is why the metadata is only updated by the word processing program when it opens and edits the file. If you instruct Word 2010 to open the file read-only (which it does because it's downloaded from the internet), it shows in the page column. See 2nd screenshot. So this is clearly a bug in this file, not TIKA's or POI's issue.
[jira] [Updated] (TIKA-1523) metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0
[ https://issues.apache.org/jira/browse/TIKA-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated TIKA-1523: Attachment: screenshot-2.png
[jira] [Updated] (TIKA-1523) metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0
[ https://issues.apache.org/jira/browse/TIKA-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated TIKA-1523: Attachment: (was: screenshot-2.png)
[jira] [Commented] (TIKA-1523) metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0
[ https://issues.apache.org/jira/browse/TIKA-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14283102#comment-14283102 ] Konstantin Gribov commented on TIKA-1523: LibreOffice also shows 10 pages for this doc. I thought about filing a bug against POI, since {{org.apache.poi.hpsf.SummaryInformation#getPageCount()}} returns 1 for this file. Does this properties window use the native Word COM interface?
[jira] [Updated] (TIKA-1523) metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0
[ https://issues.apache.org/jira/browse/TIKA-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated TIKA-1523: Attachment: screenshot-2.png
[jira] [Commented] (TIKA-1523) metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0
[ https://issues.apache.org/jira/browse/TIKA-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14283116#comment-14283116 ]

Uwe Schindler commented on TIKA-1523:

Yes. It extracts just the metadata. So I think this is an issue with this old version of Word. When you open the file in Word, it does of course show the real pages and it also recalculates the count, but initially it also shows 1. Here, though, the metadata as saved in the file is simply 1, or maybe nothing (see below). POI does not reflow the layout to calculate that information, which is why the metadata is only updated by the word processing program when it opens and edits the file. If you instruct Word 2010 to open the file read-only (which it does because it's downloaded from the internet), it shows in the page column. See 2nd screenshot. So this is clearly a bug in this file, not TIKA's or POI's issue.
[jira] [Comment Edited] (TIKA-1523) metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0
[ https://issues.apache.org/jira/browse/TIKA-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14283148#comment-14283148 ]

Uwe Schindler edited comment on TIKA-1523 at 1/19/15 11:16 PM:

Hi, I did some research: this is a bug in Word 2000 (aka Word 9.0) that was fixed in later versions. The page count really is wrong at save time if you don't scroll to the end first. People complained about that back then, too, because it sometimes also caused the total page number in footnotes to be incorrect.

http://support.microsoft.com/kb/212653/en-us

See also: http://www.ms-office-forum.net/forum/archive/index.php?t-125861.html (German only, 1st comment):

{quote}
SSD 26.04.2004, 21:07
I import the page count from the properties of a Word file into Access. My problem is that the properties show the correct page count while the file is open in Word. Once the file is closed and I open its properties from the Open dialog, the page count is only correct after the file has been saved several times (at first it always says 1 page). What could be causing this, and how can I change it?
{quote}

And: https://groups.google.com/forum/#!topic/microsoft.public.word.vba.general/daf-sUpPlgs

{quote}
Anyone can help me with this? If I take out Sleep 1, myDoc.BuiltinDocumentProperties(wdPropertyPages) doesn't return the correct number of pages sometimes. For example, if a document has 200 pages, it may come out to return 140, or sometimes 199, instead of 200. To me, it seems it takes some time for MS word to think and get the number of pages. After i put Sleep 1, 99% I got the correct number of pages. However, this will take very long time to process as I need to read 200 to 300 files and the number of pages from each files. Please let me know if there is another better solution for this.
{quote}

You see, initially the page count is wrong. If you open a file with Word 2000 / 9.0 and save it without waiting until the full count has been calculated (computers were slower at that time), it saves 1. :-)
[jira] [Commented] (TIKA-1523) metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0
[ https://issues.apache.org/jira/browse/TIKA-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14283134#comment-14283134 ] Konstantin Gribov commented on TIKA-1523: [~thetaphi], thank you for digging into that. I'll close this issue.
[jira] [Resolved] (TIKA-1523) metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0
[ https://issues.apache.org/jira/browse/TIKA-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Gribov resolved TIKA-1523. Resolution: Won't Fix
[jira] [Assigned] (TIKA-1523) metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0
[ https://issues.apache.org/jira/browse/TIKA-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Gribov reassigned TIKA-1523: Assignee: Konstantin Gribov
[jira] [Updated] (TIKA-1523) metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0
[ https://issues.apache.org/jira/browse/TIKA-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated TIKA-1523: Attachment: screenshot-1.png
[jira] [Commented] (TIKA-1523) metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0
[ https://issues.apache.org/jira/browse/TIKA-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14283092#comment-14283092 ] Uwe Schindler commented on TIKA-1523: If I save the file with Office 2010, the page count is updated and shows correctly in right-click/Properties. TIKA also shows it.
Re: Tika Server docker image
On Mon, 19 Jan 2015, Konstantin Gribov wrote:
> There's no Apache docker registry (see INFRA-9035 and INFRA-8441). There's no docker hub integration with apache repos, as far as I know. So there's no way to create some official docker build currently.

Your best bet is probably to hop in the infra hipchat, find out what the approximate worries are, then start working on a proposed solution on either JIRA or infrastructure-dev@ as appropriate. From reading the two issues, I don't see it as being something infra will magically just fix in the next few weeks, in the absence of knowledgeable volunteers helping. (Contact details at https://www.apache.org/dev/infra-contact)

> Is an unofficial image with automated build a reasonable answer to TIKA-1518 since we can't provide official images yet?

I don't see why we couldn't hold the build tool/script in svn, but hosting the generated images would need to be external and somewhat unofficial for now. Working with Infra is our best bet for getting it all official and supported!

Nick
[jira] [Comment Edited] (TIKA-1511) Create a parser for SQLite3
[ https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14281086#comment-14281086 ]

Luis Filipe Nassif edited comment on TIKA-1511 at 1/19/15 12:01 PM:

If the inputStream (pseudoInputStream) received by EmbeddedDocExtractor cannot be read, I think using EDE is not useful. How will this approach work with the TikaCli --extract option? My original idea was to support a use case of extracting each table to one file...

Now I think this extraction of tables to files can be done by handling the db as one big doc and using a ContentHandlerDecorator that splits the xhtml output at table boundaries. Each xhtml segment can be converted to a byte[] (if small) and then to a ByteArrayInputStream that can be handled by an EmbeddedDocExtractor, if one is set in the parseContext. If none is set, the ContentHandlerDecorator does not need to split the xhtml output and can fall back to the default behavior. A custom EDE can then extract tables to files if desired.

So now I think the big doc approach is not bad. What do you think?

was (Author: lfcnassif):

If the inputStream (pseudoInputStream) received by EmbeddedDocExtractor cannot be read, I think using EDE is not useful. How will this approach work with the TikaCli --extract option? My original idea was to support a use case like TikaCli --extract...

Now I think this extraction of tables to files can be done by handling the db as one big doc and using a ContentHandlerDecorator that splits the xhtml output at table boundaries. Each xhtml segment can be converted to a byte[] (if small) and then to a ByteArrayInputStream that can be handled by an EmbeddedDocDecorator, if one is set in the parseContext. If none is set, the ContentHandlerDecorator does not need to split tables and can fall back to the default behavior. A custom EDE can then extract tables to files if desired.

So now I think we could go with the big doc approach. What do you think?
Create a parser for SQLite3 --- Key: TIKA-1511 URL: https://issues.apache.org/jira/browse/TIKA-1511 Project: Tika Issue Type: New Feature Components: parser Affects Versions: 1.6 Reporter: Luis Filipe Nassif Fix For: 1.8 Attachments: TIKA-1511v1.patch, TIKA-1511v2.patch, testSQLLite3b.db I think it would be very useful, as sqlite is used as data storage by a wide range of applications. Opening the ticket to track it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
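The "split the xhtml output at table boundaries" idea from the comment above can be sketched with the plain SAX API that ships with the JDK. This is not the TIKA-1511 patch and does not use Tika's ContentHandlerDecorator; `TableSplitSketch` and `split` are hypothetical names, and a Tika version would presumably override the same three callbacks on a decorator. Attributes and character escaping are left out to keep the sketch short.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Collects each top-level <table> element of an XHTML stream as its own segment. */
class TableSplitSketch extends DefaultHandler {
    private final List<String> segments = new ArrayList<>();
    private StringBuilder current; // non-null only while inside a table
    private int tableDepth;        // handles tables nested in tables

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("table".equals(qName) && tableDepth++ == 0) {
            current = new StringBuilder();
        }
        if (current != null) {
            current.append('<').append(qName).append('>');
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current != null) {
            current.append("</").append(qName).append('>');
        }
        if ("table".equals(qName) && --tableDepth == 0) {
            segments.add(current.toString()); // table boundary reached
            current = null;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    /** Parses the given XHTML and returns one string per top-level table. */
    static List<String> split(String xhtml) {
        TableSplitSketch handler = new TableSplitSketch();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
        return handler.segments;
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><table><tr><td>a</td></tr></table>"
                + "<p>ignored</p><table><tr><td>b</td></tr></table></body></html>";
        for (String segment : split(xhtml)) {
            System.out.println(segment);
        }
    }
}
```

Each returned segment could then be turned into a `ByteArrayInputStream` over its UTF-8 bytes and handed to whatever embedded-document extractor is configured, which is the hand-off the comment describes.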
[jira] [Comment Edited] (TIKA-1522) Exe being detected as application/x-msdownload
[ https://issues.apache.org/jira/browse/TIKA-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14282874#comment-14282874 ]

Nick Burch edited comment on TIKA-1522 at 1/19/15 7:08 PM:

For that file, I'm getting {{application/x-msdownload; format=pe32}} from Tika App, which is a subtype of {{application/x-msdownload;format=pe}}, which is a subtype of {{application/x-msdownload}}.

{{application/x-dosexec}} is the mimetype which has the {{.exe}} glob, and that's also a subtype of {{application/x-msdownload}}.

The problem is that a Windows 32 bit DLL will also be detected as {{application/x-msdownload; format=pe32}}, and that shouldn't be {{application/x-dosexec}}.

This might need some thought... If anyone has some time, it'd be interesting to know what other mime-based tools use for Windows DLLs and EXEs.

Exe being detected as application/x-msdownload
Key: TIKA-1522
URL: https://issues.apache.org/jira/browse/TIKA-1522
Project: Tika
Issue Type: Bug
Components: mime
Affects Versions: 1.7
Reporter: Luis Filipe Nassif
Assignee: Nick Burch
Priority: Minor
Attachments: Search.exe

If it is ok, *.exe must be included in application/x-msdownload glob pattern definitions. If it should be detected as application/x-dosexec, the hierarchy between application/x-dosexec, application/x-msdownload and PE based formats must be changed.
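To make the hierarchy from Nick's comment concrete, here is a small sketch that models it as a child-to-parent map and walks upward. The map is typed in by hand from the comment, not loaded from Tika's mime definitions, and `MimeHierarchySketch` and `isSpecializationOf` are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

class MimeHierarchySketch {
    // child -> parent, as described in the comment (hand-written, not read from Tika).
    static final Map<String, String> PARENT = new HashMap<>();
    static {
        PARENT.put("application/x-msdownload; format=pe32", "application/x-msdownload; format=pe");
        PARENT.put("application/x-msdownload; format=pe", "application/x-msdownload");
        PARENT.put("application/x-dosexec", "application/x-msdownload");
    }

    /** True if type equals ancestor or reaches it by following parent links. */
    static boolean isSpecializationOf(String type, String ancestor) {
        for (String t = type; t != null; t = PARENT.get(t)) {
            if (t.equals(ancestor)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String detected = "application/x-msdownload; format=pe32";
        System.out.println(isSpecializationOf(detected, "application/x-msdownload")); // true
        System.out.println(isSpecializationOf(detected, "application/x-dosexec"));    // false
    }
}
```

The second result is the crux of the issue: the PE types and {{application/x-dosexec}} are siblings under {{application/x-msdownload}}, so a file detected as pe32 never resolves to the type that carries the {{.exe}} glob. Simply moving the PE types under {{application/x-dosexec}} would mislabel DLLs, which is why the comment says it needs some thought.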
[jira] [Created] (TIKA-1522) Exe being detected as application/x-msdownload
Luis Filipe Nassif created TIKA-1522: Summary: Exe being detected as application/x-msdownload Key: TIKA-1522 URL: https://issues.apache.org/jira/browse/TIKA-1522 Project: Tika Issue Type: Bug Components: config Affects Versions: 1.7 Reporter: Luis Filipe Nassif Priority: Minor If it is ok, *.exe must be included in application/x-msdownload glob pattern definitions. If it should be detected as application/x-dosexec, the hierarchy between application/x-dosexec, application/x-msdownload and PE based formats must be changed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1522) Exe being detected as application/x-msdownload
[ https://issues.apache.org/jira/browse/TIKA-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch updated TIKA-1522: Component/s: (was: config) mime
[jira] [Assigned] (TIKA-1522) Exe being detected as application/x-msdownload
[ https://issues.apache.org/jira/browse/TIKA-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch reassigned TIKA-1522: Assignee: Nick Burch
[jira] [Updated] (TIKA-1522) Exe being detected as application/x-msdownload
[ https://issues.apache.org/jira/browse/TIKA-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luis Filipe Nassif updated TIKA-1522: Attachment: Search.exe Example PE file that triggers the issue.
[jira] [Commented] (TIKA-1522) Exe being detected as application/x-msdownload
[ https://issues.apache.org/jira/browse/TIKA-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14282874#comment-14282874 ]

Nick Burch commented on TIKA-1522:

For that file, I'm getting {{application/x-msdownload; format=pe32}} from Tika App, which is a subtype of {{application/x-msdownload;format=pe}}, which is a subtype of {{application/x-msdownload}}.

{{application/x-dosexec}} is the mimetype which has the {{.exe}} glob, and that's also a subtype of {{application/x-msdownload}}.

The problem is that a Windows 32 bit DLL will also be detected as {{application/x-msdownload; format=pe32}}, and that shouldn't be {{application/x-dosexec}}.

This might need some thought...
[jira] [Updated] (TIKA-1522) Exe being detected as application/x-msdownload
[ https://issues.apache.org/jira/browse/TIKA-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch updated TIKA-1522: Assignee: (was: Nick Burch)
[jira] [Created] (TIKA-1523) metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0
Yamileydis Veranes created TIKA-1523:

Summary: metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0
Key: TIKA-1523
URL: https://issues.apache.org/jira/browse/TIKA-1523
Project: Tika
Issue Type: Bug
Components: metadata
Affects Versions: 1.7
Reporter: Yamileydis Veranes

When I extract the metadata from a 10-page Microsoft Word 9.0 document, the extractor reports that it has only 1 page.
[jira] [Updated] (TIKA-1523) metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0
[ https://issues.apache.org/jira/browse/TIKA-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yamileydis Veranes updated TIKA-1523: Environment: Ubuntu
[jira] [Updated] (TIKA-1523) metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0
[ https://issues.apache.org/jira/browse/TIKA-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yamileydis Veranes updated TIKA-1523: Attachment: Sigmund Freud.doc The metadata extractor obtains an incorrect number of pages for this document.