[jira] [Commented] (TIKA-1523) metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0

2015-01-19 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14283148#comment-14283148
 ] 

Uwe Schindler commented on TIKA-1523:
-

Hi, I did some research:
This is a bug in Word 2000 (aka Word 9.0) that was fixed in later versions.
The page count is indeed wrong initially on saving if you don't scroll to the
end. People were complaining about it at the time, too, because it sometimes
caused the total page number in footnotes to be incorrect as well.

http://support.microsoft.com/kb/212653/en-us

See also: http://www.ms-office-forum.net/forum/archive/index.php?t-125861.html 
(German only, 1st comment):

{quote}
SSD 26.04.2004, 21:07 (translated from German)
I import the page count from the properties of a Word file into Access.
My problem is this: while the file is open in Word, the properties show the
correct page count. Once the file is closed and I look at its properties in
the Open dialog, the page count is only correct after saving the file several
times (at first it always says 1 page). What could be causing this, and how
can I change it?
{quote}

And: 
https://groups.google.com/forum/#!topic/microsoft.public.word.vba.general/daf-sUpPlgs

You see, initially the page count is wrong. If you open a file with Word 2000 / 
9.0 and save it without waiting until the full count was calculated (computers 
were slower at that time), it saved 1. :-)
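
For anyone consuming this metadata downstream, one possible workaround is to treat the stored value as unreliable in exactly this situation. The following is a minimal sketch, not Tika or POI code: the class name, the helper `isSuspect`, and the metadata key names `Application-Name` and `Page-Count` are assumptions made for illustration, so check which keys your Tika version actually emits.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative helper only; not part of Tika or POI.
class PageCountCheck {
    // Assumed metadata key names (hypothetical, verify against your output).
    static final String APP_KEY = "Application-Name";
    static final String PAGES_KEY = "Page-Count";

    /** Returns true when the stored page count should not be trusted. */
    static boolean isSuspect(Map<String, String> metadata) {
        String app = metadata.getOrDefault(APP_KEY, "");
        String pages = metadata.get(PAGES_KEY);
        if (pages == null) {
            return true; // nothing stored at all
        }
        // Word 2000 (9.0) could save a count of 1 before pagination finished.
        return app.startsWith("Microsoft Word 9.") && pages.trim().equals("1");
    }

    public static void main(String[] args) {
        Map<String, String> meta = new HashMap<>();
        meta.put(APP_KEY, "Microsoft Word 9.0");
        meta.put(PAGES_KEY, "1");
        System.out.println("suspect page count: " + isSuspect(meta));
    }
}
```

A count of 1 may of course be correct for a genuinely short document, so a flag like this can only mark the value for re-checking (for example by re-saving the file in a newer Word); it cannot supply the real number.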

 metadata extractor gets the wrong number of pages of some documents Microsoft 
 Word 9.0
 --

 Key: TIKA-1523
 URL: https://issues.apache.org/jira/browse/TIKA-1523
 Project: Tika
  Issue Type: Bug
  Components: metadata
Affects Versions: 1.7
 Environment: Ubuntu
Reporter: Yamileydis Veranes
Assignee: Konstantin Gribov
 Attachments: Sigmund Freud.doc, screenshot-1.png, screenshot-2.png


 When I extract the metadata from a Microsoft Word 9.0 document which has 10 
 pages extractor gives me the result that only has 1 page.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (TIKA-1523) metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0

2015-01-19 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14283116#comment-14283116
 ] 

Uwe Schindler edited comment on TIKA-1523 at 1/19/15 10:50 PM:
---

Yes. It extracts just the metadata, using the COM interface of the Windows
quick view component (you don't even need Word installed for that). So I think
this is an issue with this old version of Word.

In fact when you open the file in Word, it of course shows the real pages and 
it also recalculates the count, but initially it also shows 1. But here, the 
metadata as saved in the file is simply 1 or maybe nothing (see below). POI 
does not reflow the layout to calculate that information.

This is why the metadata is only updated by the word processing program on 
opening and editing the file. If you instruct Word 2010 to open the file 
read-only (which it does because it's downloaded from the internet), it shows  in 
the page column. See the 2nd screenshot. So this is clearly a bug in this file, 
not TIKA's or POI's issue.


was (Author: thetaphi):
Yes. It extracts just the metadata. So I think this is an issue with this old 
version of Word.

In fact when you open the file in Word, it of course shows the real pages and 
it also recalculates the count, but initially it also shows 1. But here, the 
metadata as saved in the file is simply 1 or maybe nothing (see below). POI 
does not reflow the layout to calculate that information.

This is why the metadata is only updated by the word processing program on 
opening and editing the file. If you instruct Word 2010 to open the file 
read-only (which it does because it's downloaded from the internet), it shows  in 
the page column. See the 2nd screenshot. So this is clearly a bug in this file, 
not TIKA's or POI's issue.

 metadata extractor gets the wrong number of pages of some documents Microsoft 
 Word 9.0
 --

 Key: TIKA-1523
 URL: https://issues.apache.org/jira/browse/TIKA-1523
 Project: Tika
  Issue Type: Bug
  Components: metadata
Affects Versions: 1.7
 Environment: Ubuntu
Reporter: Yamileydis Veranes
Assignee: Konstantin Gribov
 Attachments: Sigmund Freud.doc, screenshot-1.png, screenshot-2.png


 When I extract the metadata from a Microsoft Word 9.0 document which has 10 
 pages extractor gives me the result that only has 1 page.





[jira] [Updated] (TIKA-1523) metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0

2015-01-19 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated TIKA-1523:

Attachment: screenshot-2.png

 metadata extractor gets the wrong number of pages of some documents Microsoft 
 Word 9.0
 --

 Key: TIKA-1523
 URL: https://issues.apache.org/jira/browse/TIKA-1523
 Project: Tika
  Issue Type: Bug
  Components: metadata
Affects Versions: 1.7
 Environment: Ubuntu
Reporter: Yamileydis Veranes
Assignee: Konstantin Gribov
 Attachments: Sigmund Freud.doc, screenshot-1.png, screenshot-2.png


 When I extract the metadata from a Microsoft Word 9.0 document which has 10 
 pages extractor gives me the result that only has 1 page.





[jira] [Updated] (TIKA-1523) metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0

2015-01-19 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated TIKA-1523:

Attachment: (was: screenshot-2.png)

 metadata extractor gets the wrong number of pages of some documents Microsoft 
 Word 9.0
 --

 Key: TIKA-1523
 URL: https://issues.apache.org/jira/browse/TIKA-1523
 Project: Tika
  Issue Type: Bug
  Components: metadata
Affects Versions: 1.7
 Environment: Ubuntu
Reporter: Yamileydis Veranes
Assignee: Konstantin Gribov
 Attachments: Sigmund Freud.doc, screenshot-1.png, screenshot-2.png


 When I extract the metadata from a Microsoft Word 9.0 document which has 10 
 pages extractor gives me the result that only has 1 page.





[jira] [Commented] (TIKA-1523) metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0

2015-01-19 Thread Konstantin Gribov (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14283102#comment-14283102
 ] 

Konstantin Gribov commented on TIKA-1523:
-

And LibreOffice shows 10 pages for this doc, as far as I can see. I thought
about filing a bug against POI, since
{{org.apache.poi.hpsf.SummaryInformation#getPageCount()}} returns 1 for this
file.

Does this properties window use the native Word COM interface?

 metadata extractor gets the wrong number of pages of some documents Microsoft 
 Word 9.0
 --

 Key: TIKA-1523
 URL: https://issues.apache.org/jira/browse/TIKA-1523
 Project: Tika
  Issue Type: Bug
  Components: metadata
Affects Versions: 1.7
 Environment: Ubuntu
Reporter: Yamileydis Veranes
Assignee: Konstantin Gribov
 Attachments: Sigmund Freud.doc, screenshot-1.png


 When I extract the metadata from a Microsoft Word 9.0 document which has 10 
 pages extractor gives me the result that only has 1 page.





[jira] [Updated] (TIKA-1523) metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0

2015-01-19 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated TIKA-1523:

Attachment: screenshot-2.png

 metadata extractor gets the wrong number of pages of some documents Microsoft 
 Word 9.0
 --

 Key: TIKA-1523
 URL: https://issues.apache.org/jira/browse/TIKA-1523
 Project: Tika
  Issue Type: Bug
  Components: metadata
Affects Versions: 1.7
 Environment: Ubuntu
Reporter: Yamileydis Veranes
Assignee: Konstantin Gribov
 Attachments: Sigmund Freud.doc, screenshot-1.png, screenshot-2.png


 When I extract the metadata from a Microsoft Word 9.0 document which has 10 
 pages extractor gives me the result that only has 1 page.





[jira] [Commented] (TIKA-1523) metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0

2015-01-19 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14283116#comment-14283116
 ] 

Uwe Schindler commented on TIKA-1523:
-

Yes. It extracts just the metadata. So I think this is an issue with this old 
version of Word.

In fact when you open the file in Word, it of course shows the real pages and 
it also recalculates the count, but initially it also shows 1. But here, the 
metadata as saved in the file is simply 1 or maybe nothing (see below). POI 
does not reflow the layout to calculate that information.

This is why the metadata is only updated by the word processing program on 
opening and editing the file. If you instruct Word 2010 to open the file 
read-only (which it does because it's downloaded from the internet), it shows  in 
the page column. See the 2nd screenshot. So this is clearly a bug in this file, 
not TIKA's or POI's issue.

 metadata extractor gets the wrong number of pages of some documents Microsoft 
 Word 9.0
 --

 Key: TIKA-1523
 URL: https://issues.apache.org/jira/browse/TIKA-1523
 Project: Tika
  Issue Type: Bug
  Components: metadata
Affects Versions: 1.7
 Environment: Ubuntu
Reporter: Yamileydis Veranes
Assignee: Konstantin Gribov
 Attachments: Sigmund Freud.doc, screenshot-1.png


 When I extract the metadata from a Microsoft Word 9.0 document which has 10 
 pages extractor gives me the result that only has 1 page.





[jira] [Comment Edited] (TIKA-1523) metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0

2015-01-19 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14283148#comment-14283148
 ] 

Uwe Schindler edited comment on TIKA-1523 at 1/19/15 11:16 PM:
---

Hi, I did some research:
This is a bug in Word 2000 (aka Word 9.0) that was fixed in later versions.
The page count is indeed wrong initially on saving if you don't scroll to the
end. People were complaining about it at the time, too, because it sometimes
caused the total page number in footnotes to be incorrect as well.

http://support.microsoft.com/kb/212653/en-us

See also: http://www.ms-office-forum.net/forum/archive/index.php?t-125861.html 
(German only, 1st comment):

{quote}
SSD 26.04.2004, 21:07 (translated from German)
I import the page count from the properties of a Word file into Access.
My problem is this: while the file is open in Word, the properties show the
correct page count. Once the file is closed and I look at its properties in
the Open dialog, the page count is only correct after saving the file several
times (at first it always says 1 page). What could be causing this, and how
can I change it?
{quote}

And: 
https://groups.google.com/forum/#!topic/microsoft.public.word.vba.general/daf-sUpPlgs

{quote}
Anyone can help me with this? If I take out Sleep 1,
myDoc.BuiltinDocumentProperties(wdPropertyPages) doesnt return the correct
number of pages sometimes. For example, if a document has 200 pages, it may
come out to return 140, or sometimes 199, instead of 200. To me, it seems it
takes some time for MS word to think and get the number of pages. After i
put Sleep 1, 99% I got the correct number of pages. However, this will
take very long time to process as I need to read 200 to 300 files and the
number of pages from each files. Please let me know if there is another
better solution for this.
{quote}

You see, initially the page count is wrong. If you open a file with Word 2000 / 
9.0 and save it without waiting until the full count was calculated (computers 
were slower at that time), it saved 1. :-)


was (Author: thetaphi):
Hi, I did some research:
This is a bug in Word 2000 (aka Word 9.0) that was fixed in later versions.
The page count is indeed wrong initially on saving if you don't scroll to the
end. People were complaining about it at the time, too, because it sometimes
caused the total page number in footnotes to be incorrect as well.

http://support.microsoft.com/kb/212653/en-us

See also: http://www.ms-office-forum.net/forum/archive/index.php?t-125861.html 
(German only, 1st comment):

{quote}
SSD 26.04.2004, 21:07 (translated from German)
I import the page count from the properties of a Word file into Access.
My problem is this: while the file is open in Word, the properties show the
correct page count. Once the file is closed and I look at its properties in
the Open dialog, the page count is only correct after saving the file several
times (at first it always says 1 page). What could be causing this, and how
can I change it?
{quote}

And: 
https://groups.google.com/forum/#!topic/microsoft.public.word.vba.general/daf-sUpPlgs

You see, initially the page count is wrong. If you open a file with Word 2000 / 
9.0 and safe it without waiting until the full count was calculated (computers 
were slower at that time), it saved 1. :-)

 metadata extractor gets the wrong number of pages of some documents Microsoft 
 Word 9.0
 --

 Key: TIKA-1523
 URL: https://issues.apache.org/jira/browse/TIKA-1523
 Project: Tika
  Issue Type: Bug
  Components: metadata
Affects Versions: 1.7
 Environment: Ubuntu
Reporter: Yamileydis Veranes
Assignee: Konstantin Gribov
 Attachments: Sigmund Freud.doc, screenshot-1.png, screenshot-2.png


 When I extract the metadata from a Microsoft Word 9.0 document which has 10 
 pages extractor gives me the result that only has 1 page.





[jira] [Commented] (TIKA-1523) metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0

2015-01-19 Thread Konstantin Gribov (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14283134#comment-14283134
 ] 

Konstantin Gribov commented on TIKA-1523:
-

[~thetaphi], thank you for digging into that. I'll close this issue.

 metadata extractor gets the wrong number of pages of some documents Microsoft 
 Word 9.0
 --

 Key: TIKA-1523
 URL: https://issues.apache.org/jira/browse/TIKA-1523
 Project: Tika
  Issue Type: Bug
  Components: metadata
Affects Versions: 1.7
 Environment: Ubuntu
Reporter: Yamileydis Veranes
Assignee: Konstantin Gribov
 Attachments: Sigmund Freud.doc, screenshot-1.png, screenshot-2.png


 When I extract the metadata from a Microsoft Word 9.0 document which has 10 
 pages extractor gives me the result that only has 1 page.





[jira] [Resolved] (TIKA-1523) metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0

2015-01-19 Thread Konstantin Gribov (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Gribov resolved TIKA-1523.
-
Resolution: Won't Fix

 metadata extractor gets the wrong number of pages of some documents Microsoft 
 Word 9.0
 --

 Key: TIKA-1523
 URL: https://issues.apache.org/jira/browse/TIKA-1523
 Project: Tika
  Issue Type: Bug
  Components: metadata
Affects Versions: 1.7
 Environment: Ubuntu
Reporter: Yamileydis Veranes
Assignee: Konstantin Gribov
 Attachments: Sigmund Freud.doc, screenshot-1.png, screenshot-2.png


 When I extract the metadata from a Microsoft Word 9.0 document which has 10 
 pages extractor gives me the result that only has 1 page.





[jira] [Assigned] (TIKA-1523) metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0

2015-01-19 Thread Konstantin Gribov (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Gribov reassigned TIKA-1523:
---

Assignee: Konstantin Gribov

 metadata extractor gets the wrong number of pages of some documents Microsoft 
 Word 9.0
 --

 Key: TIKA-1523
 URL: https://issues.apache.org/jira/browse/TIKA-1523
 Project: Tika
  Issue Type: Bug
  Components: metadata
Affects Versions: 1.7
 Environment: Ubuntu
Reporter: Yamileydis Veranes
Assignee: Konstantin Gribov
 Attachments: Sigmund Freud.doc


 When I extract the metadata from a Microsoft Word 9.0 document which has 10 
 pages extractor gives me the result that only has 1 page.





[jira] [Updated] (TIKA-1523) metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0

2015-01-19 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated TIKA-1523:

Attachment: screenshot-1.png

 metadata extractor gets the wrong number of pages of some documents Microsoft 
 Word 9.0
 --

 Key: TIKA-1523
 URL: https://issues.apache.org/jira/browse/TIKA-1523
 Project: Tika
  Issue Type: Bug
  Components: metadata
Affects Versions: 1.7
 Environment: Ubuntu
Reporter: Yamileydis Veranes
Assignee: Konstantin Gribov
 Attachments: Sigmund Freud.doc, screenshot-1.png


 When I extract the metadata from a Microsoft Word 9.0 document which has 10 
 pages extractor gives me the result that only has 1 page.





[jira] [Commented] (TIKA-1523) metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0

2015-01-19 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14283092#comment-14283092
 ] 

Uwe Schindler commented on TIKA-1523:
-

If I save the file with Office 2010, the page count is updated and shown 
correctly in right-click/Properties. TIKA also shows it.

 metadata extractor gets the wrong number of pages of some documents Microsoft 
 Word 9.0
 --

 Key: TIKA-1523
 URL: https://issues.apache.org/jira/browse/TIKA-1523
 Project: Tika
  Issue Type: Bug
  Components: metadata
Affects Versions: 1.7
 Environment: Ubuntu
Reporter: Yamileydis Veranes
Assignee: Konstantin Gribov
 Attachments: Sigmund Freud.doc, screenshot-1.png


 When I extract the metadata from a Microsoft Word 9.0 document which has 10 
 pages extractor gives me the result that only has 1 page.





Re: Tika Server docker image

2015-01-19 Thread Nick Burch

On Mon, 19 Jan 2015, Konstantin Gribov wrote:
> There's no Apache docker registry (see INFRA-9035 and INFRA-8441).
> There's no docker hub integration with apache repos, as far as I know.
> So there's no way to create some official docker build currently.


Your best bet is probably to hop in the infra hipchat, find out what the 
approximate worries are, then start working on a proposed solution on 
either JIRA or infrastructure-dev@ as appropriate. From reading the two 
issues, I don't see it as being something infra will magically just fix in 
the next few weeks, in the absence of knowledgeable volunteers helping.


(Contact details at https://www.apache.org/dev/infra-contact)

> Is an unofficial image with an automated build a reasonable answer to
> TIKA-1518, since we can't provide official images yet?


I don't see why we couldn't hold the build tool/script in svn, but hosting 
the generated images would need to be external and somewhat unofficial for 
now. Working with Infra is our best bet for getting it all official and 
supported!


Nick


[jira] [Comment Edited] (TIKA-1511) Create a parser for SQLite3

2015-01-19 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14281086#comment-14281086
 ] 

Luis Filipe Nassif edited comment on TIKA-1511 at 1/19/15 12:01 PM:


If the inputStream (pseudoInputStream) received by EmbeddedDocExtractor cannot 
be read, I think using EDE is not useful. How will this approach work with the 
TikaCli --extract option? My original idea was to support a use case of 
extracting each table to one file...

Now I think this extraction of tables to files can be done by handling the db 
as one big doc and using a ContentHandlerDecorator that splits the xhtml 
output at table boundaries. Each xhtml segment can be converted to a byte[] (if 
small) and then to a ByteArrayInputStream that can be handled by an 
EmbeddedDocExtractor, if one is set in the parseContext. If none is set, the 
ContentHandlerDecorator does not need to split the xhtml output and can fall 
back to the default behavior. A custom EDE can then extract tables to files if 
desired.

So now I think the big doc approach is not bad. What do you think?
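
To make the splitting idea concrete, here is a rough sketch using only the JDK's SAX classes. It is not the proposed Tika patch: `TableSplitter` and its `split` method are hypothetical names, a plain `DefaultHandler` stands in for a ContentHandlerDecorator, and element attributes are dropped for brevity. It collects each top-level table of the xhtml output as a separate byte[].

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for a ContentHandlerDecorator: collects each
// top-level <table>...</table> of an XHTML stream as its own byte[].
class TableSplitter extends DefaultHandler {
    private final List<byte[]> segments = new ArrayList<>();
    private final StringBuilder current = new StringBuilder();
    private int tableDepth = 0; // > 0 while inside a table (covers nesting)

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("table".equals(qName)) {
            tableDepth++;
        }
        // Attributes are intentionally dropped to keep the sketch short.
        if (tableDepth > 0) {
            current.append('<').append(qName).append('>');
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (tableDepth > 0) {
            current.append(escape(new String(ch, start, length)));
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (tableDepth == 0) {
            return;
        }
        current.append("</").append(qName).append('>');
        if ("table".equals(qName) && --tableDepth == 0) {
            // Table boundary reached: emit the segment and start over.
            segments.add(current.toString().getBytes(StandardCharsets.UTF_8));
            current.setLength(0);
        }
    }

    // The parser hands over decoded text, so re-escape markup characters.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    static List<byte[]> split(String xhtml) {
        TableSplitter handler = new TableSplitter();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XHTML", e);
        }
        return handler.segments;
    }

    public static void main(String[] args) {
        String doc = "<html><body><table><tr><td>a</td></tr></table>"
                   + "<table><tr><td>b</td></tr></table></body></html>";
        for (byte[] segment : split(doc)) {
            // Each segment could be wrapped in a ByteArrayInputStream and
            // handed to an embedded-document extractor.
            System.out.println(new String(segment, StandardCharsets.UTF_8));
        }
    }
}
```

Buffering a whole table in memory only suits small tables, which matches the "if small" caveat above; large tables would need a streaming hand-off instead.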


was (Author: lfcnassif):
If the inputStream (pseudoInputStream) received by EmbeddedDocExtractor cannot 
be read, I think using EDE is not useful. How will this approach work with the 
TikaCli --extract option? My original idea was to support a use case like 
TikaCli --extract...

Now I think this extraction of tables to files can be done by handling the db 
as one big doc and using a ContentHandlerDecorator that splits the xhtml 
output at table boundaries. Each xhtml segment can be converted to a byte[] (if 
small) and then to a ByteArrayInputStream that can be handled by an 
EmbeddedDocDecorator, if one is set in the parseContext. If none is set, the 
ContentHandlerDecorator does not need to split tables and can fall back to the 
default behavior. A custom EDE can then extract tables to files if desired.

So now I think we could go with the big doc approach. What do you think?

 Create a parser for SQLite3
 ---

 Key: TIKA-1511
 URL: https://issues.apache.org/jira/browse/TIKA-1511
 Project: Tika
  Issue Type: New Feature
  Components: parser
Affects Versions: 1.6
Reporter: Luis Filipe Nassif
 Fix For: 1.8

 Attachments: TIKA-1511v1.patch, TIKA-1511v2.patch, testSQLLite3b.db


 I think it would be very useful, as sqlite is used as data storage by a wide 
 range of applications. Opening the ticket to track it. 





[jira] [Comment Edited] (TIKA-1522) Exe being detected as application/x-msdownload

2015-01-19 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14282874#comment-14282874
 ] 

Nick Burch edited comment on TIKA-1522 at 1/19/15 7:08 PM:
---

For that file, I'm getting {{application/x-msdownload; format=pe32}} from Tika 
App, which is a subtype of {{application/x-msdownload;format=pe}} which is a 
subtype of {{application/x-msdownload}}

{{application/x-dosexec}} is the mimetype which has the {{.exe}} glob, and 
that's also a subtype of {{application/x-msdownload}}

The problem is that a windows 32 bit DLL will also be detected as 
{{application/x-msdownload; format=pe32}}, and that shouldn't be 
{{application/x-dosexec}}

This might need some thought...

If anyone has some time, it'd be interesting to know what other mime-based 
tools use for windows DLLs and EXEs
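
The hierarchy described above can be modelled in a few lines to show why the detected type never resolves to {{application/x-dosexec}}. This is a toy model under stated assumptions, not Tika's MimeTypes registry: the class `MimeHierarchy` and its methods are hypothetical, and only the four types named in this comment are registered.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the subtype chain described in the comment; not Tika's API.
class MimeHierarchy {
    static final String MSDOWNLOAD = "application/x-msdownload";
    static final String PE = "application/x-msdownload;format=pe";
    static final String PE32 = "application/x-msdownload; format=pe32";
    static final String DOSEXEC = "application/x-dosexec";

    // Maps each type to its direct supertype (assumed to be acyclic).
    private final Map<String, String> superTypeOf = new HashMap<>();

    void addSubtype(String type, String superType) {
        superTypeOf.put(type, superType);
    }

    /** True if type equals ancestor or has it somewhere up its supertype chain. */
    boolean isSpecializationOf(String type, String ancestor) {
        for (String t = type; t != null; t = superTypeOf.get(t)) {
            if (t.equals(ancestor)) {
                return true;
            }
        }
        return false;
    }

    /** Builds the hierarchy exactly as laid out in the comment above. */
    static MimeHierarchy fromComment() {
        MimeHierarchy h = new MimeHierarchy();
        h.addSubtype(PE32, PE);
        h.addSubtype(PE, MSDOWNLOAD);
        h.addSubtype(DOSEXEC, MSDOWNLOAD);
        return h;
    }

    public static void main(String[] args) {
        MimeHierarchy h = fromComment();
        System.out.println("pe32 is-a x-msdownload: " + h.isSpecializationOf(PE32, MSDOWNLOAD));
        System.out.println("pe32 is-a x-dosexec:    " + h.isSpecializationOf(PE32, DOSEXEC));
    }
}
```

In this model the PE types and {{application/x-dosexec}} are siblings under {{application/x-msdownload}}, so the {{.exe}} glob and the magic-based detection land on different branches. Moving the PE types under {{application/x-dosexec}} would fix EXEs but mislabel DLLs, which is the dilemma noted above.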


was (Author: gagravarr):
For that file, I'm getting {{{application/x-msdownload; format=pe32}}} from 
Tika App, which is a subtype of {{{application/x-msdownload;format=pe}}} which 
is a subtype of {{{application/x-msdownload}}}

{{{application/x-dosexec}}} is the mimetype which has the {{{.exe}}} glob, and 
that's also a subtype of {{{application/x-msdownload}}}

The problem is that a windows 32 bit DLL will also be detected as 
{{{application/x-msdownload; format=pe32}}}, and that shouldn't be 
{{{application/x-dosexec}}}

This might need some thought...

 Exe being detected as application/x-msdownload
 --

 Key: TIKA-1522
 URL: https://issues.apache.org/jira/browse/TIKA-1522
 Project: Tika
  Issue Type: Bug
  Components: mime
Affects Versions: 1.7
Reporter: Luis Filipe Nassif
Assignee: Nick Burch
Priority: Minor
 Attachments: Search.exe


 If it is ok, *.exe must be included in application/x-msdownload glob pattern 
 definitions. If it should be detected as application/x-dosexec, the hierarchy 
 between application/x-dosexec, application/x-msdownload and PE based formats 
 must be changed.





[jira] [Created] (TIKA-1522) Exe being detected as application/x-msdownload

2015-01-19 Thread Luis Filipe Nassif (JIRA)
Luis Filipe Nassif created TIKA-1522:


 Summary: Exe being detected as application/x-msdownload
 Key: TIKA-1522
 URL: https://issues.apache.org/jira/browse/TIKA-1522
 Project: Tika
  Issue Type: Bug
  Components: config
Affects Versions: 1.7
Reporter: Luis Filipe Nassif
Priority: Minor


If it is ok, *.exe must be included in application/x-msdownload glob pattern 
definitions. If it should be detected as application/x-dosexec, the hierarchy 
between application/x-dosexec, application/x-msdownload and PE based formats 
must be changed.





[jira] [Updated] (TIKA-1522) Exe being detected as application/x-msdownload

2015-01-19 Thread Nick Burch (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Burch updated TIKA-1522:
-
Component/s: (was: config)
 mime

 Exe being detected as application/x-msdownload
 --

 Key: TIKA-1522
 URL: https://issues.apache.org/jira/browse/TIKA-1522
 Project: Tika
  Issue Type: Bug
  Components: mime
Affects Versions: 1.7
Reporter: Luis Filipe Nassif
Assignee: Nick Burch
Priority: Minor
 Attachments: Search.exe


 If it is ok, *.exe must be included in application/x-msdownload glob pattern 
 definitions. If it should be detected as application/x-dosexec, the hierarchy 
 between application/x-dosexec, application/x-msdownload and PE based formats 
 must be changed.





[jira] [Assigned] (TIKA-1522) Exe being detected as application/x-msdownload

2015-01-19 Thread Nick Burch (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Burch reassigned TIKA-1522:


Assignee: Nick Burch

 Exe being detected as application/x-msdownload
 --

 Key: TIKA-1522
 URL: https://issues.apache.org/jira/browse/TIKA-1522
 Project: Tika
  Issue Type: Bug
  Components: mime
Affects Versions: 1.7
Reporter: Luis Filipe Nassif
Assignee: Nick Burch
Priority: Minor
 Attachments: Search.exe


 If it is ok, *.exe must be included in application/x-msdownload glob pattern 
 definitions. If it should be detected as application/x-dosexec, the hierarchy 
 between application/x-dosexec, application/x-msdownload and PE based formats 
 must be changed.





[jira] [Updated] (TIKA-1522) Exe being detected as application/x-msdownload

2015-01-19 Thread Luis Filipe Nassif (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luis Filipe Nassif updated TIKA-1522:
-
Attachment: Search.exe

Example PE file that triggers the issue.

 Exe being detected as application/x-msdownload
 --

 Key: TIKA-1522
 URL: https://issues.apache.org/jira/browse/TIKA-1522
 Project: Tika
  Issue Type: Bug
  Components: config
Affects Versions: 1.7
Reporter: Luis Filipe Nassif
Priority: Minor
 Attachments: Search.exe


 If it is ok, *.exe must be included in application/x-msdownload glob pattern 
 definitions. If it should be detected as application/x-dosexec, the hierarchy 
 between application/x-dosexec, application/x-msdownload and PE based formats 
 must be changed.





[jira] [Commented] (TIKA-1522) Exe being detected as application/x-msdownload

2015-01-19 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14282874#comment-14282874
 ] 

Nick Burch commented on TIKA-1522:
--

For that file, I'm getting {{{application/x-msdownload; format=pe32}}} from 
Tika App, which is a subtype of {{{application/x-msdownload;format=pe}}} which 
is a subtype of {{{application/x-msdownload}}}

{{{application/x-dosexec}}} is the mimetype which has the {{{.exe}}} glob, and 
that's also a subtype of {{{application/x-msdownload}}}

The problem is that a windows 32 bit DLL will also be detected as 
{{{application/x-msdownload; format=pe32}}}, and that shouldn't be 
{{{application/x-dosexec}}}

This might need some thought...

 Exe being detected as application/x-msdownload
 --

 Key: TIKA-1522
 URL: https://issues.apache.org/jira/browse/TIKA-1522
 Project: Tika
  Issue Type: Bug
  Components: mime
Affects Versions: 1.7
Reporter: Luis Filipe Nassif
Assignee: Nick Burch
Priority: Minor
 Attachments: Search.exe


 If it is ok, *.exe must be included in application/x-msdownload glob pattern 
 definitions. If it should be detected as application/x-dosexec, the hierarchy 
 between application/x-dosexec, application/x-msdownload and PE based formats 
 must be changed.





[jira] [Updated] (TIKA-1522) Exe being detected as application/x-msdownload

2015-01-19 Thread Nick Burch (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Burch updated TIKA-1522:
-
Assignee: (was: Nick Burch)

 Exe being detected as application/x-msdownload
 --

 Key: TIKA-1522
 URL: https://issues.apache.org/jira/browse/TIKA-1522
 Project: Tika
  Issue Type: Bug
  Components: mime
Affects Versions: 1.7
Reporter: Luis Filipe Nassif
Priority: Minor
 Attachments: Search.exe


 If it is ok, *.exe must be included in application/x-msdownload glob pattern 
 definitions. If it should be detected as application/x-dosexec, the hierarchy 
 between application/x-dosexec, application/x-msdownload and PE based formats 
 must be changed.





[jira] [Created] (TIKA-1523) metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0

2015-01-19 Thread Yamileydis Veranes (JIRA)
Yamileydis Veranes created TIKA-1523:


 Summary: metadata extractor gets the wrong number of pages of some 
documents Microsoft Word 9.0
 Key: TIKA-1523
 URL: https://issues.apache.org/jira/browse/TIKA-1523
 Project: Tika
  Issue Type: Bug
  Components: metadata
Affects Versions: 1.7
Reporter: Yamileydis Veranes


When I extract the metadata from a Microsoft Word 9.0 document which has 10 
pages extractor gives me the result that only has 1 page.





[jira] [Updated] (TIKA-1523) metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0

2015-01-19 Thread Yamileydis Veranes (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yamileydis Veranes updated TIKA-1523:
-
Environment: Ubuntu

 metadata extractor gets the wrong number of pages of some documents Microsoft 
 Word 9.0
 --

 Key: TIKA-1523
 URL: https://issues.apache.org/jira/browse/TIKA-1523
 Project: Tika
  Issue Type: Bug
  Components: metadata
Affects Versions: 1.7
 Environment: Ubuntu
Reporter: Yamileydis Veranes

 When I extract the metadata from a Microsoft Word 9.0 document which has 10 
 pages extractor gives me the result that only has 1 page.





[jira] [Updated] (TIKA-1523) metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0

2015-01-19 Thread Yamileydis Veranes (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yamileydis Veranes updated TIKA-1523:
-
Attachment: Sigmund Freud.doc

The metadata extractor obtains an incorrect number of pages for this document.

 metadata extractor gets the wrong number of pages of some documents Microsoft 
 Word 9.0
 --

 Key: TIKA-1523
 URL: https://issues.apache.org/jira/browse/TIKA-1523
 Project: Tika
  Issue Type: Bug
  Components: metadata
Affects Versions: 1.7
 Environment: Ubuntu
Reporter: Yamileydis Veranes
 Attachments: Sigmund Freud.doc


 When I extract the metadata from a Microsoft Word 9.0 document which has 10 
 pages extractor gives me the result that only has 1 page.


