Tika relies on this information being present in the Word file; it is just metadata stored somewhere in the file. Tika does *not* count the words itself, so it depends on the application that saved the file. If OpenOffice/LibreOffice does not write the word count, Tika cannot return it. You can verify this on Microsoft Windows: right-click the Word file, select "Properties", and open the "Details" tab. Windows Explorer shows the metadata on this tab; if it does not display a word count, it is really not in the file.
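If a count is needed even when the saving application did not store one, one option is to read the stored value when it exists and otherwise count the words in the extracted text yourself. The following is only a minimal sketch of that idea, using plain Java and no Tika classes. It assumes the metadata arrives as `Key: value` lines like the `--metadata` output quoted below, and that the plain text has been extracted separately. The class and method names are hypothetical, and a whitespace-based count will not necessarily match the number Word itself would store.

```java
import java.util.OptionalInt;

// Hypothetical helper, not part of Tika: prefers a stored word count,
// falls back to counting whitespace-separated tokens.
class WordCountFallback {

    // Metadata keys that carry the word count in the output quoted in this thread.
    private static final String[] KEYS = {"Word-Count", "meta:word-count"};

    // Scans "Key: value" lines and returns the stored word count, if any.
    static OptionalInt wordCountFromMetadata(String metadataDump) {
        for (String line : metadataDump.split("\\R")) {
            String trimmed = line.trim();
            for (String key : KEYS) {
                if (trimmed.startsWith(key + ":")) {
                    String value = trimmed.substring(key.length() + 1).trim();
                    try {
                        return OptionalInt.of(Integer.parseInt(value));
                    } catch (NumberFormatException e) {
                        // Value is not a number; keep scanning the other lines.
                    }
                }
            }
        }
        return OptionalInt.empty();
    }

    // Naive fallback: counts whitespace-separated tokens in the extracted text.
    static int countWords(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0;
        }
        return trimmed.split("\\s+").length;
    }

    // Uses the stored count when present, otherwise the naive count.
    static int resolveWordCount(String metadataDump, String extractedText) {
        return wordCountFromMetadata(metadataDump)
                .orElseGet(() -> countWords(extractedText));
    }

    public static void main(String[] args) {
        String withCount = "Author: XXX\nPage-Count: 6\nWord-Count: 1812\n";
        String withoutCount = "Author: XXX\nContent-Length: 15986\n";
        System.out.println(resolveWordCount(withCount, "ignored text"));
        System.out.println(resolveWordCount(withoutCount, "one two three"));
    }
}
```

The stored value is preferred because it is what the authoring application computed; the fallback only guarantees that some number is available.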
-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [email protected]

> -----Original Message-----
> From: nilesh gorle [mailto:[email protected]]
> Sent: Wednesday, February 13, 2013 10:51 AM
> To: [email protected]
> Subject: Fwd: Query On Apache Tika
>
> ---------- Forwarded message ----------
> From: nilesh gorle <[email protected]>
> Date: 13 February 2013 11:38
> Subject: Query On Apache Tika
> To: [email protected]
>
> Hello,
>
> I am using Apache Tika. It's really a better choice.
> But I need your help with word counting. I used the following command to get
> WORD-COUNT from METADATA
>
> input -: java -jar tika_cmd.jar --metadata XXX.doc
>
> output -:
>
> Application-Name: Microsoft Office Word
> Author: XXX
> Character Count: 10329
> Company:
> Content-Length: 47616
> Content-Type: application/msword
> Creation-Date: 2012-08-01T14:34:00Z
> Edit-Time: 600000000
> Last-Modified: 2012-08-01T14:34:00Z
> Last-Printed: 2012-08-01T14:32:00Z
> Last-Save-Date: 2012-08-01T14:34:00Z
> Page-Count: 6
> Revision-Number: 2
> Template: Normal.dotm
> Word-Count: 1812
> cp:revision: 2
> creator: xXX
> date: 2012-08-01T14:34:00Z
> dc:creator: XXX
> dc:title: MUTUAL CONFIDENTIALITY AGREEMENT
> dcterms:created: 2012-08-01T14:34:00Z
> dcterms:modified: 2012-08-01T14:34:00Z
> extended-properties:Application: Microsoft Office Word
> extended-properties:Company:
> extended-properties:Template: Normal.dotm
> meta:author: XXX
> meta:character-count: 10329
> meta:creation-date: 2012-08-01T14:34:00Z
> meta:last-author: Roxanne Potgieter
> meta:page-count: 6
> meta:print-date: 2012-08-01T14:32:00Z
> meta:save-date: 2012-08-01T14:34:00Z
> meta:word-count: 1812
> modified: 2012-08-01T14:34:00Z
> resourceName: Confidentiality Agreement.doc
> title: MUTUAL CONFIDENTIALITY AGREEMENT
> xmpTPg:NPages: 6
>
> Now I am using the same command for other documents which are created in
> OpenOffice or LibreOffice and saved as doc, docx, xls, xlsx, ppt, pptx.
> For these I am not getting WORD-COUNT
>
> input -: java -jar tika_cmd.jar --metadata XXX.doc (XXX.doc is a
> file which is created in OpenOffice or LibreOffice)
>
> output -:
>
> Application-Name: Microsoft Excel
> Application-Version: 12.0000
> Author: XXX
> Content-Length: 15986
> Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
> Creation-Date: 2013-01-30T16:15:54Z
> Last-Modified: 2013-02-05T14:13:31Z
> Last-Save-Date: 2013-02-05T14:13:31Z
> creator: XXX
> date: 2013-01-30T16:15:54Z
> dc:creator: XXX
> dc:publisher: XXX
> dcterms:created: 2013-01-30T16:15:54Z
> dcterms:modified: 2013-02-05T14:13:31Z
> extended-properties:AppVersion: 12.0000
> extended-properties:Application: Microsoft Excel
> extended-properties:Company: XXX
> meta:author: XXX
> meta:creation-date: 2013-01-30T16:15:54Z
> meta:last-author: XXX
> meta:save-date: 2013-02-05T14:13:31Z
> modified: 2013-02-05T14:13:31Z
> protected: false
> publisher: leosys
> resourceName: XXX
>
> Please suggest why I am not getting WORD-COUNT
>
> --
> Thanks & Regards -:
>
> Nilesh G.
> [email protected]
