-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://git.reviewboard.kde.org/r/113217/
-----------------------------------------------------------

Review request for Nepomuk.


Repository: nepomuk-core


Description
-------

This patch adds a file extractor for DOC, XLS and PPT files (the binary MS 
Office formats). The current version of the extractor is very simple and only 
indexes the plain-text content of the files (neither title nor owner 
information is extracted). The extractor is a thin wrapper around the "catdoc", 
"catppt" and "xls2csv" command-line utilities, which Debian and openSUSE ship 
in their "catdoc" package.

These utilities are released under the GNU GPLv2. If I recall correctly, the 
LGPLv2.1 Nepomuk libraries can use these tools as long as they only run them as 
separate programs and make no library calls into them. The extractor uses 
QProcess to launch an instance of catdoc, catppt or xls2csv, passes it the name 
of the file to index, and reads the plain text from the standard output of 
that process. I hope this complies with the GPL.
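
To illustrate the "separate process, read its standard output" approach 
described above, here is a minimal sketch. It is not the code from the patch: 
the patch uses QProcess, while this sketch swaps in plain POSIX calls (fork, 
execvp, pipe) so that it compiles without Qt. The function name runAndCapture 
is hypothetical.

```cpp
#include <cerrno>
#include <string>
#include <vector>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper (not from the patch): runs `program` with `args`
// without going through a shell and stores its standard output in `output`.
// Returns false if the process could not be started or exited with a
// non-zero status.
bool runAndCapture(const std::string &program,
                   const std::vector<std::string> &args,
                   std::string &output)
{
    output.clear();

    int fds[2];
    if (pipe(fds) != 0)
        return false;

    const pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return false;
    }

    if (pid == 0) {
        // Child: send stdout into the pipe, then replace ourselves
        // with the external tool.
        close(fds[0]);
        if (dup2(fds[1], STDOUT_FILENO) < 0)
            _exit(127);
        close(fds[1]);

        std::vector<char *> argv;
        argv.push_back(const_cast<char *>(program.c_str()));
        for (std::vector<std::string>::size_type i = 0; i < args.size(); ++i)
            argv.push_back(const_cast<char *>(args[i].c_str()));
        argv.push_back(0);

        execvp(program.c_str(), argv.data());
        _exit(127); // only reached if exec failed
    }

    // Parent: read everything the tool prints until it closes stdout.
    close(fds[1]);
    char buffer[4096];
    for (;;) {
        const ssize_t n = read(fds[0], buffer, sizeof buffer);
        if (n > 0)
            output.append(buffer, static_cast<std::string::size_type>(n));
        else if (n == 0 || errno != EINTR)
            break;
    }
    close(fds[0]);

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return false;
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

With such a helper, indexing a document would amount to calling 
runAndCapture("catdoc", {path}, text) and handing text to the indexer. Because 
the file name is passed as a separate argument and no shell is involved, file 
names containing spaces or quotes need no escaping.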

The commands are located at run-time using KStandardDirs. This way, no new 
build dependency is added to Nepomuk, and it is up to the user or the 
distribution to add "catdoc" to the dependency list of Nepomuk. If a command is 
not found, the extractor is disabled for the MIME type handled by that 
command.
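
The run-time lookup described above could look roughly like the sketch below. 
It is only an illustration under stated assumptions: the patch relies on 
KStandardDirs, whereas this sketch searches a PATH-style directory list by 
hand so that it needs no KDE libraries. The function names and the exact MIME 
type strings are assumptions, not taken from the patch.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper: searches each directory of a colon-separated list
// for an executable regular file called `name`. Returns its full path,
// or an empty string if nothing was found.
std::string findExecutable(const std::string &name, const std::string &pathList)
{
    std::istringstream dirs(pathList);
    std::string dir;
    while (std::getline(dirs, dir, ':')) {
        if (dir.empty())
            continue;
        const std::string candidate = dir + "/" + name;
        struct stat info;
        if (stat(candidate.c_str(), &info) == 0 && S_ISREG(info.st_mode)
            && access(candidate.c_str(), X_OK) == 0)
            return candidate;
    }
    return std::string();
}

// Maps each supported MIME type to the resolved path of its tool.
// A MIME type whose tool is missing is simply left out, which disables
// extraction for that type only. The MIME strings are assumed values.
std::map<std::string, std::string> availableExtractors(const std::string &pathList)
{
    static const std::map<std::string, std::string> tools = {
        {"application/msword", "catdoc"},
        {"application/vnd.ms-excel", "xls2csv"},
        {"application/vnd.ms-powerpoint", "catppt"},
    };

    std::map<std::string, std::string> found;
    for (const auto &entry : tools) {
        const std::string path = findExecutable(entry.second, pathList);
        if (!path.empty())
            found[entry.first] = path;
    }
    return found;
}
```

A caller would typically pass the value of the PATH environment variable as 
pathList. If only catdoc is installed, the returned map contains a single 
entry for Word documents, and XLS and PPT files are skipped.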


Diffs
-----

  services/fileindexer/indexer/officeextractor.cpp PRE-CREATION 
  services/fileindexer/indexer/officeextractor.h PRE-CREATION 
  services/fileindexer/indexer/nepomukofficeextractor.desktop PRE-CREATION 

Diff: http://git.reviewboard.kde.org/r/113217/diff/


Testing
-------

I have run the indexer on several DOC, XLS and PPT files I have on my computer. 
The indexer doesn't work on encrypted files (catdoc refuses to parse them). 
This is a nuisance because some interesting Excel files are password-protected 
only on selected sheets, or only the editing of certain cells is restricted. 
The rest of such a file can contain valuable data and should be indexed.


Thanks,

Denis Steckelmacher
