[Nepomuk] Review Request: Split the contents of odf files into words

Sebastian Trueg Wed, 17 Aug 2011 12:20:33 -0700

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://git.reviewboard.kde.org/r/102356/
-----------------------------------------------------------


Review request for Nepomuk and Strigi.


Summary
-------

The problem is simple: when indexing the text from the cells in ods documents 
the analyser currently simply calls addText for each cell. As a result, the 
backend (indexer) concatenates all those strings, which in turn means invalid 
tokenization for full-text search.

xmlindexer and rdfindexer work around this by adding a newline after each block 
of text added via addText. This, however, is clearly wrong since 1. nothing in 
the API suggests that addText is line-based, 2. all other plugins - most 
prominently the text analyser - do not strip away any line feeds, so they would 
end up with extra newlines, and 3. restricting the API to a line-based 
interface would significantly lower its power.

Thus, the only correct approach is to take care of proper text handling in the 
analysers. In this case the simplest way is to add a space after each token.
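
To illustrate the difference, here is a minimal stand-alone sketch. It is not 
the actual patch and does not use the Strigi classes: MockIndexSink, tokenize, 
indexCellsNaive and indexCellsWithSeparator are hypothetical names made up for 
this example. The mock sink only assumes what is described above, namely that 
the backend appends every chunk passed to addText without any separator.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for the indexer backend (NOT a Strigi class).
// Assumption: every addText() chunk is appended without a separator.
class MockIndexSink {
public:
    void addText(const std::string& chunk) { buffer_ += chunk; }
    const std::string& text() const { return buffer_; }
private:
    std::string buffer_;
};

// Whitespace split, roughly what a full-text tokenizer would see.
std::vector<std::string> tokenize(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> tokens;
    for (std::string word; in >> word;) tokens.push_back(word);
    return tokens;
}

// Current behaviour: one addText() call per cell, nothing in between.
void indexCellsNaive(MockIndexSink& sink,
                     const std::vector<std::string>& cells) {
    for (const std::string& cell : cells) sink.addText(cell);
}

// Proposed behaviour: the analyser adds a space after each cell's text.
void indexCellsWithSeparator(MockIndexSink& sink,
                             const std::vector<std::string>& cells) {
    for (const std::string& cell : cells) {
        if (cell.empty()) continue;  // skip empty cells
        sink.addText(cell);
        sink.addText(" ");
    }
}

// Usage example: the same three cells indexed both ways.
void demo() {
    const std::vector<std::string> cells = {"apples", "42", "pears"};

    MockIndexSink naive;
    indexCellsNaive(naive, cells);
    assert(tokenize(naive.text()).size() == 1);  // one fused token

    MockIndexSink fixed;
    indexCellsWithSeparator(fixed, cells);
    assert(tokenize(fixed.text()).size() == 3);  // one token per cell
}
```

The trailing space after the last cell is harmless for a whitespace-based 
tokenizer, which keeps the analyser side trivial: it never needs to know 
whether another cell follows.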


Diffs
-----

  lib/helperanalyzers/odfcontenthelperanalyzer.h 4fbfd45 
  lib/helperanalyzers/odfcontenthelperanalyzer.cpp d2a0a72 

Diff: http://git.reviewboard.kde.org/r/102356/diff


Testing
-------

Indexing an ods results in proper tokenization for cell content. Indexing an 
odt results in the last word of a line not being concatenated with the first 
word of the next line.


Thanks,

Sebastian

_______________________________________________
Nepomuk mailing list
[email protected]
https://mail.kde.org/mailman/listinfo/nepomuk
