Re: [Nepomuk] Review Request: Split the contents of odf files into words

Sebastian Trueg Wed, 24 Aug 2011 02:14:17 -0700

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://git.reviewboard.kde.org/r/102356/#review5974
-----------------------------------------------------------



Commenting on my own request: after a discussion with Jos Vandenoever I now
understand that each call to addText is indeed supposed to add another
fragment of text, where a fragment is a set of words. It is therefore up to
the indexer to add white space where appropriate.
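
To illustrate what "up to the indexer" could mean in practice, here is a
minimal sketch. It assumes a hypothetical FragmentJoiningIndexer class that
merely stands in for an indexer back end; the class name and its content()
accessor are made up for this example and are not part of Strigi. The sketch
only shows the idea of inserting a separator between fragments so that words
from adjacent addText calls are not glued together.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for an indexer back end (assumption: this is NOT
// Strigi's real writer interface). Each addText() call delivers one fragment.
class FragmentJoiningIndexer {
public:
    // Append one fragment. If text has already been collected and it does
    // not end in white space, insert a single space first, so the last word
    // of the previous fragment and the first word of this one stay separate.
    void addText(const char* text, std::size_t length) {
        assert(text != nullptr || length == 0);
        if (length == 0) {
            return;  // an empty fragment adds nothing, not even a separator
        }
        if (!content_.empty() && content_.back() != ' ' &&
            content_.back() != '\n') {
            content_ += ' ';
        }
        content_.append(text, length);
    }

    // The text collected so far, ready to be handed to a tokenizer.
    const std::string& content() const { return content_; }

private:
    std::string content_;
};
```

With this approach an analyser can keep calling addText once per cell or
paragraph, and line feeds inside a fragment are passed through untouched.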

- Sebastian


On Aug. 17, 2011, 7:20 p.m., Sebastian Trueg wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> http://git.reviewboard.kde.org/r/102356/
> -----------------------------------------------------------
> 
> (Updated Aug. 17, 2011, 7:20 p.m.)
> 
> 
> Review request for Nepomuk and Strigi.
> 
> 
> Summary
> -------
> 
> The problem is simple: when indexing the text from the cells in ods documents
> the analyser currently just calls addText for each cell. The backend (indexer)
> then concatenates all those strings, which in turn means invalid tokenization
> for full-text search.
> 
> xmlindexer and rdfindexer work around this by adding a newline after each
> block of text added via addText. This, however, is clearly wrong since 1. the
> API does not suggest that, 2. all other plugins - most prominently the text
> analyser - do not strip away any line feeds, and 3. restricting the API to a
> line-based interface would significantly lower its power.
> 
> Thus, the only correct approach is to take care of proper text handling in 
> the analysers. In this case the simplest way is to add a space after each 
> token.
> 
> 
> Diffs
> -----
> 
>   lib/helperanalyzers/odfcontenthelperanalyzer.h 4fbfd45 
>   lib/helperanalyzers/odfcontenthelperanalyzer.cpp d2a0a72 
> 
> Diff: http://git.reviewboard.kde.org/r/102356/diff
> 
> 
> Testing
> -------
> 
> Indexing an ods results in proper tokenization for cell content. When
> indexing an odt, the last word of a line is no longer concatenated with the
> first word of the next line.
> 
> 
> Thanks,
> 
> Sebastian
> 
>

_______________________________________________
Nepomuk mailing list
[email protected]
https://mail.kde.org/mailman/listinfo/nepomuk
