Hi Alex,

Thank you very much for your responses. Notes inline.
On Tue, Sep 20, 2011 at 4:37 AM, Alex Vandiver <[email protected]> wrote:
> On Mon, 2011-09-19 at 13:24 +1000, fab junkmail wrote:
>> 2011-09-19 02:08:28 UTC ERROR: string is too long for tsvector
>> (3831236 bytes, max 1048575 bytes)
>> 2011-09-19 02:08:28 UTC STATEMENT: UPDATE Attachments SET
>> ContentIndex = to_tsvector($1) WHERE id = $2
>>
>> I think it is getting to a ticket that has too many unique words, so it
>> can't index it, and it critically fails and stops indexing any further.
>
> You are correct that this is because the content of one of the
> attachments contains too many unique words (after removing stopwords and
> doing stemming). This is symptomatic of a pathological case -- for
> example, the entirety of "A Tale of Two Cities" (775K) creates a 121K
> tsvector, and the entire corpus of the King James Bible (4.3M) creates a
> 160K tsvector. In contrast, the contents of my /usr/share/dict/words
> (916K) produces a 524K tsvector, because there is so little word
> repetition.
>
> Knowing what text/plain or text/html corpus you have in your database
> which is blowing so significantly past this limit (generating a 3.8M
> tsvector is impressive) would be interesting. I suspect the data in
> question is not actually textual data. If you re-run
> rt-fulltext-indexer with --debug, the last attachment number it prints
> will tell you which attachment is the problematic one.

I used the --debug option and was able to find the ticket that caused the
problem. The ticket has a 12MB, 125,000-line .txt attachment that is a
file system listing with full paths, e.g.:

...
.//Accounts/Backup/New Folder
.//Accounts/Backup/PRM0124.ZIP
.//Accounts/Backup/PRM0206.ZIP
.//Accounts/Backup/PRM0207.ZIP
.//Accounts/Backup/PRM0214.1.ZIP
...

So in many cases it would have counted the whole line as a unique word,
which explains how the tsvector for this attachment grew to over 3MB.
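To make the "unique words" effect concrete: a tsvector stores each distinct
lexeme once, so its size tracks the number of *unique* tokens, not the input
size. A path listing where every line is a distinct token therefore grows the
vector roughly linearly with the file, while repetitive prose barely grows it
at all. This is a rough model in Python (ignoring PostgreSQL's stemming and
stopword removal); the sample strings are made up for illustration:

```python
# Rough model of why a file listing blows up a tsvector while prose does not.
# A tsvector stores each DISTINCT lexeme once, so size tracks unique tokens.
# (Simplified: no stemming, no stopword removal.)

def unique_token_stats(text: str):
    tokens = text.split()
    return len(tokens), len(set(tokens))

# Repetitive prose: many tokens, very few distinct ones.
prose = "the cat sat on the mat and the cat sat again " * 1000

# A path listing: nearly every "word" (a full path) is unique.
listing = "\n".join(f".//Accounts/Backup/PRM{n:04}.ZIP" for n in range(10000))

print(unique_token_stats(prose))    # 11000 tokens, only 7 distinct
print(unique_token_stats(listing))  # 10000 tokens, 10000 distinct
```

With the same order of input size, the prose would index into a tiny vector
while the listing's vector is as large as the input itself, which matches the
12MB attachment producing a 3.8M tsvector.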
>> I would appreciate some advice on how I can proceed with getting the
>> rest of my data indexed. I think any of the following would be
>> suitable, but I don't know how to implement them (I am not a coder or a
>> DBA) and could use some help. Options:
>>
>> - modify the rt-fulltext-indexer script to truncate strings that are
>>   "too long for tsvector", or
>
> As pointed out above, long corpuses can generate perfectly reasonably
> sized tsvectors. Truncating your input strings before indexing will
> yield false negatives in perfectly reasonable text; as such, the change
> from the wiki will not be taken into core.
>
>> - modify the rt-fulltext-indexer script to skip tickets that have that
>>   issue and continue indexing other tickets, or
>
> rt-fulltext-indexer currently iterates over every attachment's content
> and updates the tsvector one at a time; as such, modifying it to trap
> the update in an eval {} block and continue for particular error cases
> should be completely feasible.
>
>> - find out which ticket is causing the problem (hopefully only one)
>> and maybe I can delete it before running the rt-fulltext-indexer
>> script.
>
> As I noted above, I suspect the row in question is not actually textual
> data, despite being marked text/plain. Running with --debug may shed
> some light on the contents which are at issue.
>  - Alex

I have deleted (using shredder) the problem ticket and I am re-running
"/opt/rt4/sbin/rt-fulltext-indexer-mod --all". It has been running for a
few minutes now, so it looks like it has got past the problem and is
working. If it hits another problem, I now know how to work around it.

Thanks very much for your help, Alex.
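For anyone following along, the skip-and-continue approach Alex describes
would, in RT's Perl indexer, mean wrapping the per-attachment UPDATE in an
eval {} block. A sketch of the same pattern in Python for illustration; the
exception class, `index_attachment()`, and the toy size limit are stand-ins,
not RT's real code:

```python
# Sketch of "trap the update and continue" from the thread. In RT this would
# be a Perl eval {} around the UPDATE; here the pattern is shown in Python.

class TsvectorTooLong(Exception):
    """Stand-in for the database's 'string is too long for tsvector' error."""

def index_attachment(attachment_id: int, content: str) -> None:
    # Placeholder for:
    #   UPDATE Attachments SET ContentIndex = to_tsvector($1) WHERE id = $2
    # A toy limit of 5 unique tokens stands in for the 1048575-byte cap.
    if len(set(content.split())) > 5:
        raise TsvectorTooLong(f"string is too long for tsvector (id {attachment_id})")

def index_all(attachments):
    """Index every attachment, logging and skipping the pathological ones."""
    skipped = []
    for att_id, content in attachments:
        try:
            index_attachment(att_id, content)
        except TsvectorTooLong as err:
            # Log and move on instead of aborting the whole run.
            print(f"skipping attachment {att_id}: {err}")
            skipped.append(att_id)
    return skipped

attachments = [
    (1, "normal ticket text"),
    (2, " ".join(f".//path/file{n}.zip" for n in range(100))),  # pathological
    (3, "another normal ticket"),
]
print(index_all(attachments))  # → [2]
```

The key point is that the error is trapped per attachment, so one oversized
row no longer stops the entire indexing run; deleting the ticket, as done
above, is the simpler alternative when only one row is affected.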
Regards,
Anthony

--------
RT Training Sessions (http://bestpractical.com/services/training.html)
* Chicago, IL, USA: September 26 & 27, 2011
* San Francisco, CA, USA: October 18 & 19, 2011
* Washington DC, USA: October 31 & November 1, 2011
* Melbourne VIC, Australia: November 28 & 29, 2011
* Barcelona, Spain: November 28 & 29, 2011
