Hi Tibor,
> On Thu, 27 Jun 2013, [email protected] wrote:
>> So, my question should rather be double: (a) how to change id_bibrec
>> type
>
> Firstly, can you look at your fulltext terms and see whether they are
> relevant?  E.g.:
>
>   $ echo "SELECT term FROM idxWORD09F LIMIT 100" | dbexec
>
> If many of them look bogus, maybe you'd want to plug more aggressive
> stemming and/or check text extraction procedures.  This may be the best
> solution for the end users.

There is a lot of garbage, indeed, like this:

term
^A^A
^A^B^B^C^D^E^F^G
^A^B^C^D^B^E^F^E
^A^B^C
^A^E^B^G
^A^G
^A
^A&^Q
^A&^Q:f>257^Q$5^Q&5>2^R^A=
^A2^R^Q^A
^Aa5j
^Aa5j$^A2^R^Q^A#2>k^R
^At−1
^B^A
^B^At
^B^Ba
^B^Baa
^B^Baccessing
^B^Bacts
^B^Badopting
^B^Baegon
[...]
^P^P¦^P<8B>¦^P¦^P^P¦^P¦^P<8B>¦^P^P¦^P¦^Pµ!!1^X(ii<83>3<80>!(ii<98>c<81>%^V
^P^P¦^P<8B>¦^P¦^P^P¦^P¦^P<8B>¦^P^P¦^P¦g
^P^P¦^P<8B>¦^P¦^P^P¦^P¦^P<8B>¦^P^P¦^P¦¸<92>31d#%<82>%i3%c
^P^P¦^P<8B>¦^P¦^P^P¦^P¦^P<8B>¦^P^P¦^P¦¸<92>31d#%<82>%i3%c!eaiew(^Y<8B>!^X
^P^P¦^P<8B>¦^P¦^P^P¦^P¦^P<8B>¦^P^P¦^P®3g<97>ib`c^Vi<98>3^Y¯#i'xw12<83>¼
^P^P¦^P<8B>¦^P¦^P^P¦^P¦^P<8B>¦^P^P¦g
[...]

and so on. As you can see, there are thousands of entries with control
characters. Stemming is not yet configured, and text extraction is (mostly)
done with pdftotext. Maybe I did some tuning long ago that could influence
the presence of those control characters, but if so, it has been overwritten
at least during this last upgrade.

My question is: how can I «check text extraction procedures», as you
recommended?

> Secondly, since you do not have 16M records in your system, you should
> not need to alter idxWORD09R.id_bibrec, you only have to alter
> idxWORD09F.id from MEDIUMINT UNSIGNED to INT UNSIGNED.  This would
> overcome the 16M limit and should fix the troubles.

Thanks for the hint. I'll try it on my test machine first.

>> and (b) how to recreate fulltext index.
>
> If you want to do it on the production server and not on replica, then
> you can launch something like:
>
>   $ bibindex -u admin -w fulltext -a -i 1000-2000 -P -1

Would reindexing with those parameters be enough to remove my bogus entries?
I've checked, and the other indexes don't have control characters.

Thanks,
Ferran
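
P.S. To get a rough idea of how many terms are affected, something like the
following could be used. This is only a minimal sketch: the function names
(has_control_chars, split_terms) are made up for illustration, the sample
list is invented, and it assumes the terms have already been read into
Python strings by some other means (for example from the dbexec output). A
term is flagged if it contains any character in the Unicode "Cc" (control)
category, which covers the ^A..^_ range as well as bytes such as <8B>.

```python
import unicodedata


def has_control_chars(term: str) -> bool:
    """Return True if the term contains any Unicode control character (category Cc)."""
    return any(unicodedata.category(ch) == "Cc" for ch in term)


def split_terms(terms):
    """Partition an iterable of terms into (clean, bogus) lists."""
    clean, bogus = [], []
    for term in terms:
        if has_control_chars(term):
            bogus.append(term)
        else:
            clean.append(term)
    return clean, bogus


if __name__ == "__main__":
    # Hypothetical sample; real input would come from the index table.
    sample = ["\x01\x01", "\x02\x02accessing", "accessing", "adopting"]
    clean, bogus = split_terms(sample)
    print(f"{len(bogus)} of {len(sample)} terms contain control characters")
```

Comparing the count before and after a reindex on the test machine would
show whether the bogus entries are actually gone.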

