Hi Tibor,

> On Thu, 27 Jun 2013, [email protected] wrote:
>> So my question is really twofold: (a) how to change the id_bibrec
>> type
>
> Firstly, can you look at your fulltext terms and see whether they are
> relevant?  E.g.:
>
>    $ echo "SELECT term FROM idxWORD09F LIMIT 100" | dbexec
>
> If many of them look bogus, maybe you'd want to plug more aggressive
> stemming and/or check text extraction procedures.  This may be the best
> solution for the end users.

There is indeed a lot of garbage, like this:

 term
 ^A^A
 ^A^B^B^C^D^E^F^G
 ^A^B^C^D^B^E^F^E
 ^A^B^C
 ^A^E^B^G
 ^A^G
 ^A
 ^A&^Q
 ^A&^Q:f>257^Q$5^Q&5>2^R^A=
 ^A2^R^Q^A
 ^Aa5j
 ^Aa5j$^A2^R^Q^A#2>k^R
 ^At−1
 ^B^A
 ^B^At
 ^B^Ba
 ^B^Baa
 ^B^Baccessing
 ^B^Bacts
 ^B^Badopting
 ^B^Baegon
 [...]
 ^P^P¦^P<8B>¦^P¦^P^P¦^P¦^P<8B>¦^P^P¦^P¦^Pµ!!1^X(ii<83>3<80>!(ii<98>c<81>%^V
 ^P^P¦^P<8B>¦^P¦^P^P¦^P¦^P<8B>¦^P^P¦^P¦g
 ^P^P¦^P<8B>¦^P¦^P^P¦^P¦^P<8B>¦^P^P¦^P¦¸<92>31d#%<82>%i3%c
 ^P^P¦^P<8B>¦^P¦^P^P¦^P¦^P<8B>¦^P^P¦^P¦¸<92>31d#%<82>%i3%c!eaiew(^Y<8B>!^X
 ^P^P¦^P<8B>¦^P¦^P^P¦^P¦^P<8B>¦^P^P¦^P®3g<97>ib`c^Vi<98>3^Y¯#i'xw12<83>¼
 ^P^P¦^P<8B>¦^P¦^P^P¦^P¦^P<8B>¦^P^P¦g
 [...]

and so on.  As you can see, there are thousands of entries containing
control characters.  Stemming is not configured yet, and text extraction
is done (mostly) with pdftotext.  I may have done some tuning long ago
that affected the presence of those control characters, but if so, it
will have been overwritten by this last upgrade, if not earlier.
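
To put a number on the garbage, something like the following could
separate clean terms from bogus ones.  This is only a minimal sketch in
plain Python 3, not Invenio code: the function names are made up for
illustration, and it assumes the terms have already been dumped into a
list of strings (for example from the dbexec query above).

```python
import unicodedata


def has_control_chars(term):
    """Return True if the term contains a Unicode control character.

    Category "Cc" covers the C0 range (U+0000-U+001F), DEL (U+007F)
    and the C1 range (U+0080-U+009F), i.e. the ^A, ^B, <8B> style
    bytes seen in the listing.
    """
    return any(unicodedata.category(ch) == "Cc" for ch in term)


def split_terms(terms):
    """Partition an iterable of terms into (clean, bogus) lists."""
    clean, bogus = [], []
    for term in terms:
        if has_control_chars(term):
            bogus.append(term)
        else:
            clean.append(term)
    return clean, bogus


if __name__ == "__main__":
    # Hypothetical sample, modelled on the terms shown above.
    sample = ["\x01\x01", "\x02\x02accessing", "accessing", "\x01t\u22121", "aegon"]
    clean, bogus = split_terms(sample)
    print("clean terms:", clean)
    print("bogus terms: %d of %d" % (len(bogus), len(sample)))
```

The same check could be run on the raw pdftotext output of a few
documents, to see whether the control characters are already present
before indexing.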

My question is: how can I «check text extraction procedures», as you
recommended?

> Secondly, since you do not have 16M records in your system, you should
> not need to alter idxWORD09R.id_bibrec, you only have to alter
> idxWORD09F.id from MEDIUMINT UNSIGNED to INT UNSIGNED.  This would
> overcome the 16M limit and should fix the troubles.

Thanks for the hint.  I'll try on my test machine first.
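
For my own reference, the arithmetic behind the 16M limit, as a small
Python sketch.  The ranges are the standard MySQL ones for 3-byte and
4-byte unsigned integers; the helper name `fits` and the example id are
only illustrative.

```python
# Largest value each unsigned MySQL integer type can hold.
MEDIUMINT_UNSIGNED_MAX = 2**24 - 1  # 3 bytes
INT_UNSIGNED_MAX = 2**32 - 1        # 4 bytes


def fits(next_id, limit):
    """Return True if next_id can still be stored in a column with this limit."""
    return next_id <= limit


if __name__ == "__main__":
    print("MEDIUMINT UNSIGNED max:", MEDIUMINT_UNSIGNED_MAX)
    print("INT UNSIGNED max:", INT_UNSIGNED_MAX)
    # Hypothetical id just past the MEDIUMINT range.
    next_id = MEDIUMINT_UNSIGNED_MAX + 1
    print("fits in MEDIUMINT UNSIGNED:", fits(next_id, MEDIUMINT_UNSIGNED_MAX))
    print("fits in INT UNSIGNED:", fits(next_id, INT_UNSIGNED_MAX))
```

So once the term ids in idxWORD09F.id pass roughly 16.7 million, no new
term can get an id until the column is widened.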

>> and (b) how to recreate fulltext index.
>
> If you want to do it on the production server and not on replica, then
> you can launch something like:
>
>    $ bibindex -u admin -w fulltext -a -i 1000-2000 -P -1

Would reindexing with those parameters be enough to remove my bogus
entries?  I've checked, and the other indexes don't contain control
characters.

Thanks,

Ferran
