Re: ZeroDivisionError in bibrank_word_indexer

Henning Hraban Ramm Mon, 16 Dec 2013 09:56:15 -0800

On 2013-12-16 at 16:55, Tibor Simko <[email protected]> wrote:

> On Fri, 06 Dec 2013, [email protected] wrote:
>> But it looks like I *should* customize it:
>> English stemming makes probably no sense for Russian or Kyrgyz
>> content.
> 
> Yes, definitely, you should customise it WRT your tag selection and the
> deemed importance of information in various fields, and then WRT
> language used.


Yes, I adapted wrd.cfg and haven’t seen that error for five days, so it looks 
successful!

>> But there’s probably no algorithm for these languages anyway.
> Russian is supported.  Kyrgyz is not.

Maybe we could use a Turkish stemmer - Kyrgyz is related.

>> And I don’t understand why one should define the stemming language per
>> field - I guess we’re not the first library with content in different
>> languages.
> 
> We used to store different languages in different fields.  E.g. CERN
> bulletin is bilingual English/French and the articles look like this:
>   http://cds.cern.ch/record/1633174/export/hm

Yes, I saw this. But it looks to me like bad database design to use different 
fields for different languages - we don’t just have two of them.
Even if most of our content is in Russian or Kyrgyz, there are also some 
English, Uzbek, Tajik (Persian), Kazakh and maybe Dungan (Chinese) documents 
and media files. Central Asia is really a diverse region.

We define the content language as a 3-letter ISO code in 041__a, which seemed 
the most logical choice to me. We could switch to a different ISO code 
version if that makes sense.
But "in my book" we should be able to tell BibRank (or any other module that 
needs stemming) where to look for the language of the record, so that it uses 
an appropriate stemmer (or none).
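
A minimal sketch of that idea, not BibRank code: the record is assumed to be a plain dict keyed by MARC tag, and both the mapping table and the function name `stemmer_language_for` are hypothetical. It only shows how a 3-letter code from 041__a could select a stemmer language, with "no stemming" as the fallback.

```python
# Hypothetical mapping from the 3-letter codes stored in 041__a to the
# two-letter stemmer codes used in wrd.cfg. None means "do not stem".
STEMMER_FOR_LANGUAGE = {
    "rus": "ru",
    "eng": "en",
    "kir": None,  # Kyrgyz: no stemmer available
}


def stemmer_language_for(record, language_tag="041__a"):
    """Return the stemmer code for a record, or None if it should not be stemmed.

    `record` is assumed to be a dict such as {"041__a": ["rus"]}; this is an
    illustration format, not Invenio's internal record structure.
    """
    values = record.get(language_tag, [])
    if not values:
        return None  # no language recorded: safer not to stem at all
    # Unknown languages also fall back to "no stemming".
    return STEMMER_FOR_LANGUAGE.get(values[0].strip().lower())


if __name__ == "__main__":
    for rec in ({"041__a": ["rus"]}, {"041__a": ["kir"]}, {}):
        print(rec, "->", stemmer_language_for(rec))
```

Falling back to no stemming for unknown or missing languages avoids applying, say, an English stemmer to Kyrgyz text.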

I know, it’s an open-source project, so I should propose a patch myself. ;-)

> MARC-wise, we should ideally make use of fields such as 242 (title
> translation) and read language information from the subfield there:
>   http://www.loc.gov/marc/bibliographic/bd242.html

But "title translation" doesn’t sound like original titles in different 
languages, does it?

> While this is already possible and we are using this technique for many
> modules, BibRank does not understand it yet.

A pity.

>> These are two of the records where the indexer crashes:
> 
> Thanks.  Many of the fields are not recognised, e.g. 653 in the records
> vs 6531/6532 in the default wrd.cfg.  Please try to (i) amend wrd.cfg;
> (ii) hard-delete your phantom records; (iii) rebalance ranking weights
> again to see if things improve.

As I wrote above - I tried, and it helped.
But I’m not sure whether what I configured makes sense.

[word_similarity]
stemming = ru
table = rnkWORD01F
stopword = False
relevance_number_output_prologue = (
relevance_number_output_epilogue = )
#MARC tag, tag points, tag language
tag1 = 653__a, 2, ru
tag2 = 245__%, 10, ru
tag3 = 520__%, 2, ru
tag4 = 852__%, 2, en
tag5 = 100__%, 3, none
tag6 = 700__%, 2, none
tag7 = 490__%, 5, ru
tag8 = 260__%, 1, ru
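
To sanity-check such a file, a small script can list each tag with its points and language. This is only a sketch using Python's standard `configparser`, not the way Invenio itself reads wrd.cfg; the function name `read_tags` is made up for illustration. Interpolation is switched off because the `%` wildcards in the tag values would otherwise be treated as interpolation syntax.

```python
import configparser

# The configuration from this mail, pasted verbatim.
WRD_CFG = """\
[word_similarity]
stemming = ru
table = rnkWORD01F
stopword = False
relevance_number_output_prologue = (
relevance_number_output_epilogue = )
#MARC tag, tag points, tag language
tag1 = 653__a, 2, ru
tag2 = 245__%, 10, ru
tag3 = 520__%, 2, ru
tag4 = 852__%, 2, en
tag5 = 100__%, 3, none
tag6 = 700__%, 2, none
tag7 = 490__%, 5, ru
tag8 = 260__%, 1, ru
"""


def read_tags(cfg_text):
    """Return a list of (marc_tag, points, language) tuples from the config text."""
    # interpolation=None keeps the '%' in values such as '245__%' literal.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    tags = []
    for key, value in parser["word_similarity"].items():
        if key.startswith("tag"):
            marc, points, lang = [part.strip() for part in value.split(",")]
            tags.append((marc, int(points), lang))
    return tags


if __name__ == "__main__":
    tags = read_tags(WRD_CFG)
    total = sum(points for _, points, _ in tags)
    for marc, points, lang in tags:
        print(f"{marc}  points={points}  share={points / total:.0%}  lang={lang}")
```

Printing the share of each tag makes it easier to see how strongly the title (245) dominates the ranking compared with the other fields.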


BTW again thank you for your valuable support!


The library bus is now on its way. Not all of the files are properly indexed, 
but it’s quite usable.

Over the holidays the server will come back to me, and then I can make further 
fixes beyond what it can fetch automatically. It runs a cron job that checks 
for an internet connection and, if there is one, tries to update from my git 
repo and synchronize database and media files, as outlined in the other thread. 
BTW, MySQL master-slave replication didn’t work, and I had no time to research 
what the problem was, so we resorted to copying and installing SQL dumps.
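
The "check for a connection, then update" step of such a cron job could look roughly like the following. This is a sketch under assumptions, not the actual script on the bus server: the probe host, the repository path and the function names are placeholders, and the database and media synchronization is left out.

```python
import socket
import subprocess


def has_internet(host="example.org", port=443, timeout=5,
                 connect=socket.create_connection):
    """Return True if a TCP connection to host:port can be opened.

    The host is a placeholder; `connect` is injectable so the check can be
    exercised without a real network.
    """
    try:
        conn = connect((host, port), timeout=timeout)
    except OSError:  # covers DNS failures, timeouts and refused connections
        return False
    conn.close()
    return True


def update_repo(repo_dir, online=has_internet, run=subprocess.run):
    """Pull the latest changes into repo_dir, but only when online."""
    if not online():
        return False
    # --ff-only refuses to create merge commits on the unattended server.
    run(["git", "-C", repo_dir, "pull", "--ff-only"], check=True)
    return True


if __name__ == "__main__":
    # Hypothetical checkout location; adjust to the real path.
    print("updated" if update_repo("/opt/library-config") else "offline, skipped")
```

Returning quietly when offline keeps the cron job from sending error mail every time the bus is out of range.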

Greetlings, Hraban
---
http://www.fiee.net
https://www.cacert.org (I'm an assurer)
