On 2013-12-16 at 16:55, Tibor Simko <[email protected]> wrote:

> On Fri, 06 Dec 2013, [email protected] wrote:
>> But it looks like I *should* customize it:
>> English stemming makes probably no sense for Russian or Kyrgyz
>> content.
>
> Yes, definitely, you should customise it WRT your tag selection and the
> deemed importance of information in various fields, and then WRT
> language used.

Yes, I adapted wrd.cfg and didn’t get that error for five days - looks successful!

>> But there’s probably no algorithm for these languages anyway.
> Russian is supported. Kyrgyz is not.

Maybe we could use a Turkish stemmer - Kyrgyz is related.

>> And I don’t understand why one should define the stemming language per
>> field - I guess we’re not the first library with content in different
>> languages.
>
> We used to store different languages in different fields. E.g. CERN
> bulletin is bilingual English/French and the articles look like this:
> http://cds.cern.ch/record/1633174/export/hm

Yes, I saw this. But it looks to me like bad database design to use different fields for different languages - we don’t just have two of them. Even if most of our content is in Russian or Kyrgyz, there are also some English, Uzbek, Tadjik (Persian), Kazakh and maybe Dungan (Chinese) documents and media files. Central Asia is really a diverse region.

We define the content language as a 3-letter ISO code in 041__a; that seemed the most logical choice to me. We could change that to a different ISO code version if it makes sense. But „in my book“ we should be able to tell BibRank (or any other module that needs stemming) where to look for the language of the record and use an appropriate stemmer (or none).
I know, it’s an OS project, I should propose a patch myself. ;-)

> MARC-wise, we should ideally make use of fields such as 242 (title
> translation) and read language information from the subfield there:
> http://www.loc.gov/marc/bibliographic/bd242.html

But „title translation“ doesn’t sound like original titles in different languages, does it?

> While this is already possible and we are using this technique for many
> modules, BibRank does not understand it yet.

A pity.

>> These are two of the records where the indexer crashes:
>
> Thanks. Many of the fields are not recognised, e.g. 653 in the records
> vs 6531/6532 in the default wrd.cfg.
> Please try to (i) amend wrd.cfg;
> (ii) hard-delete your phantom records; (iii) rebalance ranking weights
> again to see if things improve.

As I wrote above - I tried, and it helped. But I’m not sure whether what I configured makes sense.

[word_similarity]
stemming = ru
table = rnkWORD01F
stopword = False
relevance_number_output_prologue = (
relevance_number_output_epilogue = )
#MARC tag, tag points, tag language
tag1 = 653__a, 2, ru
tag2 = 245__%, 10, ru
tag3 = 520__%, 2, ru
tag4 = 852__%, 2, en
tag5 = 100__%, 3, none
tag6 = 700__%, 2, none
tag7 = 490__%, 5, ru
tag8 = 260__%, 1, ru

BTW again thank you for your valuable support!
The library bus is now on its way. Not all of the files are properly indexed, but it’s quite usable. Over the holidays the server will come back to me, then I can do some more fixes beyond what it can get automatically.
It runs a cronjob to check whether it has an internet connection and then tries to update from my git repo and synchronize database and media files, as outlined in the other thread.

BTW MySQL master-slave replication didn’t work. I had no time to research what the problem was, so we resorted to copying and installing SQL dumps.

Greetlings, Hraban
---
http://www.fiee.net
https://www.cacert.org (I'm an assurer)
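
P.S. To make the per-record idea from above a bit more concrete, here is a rough Python sketch, not a patch. It only shows the lookup I have in mind: take the language code(s) a record carries in 041__a and decide which stemmer to use, or none. All names (STEMMER_FOR_LANGUAGE, pick_stemmer) are made up for illustration and do not exist in Invenio. I assume that the first 041__a value is the main language and that stemmers are named by the two-letter codes used in wrd.cfg ("ru", "en").

```python
# Hypothetical mapping from the 3-letter code stored in 041__a to the
# stemmer language as written in wrd.cfg. Languages without a stemmer
# (e.g. Kyrgyz, "kir") are simply left out and therefore get no stemming.
STEMMER_FOR_LANGUAGE = {
    "rus": "ru",
    "eng": "en",
}


def pick_stemmer(language_codes, mapping=STEMMER_FOR_LANGUAGE):
    """Return the stemmer language for a record, or None for no stemming.

    language_codes: the values of 041__a for one record, in field order.
    Assumption: the first value is the record's main language.
    """
    if not language_codes:
        return None  # no language recorded: do not stem at all
    main_language = language_codes[0].strip().lower()
    return mapping.get(main_language)  # None if there is no stemmer
```

With this, a record tagged "rus" would be stemmed as Russian, while a Kyrgyz record ("kir") or one without 041__a would not be stemmed at all. Trying a Turkish stemmer for Kyrgyz would just be one more entry in the mapping.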

