Dear Goetz,

I have the same requirement (patent documents containing text in different 
languages).
I ended up splitting/filtering each original document into localized parts, 
each inserted into a separate collection (every collection has its own 
full-text index configuration).
BaseX is as flexible as our data!
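
The splitting step could be sketched roughly as follows. This is a minimal 
Python illustration using only the standard library, not the code actually 
used here. The attribute names LANG and LG are taken from the question quoted 
below; the sample document, the function names and the collection naming 
mentioned in the comment are made up for the example. Creating the 
per-language collections in BaseX is not shown.

```python
import copy
import xml.etree.ElementTree as ET

# Attribute names that carry the language code. LANG and LG come from the
# quoted question; real documents may use other names (assumption).
LANG_ATTRS = ("LANG", "LG")


def element_lang(elem):
    """Return the language code declared on elem, or None if it has none."""
    for name in LANG_ATTRS:
        if name in elem.attrib:
            return elem.attrib[name]
    return None


def prune(elem, lang):
    """Remove, in place, every descendant tagged with a language other than lang."""
    for child in list(elem):
        child_lang = element_lang(child)
        if child_lang is not None and child_lang != lang:
            elem.remove(child)
        else:
            # Language-neutral or matching elements are kept and searched further.
            prune(child, lang)


def split_by_language(root):
    """Return {lang: pruned deep copy of root} for every language found."""
    langs = {element_lang(e) for e in root.iter()} - {None}
    parts = {}
    for lang in sorted(langs):
        part = copy.deepcopy(root)
        prune(part, lang)
        parts[lang] = part
    return parts


# Made-up sample document, for illustration only.
SAMPLE = """<patent id="EP1">
  <title LANG="EN">Valve</title>
  <title LANG="DE">Ventil</title>
  <abstract LG="EN">A valve.</abstract>
  <abstract LG="DE">Ein Ventil.</abstract>
  <date>2015-04-22</date>
</patent>"""

parts = split_by_language(ET.fromstring(SAMPLE))
for lang, part in parts.items():
    # Each part would then be stored in its own collection,
    # e.g. one named "patents-" + lang.lower() (hypothetical naming).
    print(lang, ET.tostring(part, encoding="unicode"))
```

Language-neutral elements (the date in the sample) are copied into every 
part, so each collection stays self-contained at the cost of some 
duplication.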

Best regards,


From: [email protected] 
[mailto:[email protected]] On behalf of Goetz Heller
Sent: Wednesday, 22 April 2015 10:50
To: [email protected]
Subject: [basex-talk] multi-language full-text indexing

I'm working with documents destined to be consumed anywhere in the European 
Community. Many of them contain the same tags multiple times, each with a 
different language attribute. It therefore does not make sense to create a 
single full-text index over the whole of these documents. It would be 
desirable to have documents indexed by locale-specific parts, e.g.

CREATE FULL-TEXT INDEX ON DATABASE XY STARTING WITH (
(path_a)/LOCALIZED_PART_A[@LANG=$lang],
(path_b)/LOCALIZED_PART_B[@LG=$lang],...
) FOR LANGUAGE $lang IN (
BG,
DN,
DE WITH STOPWORDS filepath_de WITH STEM = YES,
EN WITH STOPWORDS filepath_en,
FR, ...
)  [USING language_code_map]
and then to write full-text retrieval queries with a clause such as 'FOR 
LANGUAGE BG'. The index parts would be much smaller, and full-text retrieval 
therefore much faster. The language codes would be mapped to the standard 
values recognized by BaseX in the language_code_map file.
Are there any efforts towards such a feature?
