Hi Christian,

No problem.  I am always happy to help.

I did not try that, as I did not have time to implement something
like that for the project.  On my machine and on the VPS we were
using, at least, it was fast enough for most uses.  I also told the
user about the problem, as "using diacritics" was an "all or nothing"
setting and there was not much I could do to make it more granular.
They were content with that, but since the subject came up I thought
I would mention it.
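
Incidentally, here is a rough sketch of the kind of granularity I had
in mind.  This is hypothetical Python, not anything BaseX provides: it
folds away the combining macron that editors add as a length mark while
keeping original acutes, so á and a stay distinct but ā collapses to a
before terms go into a separate index database.

```python
import unicodedata

def strip_editorial_macrons(term: str) -> str:
    """Fold editorial macrons (U+0304) away, but keep acute accents.

    Decompose to NFD so each diacritic becomes a separate combining
    mark, drop only the combining macron, then recompose to NFC.
    """
    decomposed = unicodedata.normalize("NFD", term)
    kept = "".join(ch for ch in decomposed if ch != "\u0304")
    return unicodedata.normalize("NFC", kept)

# An acute (present in the original manuscript) survives;
# a macron (added by a modern editor) becomes a plain vowel.
print(strip_editorial_macrons("fír"))  # acute kept: fír
print(strip_editorial_macrons("fīr"))  # macron stripped: fir
```

One would run every term through this before inserting it into the
auxiliary index, so lookups ignore editorial length marks only.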

Please let me know if you need any more information.

All the best,
Chris

On Wed, Apr 22, 2015 at 10:24 PM, Christian Grün
<christian.gr...@gmail.com> wrote:
> Chris,
>
> Thanks for your feedback.
>
> Yes, I see that there is a lot of demand for a more customizable
> full-text index. Did you already try to build some additional index
> databases, based on the rules you were listing here? It's not as
> comfortable as a tightly coupled full-text index, but the more use
> cases I hear of, the more I wonder whether we could manage to
> satisfy everyone's needs at all.
>
> Cheers,
> Christian
>
>
>> On Wed, Apr 22, 2015 at 11:20 PM, Chris Yocum <cyo...@gmail.com> wrote:
>>
>> Hi,
>>
>> I just want to say that for the dictionary that I used BaseX for,
>> having a multi-lingual full-text index would have been very nice.
>> Barring that, a partial index based on certain rules the user
>> supplies would also have been nice.  For instance, being able to
>> distinguish between á, a, and ā in a word.  In early Irish textual
>> criticism, length marks are often added by text editors with a
>> macron to denote a long vowel that has been identified by the editor
>> but is not in the original text.  Being able to say "build an index
>> with á and a but not ā" would be helpful.
>>
>> I would suggest, as a first pass, building the index by using
>> xml:lang attributes to determine which stemmer to use, etc.  If the
>> document supplies them, you could use them to build the indices
>> differently.
>>
>> All the best,
>> Chris
>>
>> On Wed, Apr 22, 2015 at 11:35:48AM +0200, Goetz Heller wrote:
>>> Here's another addendum: even if multi-language full-text indexing is not
>>> going to be implemented in the near future, it would still be a useful
>>> feature to be able to restrict full-text indexing to parts of a document,
>>> e.g.
>>>
>>> CREATE FULL-TEXT INDEX ON DATABASE XY STARTING WITH (
>>>       (path_a)/PART_A,
>>>       (path_b)/PART_B,…
>>> )
>>>
>>> Kind regards,
>>>
>>> Goetz
>>>
>>> -----Original Message-----
>>> From: Christian Grün [mailto:christian.gr...@gmail.com]
>>> Sent: Wednesday, 22 April 2015 11:03
>>> To: Goetz Heller
>>> Cc: BaseX
>>> Subject: Re: [basex-talk] multi-language full-text indexing
>>>
>>> > It is desirable to have
>>> > documents indexed by locale-specific parts, e.g.
>>>
>>> I can see that this would absolutely make sense, but it would take quite
>>> some effort to realize it. There are also various conceptual issues related
>>> to XQuery Full Text: if you don't specify the language in the query, we'd
>>> need to dynamically decide which stemmers to use for the query strings,
>>> depending on the nodes that are currently targeted.
>>> This would pretty much blow up the existing architecture.
>>>
>>> As there are so many other types of index structures that could be helpful,
>>> depending on the particular use case, we usually recommend that users create
>>> additional BaseX databases, which can then serve as indexes. This can all
>>> be done in XQuery. I remember there have been various examples of this on
>>> this mailing list (see e.g. [1,2]).
>>>
>>> Hope this helps,
>>> Christian
>>>
>>> [1] 
>>> https://www.mail-archive.com/basex-talk@mailman.uni-konstanz.de/msg04837.html
>>> [2] 
>>> https://www.mail-archive.com/basex-talk@mailman.uni-konstanz.de/msg06089.html
>>>
>>>
>>>
>>>
>>> >
>>> >
>>> >
>>> > CREATE FULL-TEXT INDEX ON DATABASE XY STARTING WITH (
>>> >
>>> > (path_a)/LOCALIZED_PART_A[@LANG=$lang],
>>> >
>>> > (path_b)/LOCALIZED_PART_B[@LG=$lang],…
>>> >
>>> > ) FOR LANGUAGE $lang IN (
>>> >
>>> > BG,
>>> >
>>> > DN,
>>> >
>>> > DE WITH STOPWORDS filepath_de WITH STEM = YES,
>>> >
>>> > EN WITH STOPWORDS filepath_en,
>>> >
>>> > FR, …
>>> >
>>> > )  [USING language_code_map]
>>> >
>>> > and then to write full-text retrieval queries with a clause such as
>>> > ‘FOR LANGUAGE BG’, for example. The index parts would be much smaller
>>> > and full-text retrieval therefore much faster. The language codes
>>> > would be mapped somehow to standard values recognized by BaseX in the
>>> > language_code_map file.
>>> >
>>> > Are there any efforts towards such a feature?
>>>
