Github user xristy commented on the issue:
https://github.com/apache/jena/pull/246
@osma I'm happy to give some background. We are developing a multilingual
cultural heritage system to handle, in particular, Tibetan, Pali, and Sanskrit,
none of which is currently handled by the various Lucene analyzers. There will
also be instances of Chinese and other languages that are well covered by
the analyzers bundled with Lucene but that are not completely exposed via the
current jena-text facility.
We currently have a [Tibetan-centered system](https://www.tbrc.org) (the
largest repository of scanned and input Tibetan texts in the world, with
accompanying cultural contextual metadata) built on
[eXist-db](http://exist-db.org) and are migrating to a combination of couchdb
and jena. eXist-db has a [Lucene
integration](http://exist-db.org/exist/apps/doc/lucene.xml) and implements an
analyzer extension feature similar to that which this PR implements for
jena-text. Our current development seeks to extend the initial Tibetan
analyzers to deal with a variety of issues that are peculiar to Tibetan, and to
include appropriate analyzers for the additional languages that will be
covered. It was thus natural for us to consider recapitulating an extension
feature in the context of jena-text.
One point of interest is that much of the Tibetan data is in a variety of
transliteration schemes as well as native Unicode. The same is true of Pali,
for which there is really no native script but rather a historical
succession of transliterations in scripts such as Thai, Burmese and so on. We
currently tag various bits of content with BCP 47-conforming tags that reflect
the transliteration scheme or script in use. This affects which analyzer is
applied. Further, there are differences in how we expect to treat title and name
fields versus running text such as from colophons and so on.
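To make the tagging point concrete, here is a rough sketch of how a BCP 47 tag
might drive analyzer selection, falling back to progressively shorter tags when
there is no exact match. It is not code from this PR or from jena-text: the
class name, the analyzer names and the private-use tag `bo-x-ewts` are all
hypothetical placeholders, and only the Java standard library is used.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; not part of jena-text or this PR.
final class AnalyzerLookup {
    // Assumed registry: lower-cased BCP 47 tag -> analyzer name (names are made up).
    private static final Map<String, String> REGISTRY = new HashMap<>();
    static {
        REGISTRY.put("bo", "TibetanUnicodeAnalyzer");
        REGISTRY.put("bo-x-ewts", "TibetanEwtsAnalyzer");   // transliterated Tibetan
        REGISTRY.put("pi-thai", "PaliThaiScriptAnalyzer");  // Pali in Thai script
        REGISTRY.put("zh", "ChineseAnalyzer");
    }
    static final String DEFAULT = "StandardAnalyzer";

    // Try the full tag first, then drop trailing subtags one at a time.
    static String pick(String tag) {
        String t = tag.toLowerCase(Locale.ROOT);
        while (!t.isEmpty()) {
            String analyzer = REGISTRY.get(t);
            if (analyzer != null) {
                return analyzer;
            }
            int cut = t.lastIndexOf('-');
            if (cut < 0) {
                break; // nothing left to truncate
            }
            t = t.substring(0, cut);
            // A dangling single-character subtag such as "-x" is dropped as well.
            if (t.length() >= 2 && t.charAt(t.length() - 2) == '-') {
                t = t.substring(0, t.length() - 2);
            }
        }
        return DEFAULT;
    }

    public static void main(String[] args) {
        for (String tag : new String[] {"bo", "bo-x-ewts", "bo-x-dts", "pi-Thai", "sa-Deva"}) {
            System.out.println(tag + " -> " + pick(tag));
        }
    }
}
```

A distinction between title fields and running text could be layered on top by
keying the registry on a (field, tag) pair instead of the tag alone.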
So we want to minimize the intrusion on jena (and couchdb-lucene, to
which we have contributed a [similar extension
feature](https://github.com/rnewson/couchdb-lucene/blob/master/CONFIGURING_ANALYZERS.md))
as we develop and test a variety of analyzers for various needs.
We naturally want to be able to benefit from future developments of jena
without incurring an ongoing maintenance cost of retrofitting our
extension feature or, without such a feature, having to update and redeploy jena
each time we develop or change an analyzer.
The extension feature offered in this PR seeks to decouple the use of jena
from the development and configuration of analyzers. Further, having such a
facility should make it easier for others to incorporate full-text search in
their jena applications without needing to modify jena.
If PR #245 is adopted, that leaves jena-text/Lucene as the default full-text
support, which should make the case for effective configuration-based extension
more compelling.
I agree that the [current
documentation](https://jena.apache.org/documentation/query/text-query.html) for
jena-text is somewhat difficult to follow in places, and it seems to me that
some improvements there would be most helpful. I included updates to the current
documentation to cover the extension, and if there is interest I could certainly
lend a hand to revise and extend the overall documentation to clarify the
configuration and use of jena-text.
You mention the MultilingualAnalyzer by Alexis Miara, and that is in fact
one of the attractive features of jena that we are most eager to leverage. RDF,
and jena in particular via the MultilingualAnalyzer, thoroughly embrace multiple
languages in metadata. This PR provides a natural extension of the
MultilingualAnalyzer to include new languages as well as to override the default
support provided. To be clear, the eXist-db+Lucene integration provides no
equivalent to the RDF convention of carrying language-tagged strings throughout
the framework, and consequently it is difficult to handle multilingual data there,
since doing so requires per-language indexing of a field.
I don't know whether I've dug a deeper hole or made the case plainer. I
hope the latter. I realize that this PR and the associated
[JENA-1326](https://issues.apache.org/jira/browse/JENA-1326) popped up out of
nowhere, and I've tried to follow the [guidelines for
contributing](https://github.com/apache/jena/blob/master/CONTRIBUTING.md). I
understand that there may be issues surrounding 3.3.0 and ES and such, so I
hoped to at least get this work a timely hearing in that regard. I think this
PR neither conflicts nor competes with jena-text-es, but provides a
useful extension to Lucene-based jena-text.