[GitHub] jena issue #246: Generic text analyzers

xristy Tue, 25 Apr 2017 08:21:00 -0700

Github user xristy commented on the issue:

    https://github.com/apache/jena/pull/246
  
    @osma I'm happy to give some background. We are developing a multilingual 
cultural heritage system that must handle Tibetan, Pali, and Sanskrit in 
particular, none of which is currently covered by the various Lucene analyzers. 
There will also be instances of Chinese and other languages that are well 
covered by the analyzers bundled with Lucene but that are not fully exposed via 
the current jena-text facility.
    
    We currently have a [Tibetan-centered system](https://www.tbrc.org) (the 
largest repository of scanned and input Tibetan texts in the world, with 
accompanying cultural contextual metadata) built on 
[eXist-db](http://exist-db.org) and are migrating to a combination of couchdb 
and jena. eXist-db has a [Lucene 
integration](http://exist-db.org/exist/apps/doc/lucene.xml) and implements an 
analyzer extension feature similar to the one this PR implements for 
jena-text. Our current development seeks to extend the initial Tibetan 
analyzers to deal with a variety of issues that are peculiar to Tibetan, and to 
include appropriate analyzers for the additional languages that will be 
covered. It was thus natural for us to consider reproducing that extension 
feature in the context of jena-text.
    
    One point of interest is that much of the Tibetan data is in a variety of 
transliteration schemes as well as native Unicode. The same is true of Pali, 
for which there is really no native script but rather a historical succession 
of transliterations in scripts such as Thai, Burmese and so on. We currently 
tag various bits of content with BCP-47 conforming tags that reflect the 
transliteration and script, which in turn affects how analyzers are used. 
Further, we expect to treat title and name fields differently from running 
text such as colophons.
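    
    To illustrate how a BCP-47 tag might steer analyzer selection, here is a 
minimal Java sketch. It is not the mechanism implemented in this PR or in 
jena-text. The class `AnalyzerLookup`, the analyzer names, and the example tags 
(`bo-x-ewts`, `pi-Thai`) are hypothetical and chosen only for illustration. The 
sketch picks the most specific registered tag and otherwise falls back by 
dropping trailing subtags, in the spirit of RFC 4647 lookup.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: maps BCP-47 language tags to analyzer names.
// Nothing here is part of the jena-text or Lucene APIs.
final class AnalyzerLookup {
    private final Map<String, String> registry = new HashMap<>();
    private final String fallback;

    AnalyzerLookup(String fallback) {
        this.fallback = fallback;
    }

    // Register an analyzer name for a tag; tags are compared case-insensitively.
    AnalyzerLookup register(String tag, String analyzerName) {
        registry.put(tag.toLowerCase(Locale.ROOT), analyzerName);
        return this;
    }

    // Try the full tag first, then progressively shorter prefixes.
    String pick(String tag) {
        String t = tag.toLowerCase(Locale.ROOT);
        while (!t.isEmpty()) {
            String hit = registry.get(t);
            if (hit != null) {
                return hit;
            }
            int cut = t.lastIndexOf('-');
            if (cut < 0) {
                break;
            }
            t = t.substring(0, cut);
            // Also drop a dangling single-character subtag such as "-x".
            if (t.length() >= 2 && t.charAt(t.length() - 2) == '-') {
                t = t.substring(0, t.length() - 2);
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        AnalyzerLookup lookup = new AnalyzerLookup("standard")
                .register("bo", "tibetan-unicode")    // assumed name
                .register("bo-x-ewts", "tibetan-ewts") // assumed name
                .register("pi-Thai", "pali-thai");     // assumed name

        for (String tag : new String[] {"bo-x-ewts", "bo-Tibt", "pi-Thai", "pi-Mymr"}) {
            System.out.println(tag + " -> " + lookup.pick(tag));
        }
    }
}
```

    A field-specific variant (titles and names versus running text) could key 
the registry on a pair of field name and tag instead of the tag alone.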
    
    So we want to minimize the intrusion on jena (and couchdb-lucene, for 
which we have contributed a [similar extension 
feature](https://github.com/rnewson/couchdb-lucene/blob/master/CONFIGURING_ANALYZERS.md)) 
as we develop and test a variety of analyzers for various needs.
    
    We naturally want to be able to benefit from future developments of jena 
without incurring an ongoing maintenance cost of retrofitting our extension 
feature or, without such a feature, having to update and redeploy jena each 
time we develop or change an analyzer.
    
    The extension feature offered in this PR seeks to decouple the use of jena 
from the development and configuration of analyzers. Further, having such a 
facility should make it easier for others to incorporate full text in their 
jena applications without needing to modify jena.
    
    If PR #245 is adopted, that leaves jena-text/Lucene as the default full-text 
support, which should make the case for effective configuration-based extension 
more compelling.
    
    I agree that the [current 
documentation](https://jena.apache.org/documentation/query/text-query.html) for 
jena-text is somewhat difficult to follow in places, and it seems to me that 
some improvements there would be most helpful. I included updates to the 
current documentation to cover the extension, and if there is interest I could 
certainly lend a hand to revise and extend the overall documentation to clarify 
the configuration and use of jena-text.
    
    You mention the MultilingualAnalyzer by Alexis Miara, and that is in fact 
one of the attractive features of jena that we are most eager to leverage. RDF, 
and jena in particular via the MultilingualAnalyzer, thoroughly embrace multiple 
languages in metadata. This PR provides a natural extension of the 
MultilingualAnalyzer to include new languages as well as to override the default 
support provided. To be clear, the eXist-db+Lucene integration provides no 
equivalent to the RDF convention of carrying language-tagged strings throughout 
the framework, and consequently it is difficult to handle multilingual data 
there, since doing so implies per-language indexing of a field.
    
    I don't know whether I've dug a deeper hole or made the case plainer. I 
hope the latter. I realize that this PR and the associated 
[JENA-1326](https://issues.apache.org/jira/browse/JENA-1326) popped up out of 
nowhere and I've tried to follow the [guidelines for 
contributing](https://github.com/apache/jena/blob/master/CONTRIBUTING.md). I 
understand that there may be issues surrounding 3.3.0 and ES and such so I 
hoped to at least get this work a timely hearing in that regard. I think this 
PR is not in conflict or competition with jena-text-es, but provides a 
useful extension to Lucene-based jena-text.



