Code Ferret created JENA-1556:
---------------------------------
Summary: text:query multilingual enhancements
Key: JENA-1556
URL: https://issues.apache.org/jira/browse/JENA-1556
Project: Apache Jena
Issue Type: New Feature
Components: Text
Affects Versions: Jena 3.7.0
Reporter: Code Ferret
Assignee: Code Ferret
This issue proposes two related enhancements of Jena Text. These enhancements
have been implemented and a PR can be issued.
There are two multilingual search situations that we want to support:
# We want to be able to search in one encoding and retrieve results that may
have been entered in other encodings. For example, searching via Simplified
Chinese (Hans) and retrieving results that may have been entered in Traditional
Chinese (Hant) or Pinyin. This simplifies applications by permitting
encoding-independent retrieval without additional layers of transcoding and so
on; it is all done under the covers in Lucene.
# We want to search with queries entered in a lossy encoding, e.g., a phonetic
one, and retrieve results entered with an accurate encoding. For example,
searching via Pinyin without diacritics and retrieving all matching Hans and
Hant triples.
The first situation arises when entering triples that include languages with
multiple encodings that for various reasons are not normalized to a single
encoding. In this situation we want to be able to retrieve appropriate result
sets without regard for the encodings used at the time that the triples were
inserted into the dataset.
There are several such languages of interest in our application: Chinese,
Tibetan, Sanskrit, Japanese and Korean, each with various romanizations and
ideographic variants.
Encodings may not be normalized when inserting triples for a variety of reasons.
A principal one is that the {{rdf:langString}} object often must be entered in
the same encoding in which it occurs in some physical text that is being
catalogued. Another is that metadata may be imported from sources that use
different encoding conventions, and we want to preserve that form.
The second situation arises because we want to provide simple support for
phonetic or other lossy forms of search at the time that triples are indexed
directly in the Lucene system.
To handle the first situation we introduce a {{text}} assembler predicate,
{{text:searchFor}}, that specifies the list of language variants (language tags)
that should be searched whenever a query string with a given encoding (language
tag) is used. For example, the following
{{text:TextIndexLucene/text:defineAnalyzers}} fragment:
{code:java}
[ text:addLang "bo" ;
text:searchFor ( "bo" "bo-x-ewts" "bo-alalc97" ) ;
text:analyzer [
a text:GenericAnalyzer ;
text:class "io.bdrc.lucene.bo.TibetanAnalyzer" ;
text:params (
[ text:paramName "segmentInWords" ;
text:paramValue false ]
[ text:paramName "lemmatize" ;
text:paramValue true ]
[ text:paramName "filterChars" ;
text:paramValue false ]
[ text:paramName "inputMode" ;
text:paramValue "unicode" ]
[ text:paramName "stopFilename" ;
text:paramValue "" ]
)
] ;
]
{code}
indicates that when using a search string such as "རྡོ་རྗེ་སྙིང་"@bo the Lucene
index should also be searched for matches tagged as {{bo-x-ewts}} and
{{bo-alalc97}}.
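From the application side nothing extra is needed; the multi-tag search happens
inside the text index. As a minimal usage sketch (the class, method and
{{dataset}} variable are illustrative, assuming a dataset assembled with the
configuration above):
{code:java}
import org.apache.jena.query.Dataset;
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.ResultSetFormatter;

public class TibetanSearchExample {
    // `dataset` is assumed to be a text-indexed dataset built from the
    // assembler configuration above.
    static void search(Dataset dataset) {
        String qs = String.join("\n",
            "PREFIX text: <http://jena.apache.org/text#>",
            "SELECT ?s ?sc ?lit WHERE {",
            "  (?s ?sc ?lit) text:query (\"རྗེ\"@bo)",
            "}");
        try (QueryExecution qe = QueryExecutionFactory.create(qs, dataset)) {
            // Matches indexed under bo, bo-x-ewts or bo-alalc97 are all returned.
            ResultSetFormatter.out(qe.execSelect());
        }
    }
}
{code}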
This is made possible by a Tibetan {{Analyzer}} that tokenizes strings in all
three encodings into Tibetan Unicode. This is feasible since the {{bo-x-ewts}}
and {{bo-alalc97}} encodings are one-to-one with Unicode Tibetan. Since all
fields with these language tags will have a common set of indexed terms, i.e.,
Tibetan Unicode, it suffices to arrange for the query analyzer to have access
to the language tag for the query string along with the various fields that
need to be considered.
Supposing that the query is:
{code:java}
(?s ?sc ?lit) text:query ("rje"@bo-x-ewts)
{code}
Then the query formed in {{TextIndexLucene}} will be:
{code:java}
label_bo:rje label_bo-x-ewts:rje label_bo-alalc97:rje
{code}
which is translated using a suitable {{Analyzer}},
{{QueryMultilingualAnalyzer}}, via Lucene's {{QueryParser}} to:
{code:java}
+(label_bo:རྗེ label_bo-x-ewts:རྗེ label_bo-alalc97:རྗེ)
{code}
which reflects the underlying Tibetan Unicode term encoding. During
{{IndexSearcher.search}}, all documents that have any of the three fields
indexed with the term "རྗེ" will be returned, even though the values of the
{{label_bo-x-ewts}} and {{label_bo-alalc97}} fields in the returned documents
will be the original "rje".
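The key mechanism is that the analyzer is selected per field, using the
language tag embedded in the field name. A minimal sketch of how such per-field
delegation can be built on Lucene's {{DelegatingAnalyzerWrapper}} follows; the
actual {{QueryMultilingualAnalyzer}} in the PR may differ in detail (the class
name and field-name parsing here are illustrative):
{code:java}
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.DelegatingAnalyzerWrapper;

// Illustrative only: delegate to the analyzer registered for the language tag
// that follows the "_" in field names such as "label_bo-x-ewts".
public class FieldLangDelegatingAnalyzer extends DelegatingAnalyzerWrapper {
    private final Map<String, Analyzer> analyzersByTag;
    private final Analyzer fallback;

    public FieldLangDelegatingAnalyzer(Map<String, Analyzer> analyzersByTag, Analyzer fallback) {
        super(Analyzer.PER_FIELD_REUSE_STRATEGY);
        this.analyzersByTag = analyzersByTag;
        this.fallback = fallback;
    }

    @Override
    protected Analyzer getWrappedAnalyzer(String fieldName) {
        int sep = fieldName.indexOf('_');
        if (sep >= 0) {
            Analyzer a = analyzersByTag.get(fieldName.substring(sep + 1));
            if (a != null) return a;   // e.g. the TibetanAnalyzer for "bo", "bo-x-ewts", "bo-alalc97"
        }
        return fallback;               // un-tagged field or unknown tag
    }
}
{code}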
This support simplifies applications by permitting encoding-independent
retrieval without additional layers of transcoding and so on. It's all done
under the covers in Lucene.
The second situation is handled by adding appropriate fields and indexing,
configured in the {{text:TextIndexLucene/text:defineAnalyzers}}, which again
simplifies applications. For example, the following fragment
{code:java}
[ text:addLang "zh-hans" ;
text:searchFor ( "zh-hans" "zh-hant" ) ;
text:auxIndex ( "zh-aux-han2pinyin" ) ;
text:analyzer [
a text:DefinedAnalyzer ;
text:useAnalyzer :hanzAnalyzer ] ;
]
[ text:addLang "zh-hant" ;
text:searchFor ( "zh-hans" "zh-hant" ) ;
text:auxIndex ( "zh-aux-han2pinyin" ) ;
text:analyzer [
a text:DefinedAnalyzer ;
text:useAnalyzer :hanzAnalyzer ] ;
]
[ text:addLang "zh-latn-pinyin" ;
text:searchFor ( "zh-latn-pinyin" "zh-aux-han2pinyin" ) ;
text:analyzer [
a text:DefinedAnalyzer ;
text:useAnalyzer :pinyin ] ;
]
[ text:addLang "zh-aux-han2pinyin" ;
text:searchFor ( "zh-latn-pinyin" "zh-aux-han2pinyin" ) ;
text:analyzer [
a text:DefinedAnalyzer ;
text:useAnalyzer :pinyin ] ;
text:indexAnalyzer :han2pinyin ;
]
{code}
defines language tags for Traditional, Simplified, Pinyin and an _auxiliary_
tag {{zh-aux-han2pinyin}} associated with an {{Analyzer}}, {{:han2pinyin}}. The
purpose of the auxiliary tag is to define an {{Analyzer}} that will be used
during indexing and to specify a list of tags that should be searched when the
auxiliary tag is used with a query string.
Searching is then done via the multi-encoding support discussed above. In this
example the {{Analyzer}}, {{:han2pinyin}}, tokenizes strings in {{zh-hans}} and
{{zh-hant}} as the corresponding pinyin, so that at search time a pinyin query
will retrieve appropriate triples inserted in Traditional or Simplified
Chinese. Such a query would appear as:
{code}
(?s ?sc ?lit ?g) text:query ("jīng"@zh-aux-han2pinyin)
{code}
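The {{:han2pinyin}} analyzer itself is supplied by the application and is not
shown here. Purely as an illustration of the idea, a Han-to-pinyin mapping
could be implemented as a Lucene {{TokenFilter}} along the following lines (a
real analyzer must also deal with characters that have several readings, hence
the one-to-many remark below):
{code:java}
import java.io.IOException;
import java.util.Map;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Illustrative only: replace each Han token with a pinyin reading taken from a
// lookup table, so that Han and pinyin text index to the same terms.
public final class Han2PinyinFilter extends TokenFilter {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final Map<String, String> readings;   // e.g. "京" -> "jing"

    public Han2PinyinFilter(TokenStream input, Map<String, String> readings) {
        super(input);
        this.readings = readings;
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        String pinyin = readings.get(termAtt.toString());
        if (pinyin != null) {
            termAtt.setEmpty().append(pinyin);    // emit the pinyin form as the indexed term
        }
        return true;
    }
}
{code}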
The auxiliary field support is needed to accommodate encodings such as pinyin
or Soundex, which are not exact, i.e., one-to-many rather than one-to-one as in
the case of Simplified and Traditional.
{{TextIndexLucene}} adds a field for each of the auxiliary tags associated with
the tag of the triple object being indexed. These fields are in addition to the
un-tagged field and the field tagged with the language of the triple object
literal.
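Schematically, for a triple object such as "京"@zh-hans the document built by
{{TextIndexLucene}} would then carry fields along the following lines (a rough
sketch; the field names follow the {{<field>_<tag>}} pattern used above and the
store settings are purely illustrative):
{code:java}
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;

public class AuxFieldSketch {
    // Rough illustration of the fields indexed for the literal "京"@zh-hans
    // under the configuration above.
    static Document docFor(String value) {   // value = "京"
        Document doc = new Document();
        doc.add(new TextField("label", value, Field.Store.YES));                  // un-tagged field
        doc.add(new TextField("label_zh-hans", value, Field.Store.YES));          // tagged with the literal's language
        doc.add(new TextField("label_zh-aux-han2pinyin", value, Field.Store.NO)); // auxiliary field, analyzed by :han2pinyin
        return doc;
    }
}
{code}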