[ 
https://issues.apache.org/jira/browse/JENA-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990131#comment-14990131
 ] 

ASF GitHub Bot commented on JENA-1062:
--------------------------------------

GitHub user osma opened a pull request:

    https://github.com/apache/jena/pull/97

    JENA-1062: configurable Lucene analyzer for jena-text

    This is a configurable Analyzer implementation for jena-text / Lucene. It
    is similar to what can be achieved in
    [Solr configuration](https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters).
    The current implementation supports only a few basic tokenizers and filters
    included in Lucene. More can be added later if necessary, though some
    tokenizers and filters require extra configuration parameters, and there is
    currently no mechanism for specifying these.
    
    Tokenizers:
    * StandardTokenizer
    * KeywordTokenizer
    * WhitespaceTokenizer
    * LetterTokenizer
    
    Filters:
    * StandardFilter
    * LowerCaseFilter
    * ASCIIFoldingFilter
    
    Configuration can be done in the assembler like this:
    ```
    text:analyzer [
        a text:ConfigurableAnalyzer ;
        text:tokenizer text:KeywordTokenizer ;
        text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
    ]
    ```
    
    When used directly from Java code, the ConfigurableAnalyzer accepts one
    String parameter specifying the Tokenizer name and a second parameter, a
    List<String>, specifying the (optional) filters in the order they should
    be applied.
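
To illustrate the "one tokenizer name plus an ordered list of filter names" idea described above, here is a minimal plain-Java sketch. It does not use Lucene or jena-text at all: the class `AnalyzerPipelineSketch` and its methods are hypothetical, the tokenizers are simplified stand-ins, and the ASCII folding is only approximated by stripping combining marks. It is meant to show how the order of the filter list determines the result, not how the PR implements it.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Plain-Java illustration only; this is NOT the jena-text ConfigurableAnalyzer. */
final class AnalyzerPipelineSketch {

    /** Splits the input according to a tokenizer name (simplified behaviour). */
    static List<String> tokenize(String tokenizer, String input) {
        String[] parts;
        switch (tokenizer) {
            case "KeywordTokenizer":    // the whole input becomes one token
                parts = new String[] { input };
                break;
            case "WhitespaceTokenizer": // split on runs of whitespace
                parts = input.split("\\s+");
                break;
            case "LetterTokenizer":     // split on runs of non-letter characters
                parts = input.split("[^\\p{L}]+");
                break;
            default:
                throw new IllegalArgumentException("Unknown tokenizer: " + tokenizer);
        }
        List<String> tokens = new ArrayList<>();
        for (String part : parts) {
            if (!part.isEmpty()) {      // drop empty strings produced by split
                tokens.add(part);
            }
        }
        return tokens;
    }

    /** Applies one named filter to a single token. */
    static String applyFilter(String filter, String token) {
        switch (filter) {
            case "LowerCaseFilter":
                return token.toLowerCase(Locale.ROOT);
            case "ASCIIFoldingFilter":
                // Approximation (assumption): decompose, then drop combining marks.
                return Normalizer.normalize(token, Normalizer.Form.NFD)
                                 .replaceAll("\\p{M}+", "");
            default:
                throw new IllegalArgumentException("Unknown filter: " + filter);
        }
    }

    /** Tokenizes the input, then runs every token through the filters in list order. */
    static List<String> analyze(String tokenizer, List<String> filters, String input) {
        List<String> result = new ArrayList<>();
        for (String token : tokenize(tokenizer, input)) {
            for (String filter : filters) {
                token = applyFilter(filter, token);
            }
            result.add(token);
        }
        return result;
    }
}
```

With this sketch, calling `analyze("KeywordTokenizer", Arrays.asList("ASCIIFoldingFilter", "LowerCaseFilter"), input)` keeps the whole input as a single folded, lower-cased token, which mirrors the assembler configuration shown earlier; switching to `"WhitespaceTokenizer"` yields one such token per word.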

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/osma/jena jena-text-configurable-analyzer

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/jena/pull/97.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #97
    
----
commit 168165a4d2801734fe2551c3c585ba118327e863
Author: Osma Suominen <[email protected]>
Date:   2015-11-04T18:32:03Z

    JENA-1062: configurable Lucene analyzer for jena-text

----


> add ConfigurableAnalyzer to jena-text
> -------------------------------------
>
>                 Key: JENA-1062
>                 URL: https://issues.apache.org/jira/browse/JENA-1062
>             Project: Apache Jena
>          Issue Type: New Feature
>          Components: Text
>            Reporter: Osma Suominen
>            Assignee: Osma Suominen
>
> This is an alternative to JENA-1058 (which implemented a very specific Lucene 
> Analyzer for jena-text). The idea here, based on a comment by Claude Warren 
> on JENA-1058, is to provide a ConfigurableAnalyzer that can be configured 
> with a Tokenizer and (optionally) one or more TokenFilters, like this:
> text:analyzer [
>   a text:ConfigurableAnalyzer ;
>   text:tokenizer text:KeywordTokenizer ;
>   text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
> ]
> I have some code ready to implement this and will open a PR shortly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
