[
https://issues.apache.org/jira/browse/JENA-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990131#comment-14990131
]
ASF GitHub Bot commented on JENA-1062:
--------------------------------------
GitHub user osma opened a pull request:
https://github.com/apache/jena/pull/97
JENA-1062: configurable Lucene analyzer for jena-text
This is a configurable Analyzer implementation for jena-text / Lucene. It
is similar to what can be achieved in [Solr
configuration](https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters).
The current implementation supports only a few basic tokenizers and filters
included in Lucene. More can be added later if necessary. However, some
tokenizers and filters require extra configuration parameters, and there is
currently no mechanism for specifying these.
Tokenizers:
* StandardTokenizer
* KeywordTokenizer
* WhitespaceTokenizer
* LetterTokenizer
Filters:
* StandardFilter
* LowerCaseFilter
* ASCIIFoldingFilter
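To make the difference between the listed tokenizers concrete, here is a rough plain-Java sketch of how three of them split the same input. It uses no Lucene classes; `TokenizerSketch` and its methods are hypothetical names for illustration, and the behaviour is only an approximation of the real Lucene tokenizers. StandardTokenizer is left out because its Unicode word-segmentation rules are not easily reproduced in a few lines.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only: not part of jena-text or Lucene.
public class TokenizerSketch {

    // Approximates KeywordTokenizer: the whole input is a single token.
    public static List<String> keyword(String input) {
        return Collections.singletonList(input);
    }

    // Approximates WhitespaceTokenizer: split at whitespace,
    // leaving punctuation attached to the tokens.
    public static List<String> whitespace(String input) {
        List<String> tokens = new ArrayList<>();
        for (String t : input.split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Approximates LetterTokenizer: tokens are maximal runs of letters,
    // so digits and punctuation act as separators.
    public static List<String> letter(String input) {
        List<String> tokens = new ArrayList<>();
        for (String t : input.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    public static void main(String[] args) {
        String sample = "Hello-World, 2 ways";
        System.out.println(keyword(sample));    // one token: the full string
        System.out.println(whitespace(sample)); // tokens: "Hello-World,", "2", "ways"
        System.out.println(letter(sample));     // tokens: "Hello", "World", "ways"
    }
}
```

The KeywordTokenizer behaviour is the one that matters for exact-match style indexing, since the literal is never broken into words.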
Configuration can be done in the assembler like this:
```
text:analyzer [
    a text:ConfigurableAnalyzer ;
    text:tokenizer text:KeywordTokenizer ;
    text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
]
```
When used directly from Java code, the ConfigurableAnalyzer accepts two
parameters: a String naming the Tokenizer, and a List<String> naming the
(optional) filters in the order they should be applied.
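The "tokenizer name plus ordered filter list" idea can be sketched in plain Java as follows. This is not the jena-text ConfigurableAnalyzer and it uses no Lucene API: `AnalyzerChainSketch` and `analyze` are hypothetical names, only two tokenizers and two filters are handled, and the ASCII folding shown here (strip combining marks after decomposition) is a simplification of what Lucene's ASCIIFoldingFilter does.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

// Hypothetical stand-in for illustration; not the jena-text class.
public class AnalyzerChainSketch {
    private final String tokenizer;
    private final List<String> filters;

    // Same parameter shape as described above: one tokenizer name,
    // then filter names in the order they should be applied.
    public AnalyzerChainSketch(String tokenizer, List<String> filters) {
        this.tokenizer = tokenizer;
        this.filters = new ArrayList<>(filters);
    }

    public List<String> analyze(String input) {
        List<String> tokens = tokenize(input);
        for (String name : filters) {
            tokens.replaceAll(filterFor(name)); // apply each filter in order
        }
        return tokens;
    }

    private List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        switch (tokenizer) {
            case "KeywordTokenizer":      // whole input as one token
                tokens.add(input);
                break;
            case "WhitespaceTokenizer":   // split at whitespace
                for (String t : input.split("\\s+")) {
                    if (!t.isEmpty()) tokens.add(t);
                }
                break;
            default:
                throw new IllegalArgumentException("Unknown tokenizer: " + tokenizer);
        }
        return tokens;
    }

    private static UnaryOperator<String> filterFor(String name) {
        switch (name) {
            case "LowerCaseFilter":
                return s -> s.toLowerCase(Locale.ROOT);
            case "ASCIIFoldingFilter":
                // Simplified folding: decompose, then drop combining marks.
                return s -> Normalizer.normalize(s, Normalizer.Form.NFD)
                                      .replaceAll("\\p{M}+", "");
            default:
                throw new IllegalArgumentException("Unknown filter: " + name);
        }
    }

    public static void main(String[] args) {
        AnalyzerChainSketch analyzer = new AnalyzerChainSketch(
                "KeywordTokenizer",
                Arrays.asList("ASCIIFoldingFilter", "LowerCaseFilter"));
        // "Cr\u00e8me Br\u00fbl\u00e9e" is the accented form of "Creme Brulee".
        System.out.println(analyzer.analyze("Cr\u00e8me Br\u00fbl\u00e9e"));
    }
}
```

With the KeywordTokenizer and both filters, the accented, mixed-case input is reduced to a single unaccented lower-case token, which is the kind of normalization the assembler example above configures.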
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/osma/jena jena-text-configurable-analyzer
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/jena/pull/97.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #97
----
commit 168165a4d2801734fe2551c3c585ba118327e863
Author: Osma Suominen
Date: 2015-11-04T18:32:03Z
JENA-1062: configurable Lucene analyzer for jena-text
----
> add ConfigurableAnalyzer to jena-text
> -------------------------------------
>
> Key: JENA-1062
> URL: https://issues.apache.org/jira/browse/JENA-1062
> Project: Apache Jena
> Issue Type: New Feature
> Components: Text
> Reporter: Osma Suominen
> Assignee: Osma Suominen
>
> This is an alternative to JENA-1058 (which implemented a very specific Lucene
> Analyzer for jena-text). The idea here, based on a comment by Claude Warren
> on JENA-1058, is to provide a ConfigurableAnalyzer that can be configured
> with a Tokenizer and (optionally) one or more TokenFilters, like this:
> text:analyzer [
>     a text:ConfigurableAnalyzer ;
>     text:tokenizer text:KeywordTokenizer ;
>     text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
> ]
> I have some code ready to implement this and will open a PR shortly.