Erik,
Thanks for the comments.
> I'm particularly interested in the XPath stuff I saw in LGQueryParser.
* xpathFieldParse
'xpath' parser: param allfields[], with query or field[] possibly
having wild-card notation: *.start annotation.*.text
allowing '/' and '.' field separator
This is an *unfinished* attempt to support xpath style queries with
wild-cards or parts when you have indexed XML data, such as
query: /annotation/*/text:term
I had to put this aside when I saw the problem of pulling the xpath
fields from a query string would take a fair amount of thought and code.
> > BioDataAnalyzer.java -- NumberField formats field for indexing
>
> *whew* - that is one complex piece of code. I like the DebugFilter
Mostly it is just a collection of small 10 line classes, packaged
as inner classes (I hate java's insistence on 1 file/class :)
Some of the complexity there is because the standard lucene analyzer
won't work for biology data (which uses a lot of symbols, upper/lowercase,
etc.) and this code allows one to build an analyzer/indexer
which is tuned to different types in each field of data.
The configuration for a given biology database parsing includes
statements like:
## field tokenizers - base CharTokenizer, work before Filters
tokenizer.SYM=org.eugenes.index.BiodataAnalyzer$DataTokenizer
## field filters - base TokenFilter, only are used if fieldtype=Text or UnStored
tokenfilter.BLOC.start=org.eugenes.index.BiodataAnalyzer$NumberFilter
## fieldrecoder classes manipulate data before indexing, maybe making new fields
fieldrecoder.BLOC=LucegeneIndexers$Location_FieldRecoder
This method then generates TokenStream using such field-specific parsers,
public TokenStream tokenStream( String fieldName, Reader reader) {
TokenStream result = null;
try { result= getTokenizer(fieldName, reader); }
catch (Exception e) {
result = new org.apache.lucene.analysis.standard.StandardTokenizer(reader);
}
try { result= getFilter(fieldName, result); }
catch (Exception e) {
LowerDataFilter ldf= new LowerDataFilter();
ldf.setInput(result); result= ldf;
}
return result;
}
-- Don Gilbert
-- d.gilbert--bioinformatics--indiana-u--bloomington-in-47405
-- [EMAIL PROTECTED]://marmot.bio.indiana.edu/
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]