On 1/31/11 10:25 PM, Grant Ingersoll wrote:
Yep. Tom, Drew and I have a lot of it working already (sentence detection, NER, others). I think POS tagging will be useful too.
Sounds good. I can help out if you need to port things to 1.5 APIs.
On a related note, one of the things we talked about is whether there is any interest in patches that would make it easier to use Lucene's TokenStream (especially the new AttributeSource API) instead of reinventing the wheel here.
Not sure what this is about, can you point us to a link or tell us more? Or should we wait until the first code is released, so we can look at it?
Plus, especially on Lucene trunk, everything works off bytes instead of chars, so things are a lot faster.
I hear this one a lot, but truthfully, optimizing the APIs that get content in would not speed up our implementation; it is the feature generation that is currently slow. I also have practical questions. How do you deal with different encodings then? Do users have to pass everything in as UTF-8, or convert everything to one encoding? How do you implement features like lowercasing a string?

Jörn
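The lowercasing question is worth spelling out: with a byte-oriented API, a naive byte-wise lowercase (the ASCII trick of adding 0x20) would corrupt multi-byte UTF-8 sequences such as 'Ö' (0xC3 0x96), so in practice the bytes have to be decoded to characters first. A minimal JDK-only sketch of that round trip, not taken from either project's code:

```java
import java.nio.charset.StandardCharsets;

public class ByteLowercase {

    // Lowercase a UTF-8 byte sequence by decoding to a String first,
    // lowercasing at the character level, and re-encoding to UTF-8.
    // Note the result may even differ in length from the input, which
    // is another reason a purely byte-level lowercase cannot work.
    static byte[] lowercaseUtf8(byte[] utf8) {
        String decoded = new String(utf8, StandardCharsets.UTF_8);
        return decoded.toLowerCase().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] input = "Jörn".getBytes(StandardCharsets.UTF_8);
        byte[] lower = lowercaseUtf8(input);
        System.out.println(new String(lower, StandardCharsets.UTF_8));
    }
}
```

The same decode step would be needed to normalize inputs arriving in other encodings, which is the crux of the "do users have to pass in UTF-8" question above.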
