On 1/31/11 10:25 PM, Grant Ingersoll wrote:
Yep.  Tom, Drew and I have a lot of it working already (sentence detection, 
NER, others).  I think POS tagging will be useful too.

Sounds good. I can help out if you need to port things to 1.5 APIs.

On a related note, one of the things we talked about is if there is any 
interest in patches that can make it easier to use Lucene's token stream 
(especially the new AttributeSource stuff) instead of re-inventing the wheel 
here.

Not sure what this is about. Can you point us to a link or tell us more? Or should we wait until the first code is released, so we can look at it?
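
For readers following along in the archives: the reference is to Lucene's analysis chain, where a TokenStream is an AttributeSource and consumers read token text, offsets, etc. through attribute objects that are reused across tokens. A minimal consumer sketch, assuming Lucene 3.1-era APIs (the version constant, analyzer choice, and field name are only illustrative):

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.util.Version;

public class TokenStreamExample {
  public static void main(String[] args) throws Exception {
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_31);
    TokenStream stream = analyzer.tokenStream("field",
        new StringReader("Grant and Tom met in New York."));

    // Attributes are requested once from the stream (an AttributeSource)
    // and are updated in place on every incrementToken() call.
    CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
    OffsetAttribute offset = stream.addAttribute(OffsetAttribute.class);

    stream.reset();
    while (stream.incrementToken()) {
      System.out.println(term.toString() + " ["
          + offset.startOffset() + "," + offset.endOffset() + "]");
    }
    stream.end();
    stream.close();
  }
}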

  Plus, especially with Lucene trunk, things all work off of bytes instead of 
chars, so they are a lot faster.
I hear this one a lot, but truthfully, optimizing the APIs that get content in
does not speed up our implementation; in particular, the feature generation
is currently slow.
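
To make the point concrete, here is a hypothetical sketch of the kind of per-token string features an NER model typically builds; the class and method names are illustrative and not OpenNLP's actual API. Each token produces several freshly built strings, which is where the time goes regardless of how the raw text was handed in:

import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class TokenFeatures {

  // Illustrative per-token feature generation for a token at position index.
  public static List<String> createFeatures(String[] tokens, int index) {
    String token = tokens[index];
    List<String> features = new ArrayList<String>();

    features.add("w=" + token.toLowerCase(Locale.ENGLISH));
    features.add("prev=" + (index > 0 ? tokens[index - 1] : "*BOS*"));
    features.add("next=" + (index < tokens.length - 1 ? tokens[index + 1] : "*EOS*"));

    // Prefix and suffix features up to length 4.
    for (int i = 1; i <= 4 && i <= token.length(); i++) {
      features.add("pre=" + token.substring(0, i));
      features.add("suf=" + token.substring(token.length() - i));
    }

    // Simple word-shape feature, e.g. "Xxxxx" for "Grant".
    features.add("shape=" + token.replaceAll("[A-Z]", "X")
        .replaceAll("[a-z]", "x").replaceAll("[0-9]", "d"));

    return features;
  }
}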

I also have practical questions.
How do you deal with different encodings then? Do users have to pass
in everything as UTF-8, or convert everything to one encoding?
How do you implement features like lowercasing a string?
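
As a concrete version of that last question: even with a byte-oriented API, character-level features such as lowercasing still require decoding with an agreed-upon encoding. A minimal sketch, assuming UTF-8 as that encoding:

import java.nio.charset.Charset;
import java.util.Locale;

public class LowercaseBytes {
  private static final Charset UTF8 = Charset.forName("UTF-8");

  // Lowercasing is a character-level operation, so byte input has to be
  // decoded first; the charset has to be agreed on (UTF-8 is assumed here).
  public static byte[] toLowerCase(byte[] utf8Bytes) {
    String text = new String(utf8Bytes, UTF8);
    return text.toLowerCase(Locale.ENGLISH).getBytes(UTF8);
  }
}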

Jörn
