Hi Folks,

In a few weeks Darshana Chhajed, a summer intern at OSAF, will arrive in San Francisco to work full time on natural language parsing (NLP) in Chandler. I'll be mentoring her when I'm in town, but unfortunately I'll be on vacation for the first three weeks of her 543 Howard tenure. John will assist Darshana while I'm off riding on the backs of camels.

I'm trying to get the ball rolling for Darshana before she arrives by helping frame what could usefully be added to Chandler in the realm of NLP. Architecture (and other) suggestions from the list are very welcome in this interim week while I'm still in the office. Darshana will be listening with half an ear as she takes finals.

Today Mimi, Darshana and I had a short kick-off meeting to talk about which language parsing features are highest priority for Darshana to start with. We discussed three major areas where language parsing seems helpful, listed below in order of priority. I'd like to get the list's feedback on important details for parsing architecture.

1. Parsing of a field as it's edited
====================================

The dominant fields of interest are time, date, and duration. These have a clear time-related context, so Bear's existing work on date parsing using regular expressions seems likely to give useful results here; very little English grammar would need to be interpreted.

In this area, ideally parsing would (quickly!) return a list of weighted possible matches, so an auto-complete drop-down could give useful feedback and acceleration options. Examples:

    't'        -> tomorrow
                  Tuesday
                  Thursday
    'next w'   -> next week
                  next Wednesday
    'sat at 5' -> Saturday 5PM
                  Saturday 5AM

Perhaps even the fuzzier:

    'tmrw'     -> tomorrow

2. Parsing of arguments at a Chandler command line
==================================================

A Chandler mini-command line doesn't yet exist, but one is planned, to make quick data entry painless. Examples:

    /event dinner with Tom at Millenium at 7
    /task give Alicia her book back

Here there's somewhat less context than in area 1. The arguments after a command could include information for a variety of fields. Thus, handling English grammar becomes important; regular expressions seem unlikely to reliably parse such examples. Fortunately, toolkits like http://nltk.sourceforge.net/ may be able to help.
They may be dramatically slower than regular expressions; determining relative processing costs should be part of Darshana's work this summer.

3. Parsing of emails and instant messages
=========================================

If a user receives an email with the sentence 'Please join us at Asha December 10 at 7PM', it would be great to offer an option to intelligently stamp the email as an event, with location set to Asha, start time set appropriately, and year inferred from the date the email was sent. Fields might be populated automatically, or perhaps there would be a UI to preview how fields would be populated.

First steps
===========

Area 1) (in-field parsing) would be useful now.

Area 2) (arguments to commands) would be useful in the mid-term, before 1.0 ships.

Area 3) (parsing of incoming streams) might be useful at any time, but speed and UI issues would need to be thought through, so it's not a priority to make this happen before 1.0 ships.

When Darshana arrives in the office, her starting point will be to work with the design team on specific area 1) examples to parse, and to experiment with Bear's date parsing code to see if it can be made to solve those examples.

Architecture
============

We're consciously focusing on US English parsing, but we should create a framework that allows different "parsing resources" to be used based on locale. While different parsing resources (associated with different fields) may have radically different implementations, it seems useful to come up with a common API for them. A first cut at requirements:

- A parsing resource should have a method to take a unicode string and return a list of possible interpretations for the fields it understands, with associated confidences.

- Parsing resources should be registered so the most appropriate parser can be used for a particular combination of understood fields (e.g. start time, duration, location, person).

Does this seem like the right direction to head in?
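
To make the two requirements above a bit more concrete, here is a rough Python sketch of what a common API could look like. It is only an illustration under stated assumptions, not existing Chandler code: the names Interpretation, ParsingResource, DayNameResource and ParserRegistry are hypothetical, the confidence weights are made up, and the toy prefix matcher merely stands in for real date parsing (it is not Bear's code).

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List, Optional


@dataclass
class Interpretation:
    """One candidate reading of the input (hypothetical type)."""
    fields: Dict[str, str]  # field name -> interpreted value
    confidence: float       # illustrative weight in (0, 1]


class ParsingResource:
    """Hypothetical common API that every parsing resource would share."""
    locale = "en_US"
    understood_fields: FrozenSet[str] = frozenset()

    def parse(self, text: str) -> List[Interpretation]:
        raise NotImplementedError


class DayNameResource(ParsingResource):
    """Toy prefix matcher for a date field; a stand-in for real date parsing."""
    understood_fields = frozenset({"date"})
    # Candidates in an assumed priority order.
    candidates = ["tomorrow", "Monday", "Tuesday", "Wednesday",
                  "Thursday", "Friday", "Saturday", "Sunday"]

    def parse(self, text: str) -> List[Interpretation]:
        prefix = text.strip().lower()
        if not prefix:
            return []
        matches = [c for c in self.candidates if c.lower().startswith(prefix)]
        # Made-up weighting: earlier candidates get higher confidence.
        return [Interpretation({"date": c}, 1.0 / (rank + 1))
                for rank, c in enumerate(matches)]


class ParserRegistry:
    """Hypothetical registry that picks a resource by locale and fields."""

    def __init__(self) -> None:
        self._resources: List[ParsingResource] = []

    def register(self, resource: ParsingResource) -> None:
        self._resources.append(resource)

    def best_for(self, fields: Iterable[str],
                 locale: str = "en_US") -> Optional[ParsingResource]:
        """Return the resource covering the most requested fields, or None."""
        wanted = frozenset(fields)
        usable = [r for r in self._resources
                  if r.locale == locale and r.understood_fields & wanted]
        return max(usable,
                   key=lambda r: len(r.understood_fields & wanted),
                   default=None)


registry = ParserRegistry()
registry.register(DayNameResource())
resource = registry.best_for({"date"})
for guess in resource.parse("t"):
    print(guess.fields["date"], round(guess.confidence, 2))
```

With this toy candidate list, typing 't' yields tomorrow, Tuesday and Thursday in that order, which is the kind of weighted list an auto-complete drop-down could consume. A grammar-based resource for area 2 could sit behind the same parse() method, which is the main point of keeping the API this small.
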

Again, discussion during the upcoming week is very welcome (thereafter it's also welcome; I'll just have to read it when I get back from India).

Sincerely,
Jeffrey

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
Open Source Applications Foundation "chandler-dev" mailing list
http://lists.osafoundation.org/mailman/listinfo/chandler-dev
