On Fri, 26 Jul 2013, Mike Hugo wrote:
I'm looking into basic support (text extraction) for MS OneNote. I found this bug https://issues.apache.org/bugzilla/show_bug.cgi?id=50750 that has some sample files attached. Does anyone have any pointers as to where I should get started?
Use POIFSLister to work out whether the files contain a single POIFS/OLE2 stream or several. If there are lots, assume the format is like Outlook (HSMF) and use POIFSDump to look at the parts. If there is just one, use POIFSViewer and the docs to try to work out whether it is a stream of records (e.g. HSSF), nested records (HSLF, DDF), or streams (HWPF).
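Before pointing the POIFS tools at the sample files, it may be worth confirming that they are OLE2 containers at all. The sketch below is only an illustration using the plain JDK, not POI code: it compares the first 8 bytes of a file against the standard OLE2 signature. The class and method names (Ole2Sniffer, hasOle2Magic, isOle2File) are made up for this example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper, not part of POI: checks for the OLE2 header signature.
class Ole2Sniffer {
    // The 8-byte signature at the start of every OLE2 compound document.
    static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, (byte) 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, (byte) 0x1A, (byte) 0xE1
    };

    // True if the buffer starts with the OLE2 signature.
    static boolean hasOle2Magic(byte[] header) {
        if (header == null || header.length < OLE2_MAGIC.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(header, OLE2_MAGIC.length), OLE2_MAGIC);
    }

    // Reads the first 8 bytes of the file and checks them against the signature.
    static boolean isOle2File(Path file) throws IOException {
        byte[] header = new byte[OLE2_MAGIC.length];
        int read = 0;
        try (InputStream in = Files.newInputStream(file)) {
            while (read < header.length) {
                int n = in.read(header, read, header.length - read);
                if (n < 0) {
                    break; // file is shorter than the signature
                }
                read += n;
            }
        }
        return read == header.length && hasOle2Magic(header);
    }

    public static void main(String[] args) throws IOException {
        for (String arg : args) {
            boolean ole2 = isOle2File(Paths.get(arg));
            System.out.println(arg + ": " + (ole2 ? "OLE2" : "not OLE2"));
        }
    }
}
```

Run it with the sample file paths as arguments. If a file does not report as OLE2, the POIFS tools will not be able to open it, and the format will need a different starting point.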
Once you know that, write just enough code to do some basic processing of the file structure. Then add some .dev. tools to print the structure (look at Visio (HDGF), Outlook (HSMF) etc. for an idea of how we've done that). Use your own dev tool to play with the structure more. Finally, flesh out the implementation to cover all the key bits, and write lots of unit tests!
Nick
