I just came across an interesting concept, "numeric synonyms"... I'm looking at the PowerPoint contribution:
http://issues.apache.org/jira/browse/NUTCH-21
However initially I'm using the code in the context of Lucene, not Nutch, so I've changed it slightly.
I have 200 or so PPT files to test it on, and for around 20% of them it says there's no body (i.e. no text). A spot check shows this to be wrong, and sure enough the code gets exceptions, squelches them, has buffer overruns, etc. [but I'm not complaining - I know it's hard to reverse engineer MSFT formats].
PPTConstants.java has these definitions:
public static final int PPT_MASTERSLIDE = 1024;
public static final int PPT_SLIDEPERSISTANT_ATOM = 1011;
public static final int PPT_DRAWINGGROUP_ATOM = 61448;
public static final int PPT_TEXTCHAR_ATOM = 4000;
public static final int PPT_TEXTBYTE_ATOM = 4008;
public static final int PPT_USEREDIT_ATOM = 4085;
So I decided to look for other implementations of PowerPoint parsers, even in other languages - the obvious Google searches didn't work ("powerpoint file format"), and msdn.microsoft.com was of no help, so I decided to search Google for just the numbers above, i.e. "4000 4008 4085".
Now I've used ppthtml from http://chicago.sourceforge.net/ before, but I had an old note that it sometimes goes into an infinite loop, so I try not to use it for indexing - but hey, it does the same work as the Nutch/PPT parser, but Google didn't return it (or its source code) as a match, so how can that be, surely it uses the same constants...
I start reading ppthtml.c and see:
switch (type) {
case 0x0FA0: /* Text String in unicode */
...
case 0x0FA8: /* Text String in ASCII */
...
case 0x0FBA: /* CString - unicode... */
...
And sure enough, the first two hex values there match the decimal values above from PPTConstants.java [the third one is not covered by the Java code but doesn't seem to matter].
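For anyone who wants to check the correspondence without doing the arithmetic by hand, here is a minimal sketch. The class and method names (HexCheck, toDecimal) are made up for illustration and are not part of the Nutch contribution or ppthtml; Integer.decode is the standard JDK call that understands the 0x prefix.

```java
class HexCheck {
    // Hypothetical helper: converts a C-style hex literal such as "0x0FA0" to its int value.
    static int toDecimal(String hexLiteral) {
        return Integer.decode(hexLiteral);
    }

    public static void main(String[] args) {
        // The three case labels quoted from ppthtml.c
        String[] labels = { "0x0FA0", "0x0FA8", "0x0FBA" };
        for (String label : labels) {
            System.out.println(label + " = " + toDecimal(label));
        }
    }
}
```

The first two come out as 4000 and 4008, i.e. PPT_TEXTCHAR_ATOM and PPT_TEXTBYTE_ATOM; the third is 4026, which has no counterpart among the constants listed above.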
So...the point is....is there any prior art or discussions on covering this, so a search for a number can find a match even if the number is represented in other bases?
In Lucene-speak, this means that either when indexing, or parsing the query, the Analyzer expands a number like, say, 4000 to multiple tokens at the same offset:
4000 - decimal, not changed
0x0*FA0 - hex, "0*" for optional leading zeros
00*7640 - leading zero usually means octal
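To make the idea concrete, here is a rough sketch of just the expansion step in plain Java, with no Lucene dependency. NumericSynonyms and expand are names I made up; in a real Analyzer the extra forms would presumably be emitted as tokens at the same position as the original, but that wiring is left out here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class NumericSynonyms {
    // Hypothetical helper: returns the token itself, plus hex and octal
    // spellings when the token is a plain decimal number.
    static List<String> expand(String token) {
        List<String> forms = new ArrayList<>();
        forms.add(token); // decimal, not changed
        // Up to 18 digits always fits in a long; anything else is passed through as-is.
        if (!token.matches("[0-9]{1,18}")) {
            return forms;
        }
        long value = Long.parseLong(token);
        forms.add("0x" + Long.toHexString(value).toUpperCase(Locale.ROOT)); // hex
        forms.add("0" + Long.toOctalString(value));                          // octal
        return forms;
    }

    public static void main(String[] args) {
        System.out.println(expand("4000")); // prints [4000, 0xFA0, 07640]
    }
}
```

This only emits one canonical spelling per base. To match something like 0x0FA0 in a source file, the hex and octal tokens seen at indexing time would also need their optional leading zeros stripped, which this sketch does not attempt.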
Hope this list is a reasonable place for this.
A related question: is the PowerPoint format documented anywhere? For the life of me I couldn't find out where the various constants came from.
thx, Dave
